We propose a compact coprocessor for the AES (encryption, decryption, and key expansion) and the cryptographic hash function ECHO on Virtex-5 and Virtex-6 FPGAs. Our architecture is built around an 8-bit datapath. The Arithmetic and Logic Unit performs a single instruction that allows for implementing AES encryption, AES decryption, AES key expansion, and ECHO at all levels of security. Thanks to a careful organization of AES and ECHO internal states in the register file, we manage to generate all read and write addresses by means of a modulo-16 counter and a modulo-256 counter. A fully autonomous implementation of ECHO and AES on a Virtex-5 FPGA requires 193 slices and a single 36k memory block, and achieves competitive throughputs. Assuming that the security guarantees of ECHO are at least as good as the ones of the SHA-3 finalists BLAKE and Keccak, our results show that ECHO is a better candidate for low-area cryptographic coprocessors. Furthermore, the design strategy described in this work can be applied to combine the AES and the SHA-3 finalist Grøstl.
We propose a compact coprocessor for the AES and the cryptographic hash function ECHO [5] on Virtex-5 and Virtex-6 Field-Programmable Gate Arrays (FPGAs). Our coprocessor implements AES encryption, AES decryption, AES key expansion, and ECHO at all levels of security. This architecture might for instance be extremely valuable for constrained environments such as wireless sensor networks or radio frequency identification technology, where some security protocols mainly rely on cryptographic hash functions (see for example [30]). Several cryptographic protocols combine public-key cryptography (PKC) (e.g. RSA, elliptic curve cryptography (ECC), etc.), hash functions, random number generators, and symmetric encryption/decryption algorithms. Consider for instance the BLS short signature scheme [10]: in order to verify a signature, one has to hash the message and compute two bilinear pairings on an elliptic curve. Each pairing constitutes a time-consuming task: the best coprocessors for embedded systems compute the Tate pairing over 128-bit-security curves in more than 2 ms [1, 17]. Therefore, one has more than 4 ms in order to hash the next message while computing the two bilinear pairings for the current message. In this context, it is also important to keep the amount of hardware resources for the hash function as small as possible (i.e. it is pointless to design a massively parallel coprocessor able to hash a message in far less than 4 ms).
After a short description of the AES (Sect. 2) and the ECHO family of hash functions (Sect. 3), we propose a unified coprocessor built around an 8-bit datapath (Sect. 4). We have prototyped our architecture on Virtex-5 and Virtex-6 FPGAs and discuss our results in Sect. 5.
The Advanced Encryption Standard
The round transformation of the AES operates on a 128-bit intermediate result, called state. The state is internally represented as a 4 × 4 array of bytes $A = (a_{i,j})$, $0 \le i, j \le 3$. Each byte $a_{i,j}$ is considered as an element of $\mathbb{F}_{2^8} \cong \mathbb{F}_2[x]/(m(x))$, where the irreducible polynomial is given by $m(x) = x^8 + x^4 + x^3 + x + 1$. In the following, we encode an element of $\mathbb{F}_{2^8}$ by two hexadecimal digits: for instance, 89 is equivalent to $x^7 + x^3 + 1$ in the polynomial basis representation. We denote the jth column of A by $A_j$. The number of rounds $N_r$ as well as the number of 32-bit blocks in the cipher key $N_k$ of the AES depend on the desired security level (Table 1). The AES involves four byte-oriented transformations and their inverses for encryption and decryption, respectively [13]:
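- The SubBytes step is the only non-linear transformation of the AES. Its role is to introduce confusion to the data so that the relationship between the secret key and the ciphertext is obscured. It updates each byte of the state using an 8-bit S-box, denoted by $S_{RD}$. Internally, the AES S-box computes the modular inverse of $a_{i,j}$ (the value 00 is mapped onto itself) and then applies an affine transformation. The inverse transformation, called InvSubBytes and denoted by $S_{RD}^{-1}$, performs the inverse affine transformation followed by an inversion over $\mathbb{F}_{2^8}$.
- The ShiftRows step cyclically shifts the second, third, and fourth rows of the state to the left by one, two, and three bytes, respectively. The inverse transformation, called InvShiftRows, shifts the same rows to the right.
- The MixColumns step is a permutation operating on the AES state column by column. Together with ShiftRows, this step provides diffusion in the cipher: if a single bit of the plaintext is flipped, then the whole ciphertext should be changed. Each column of the AES state is considered as a polynomial over $\mathbb{F}_{2^8}$, and is multiplied modulo $y^4 + 01$ by the constant polynomial $c(y) = 03 \cdot y^3 + 01 \cdot y^2 + 01 \cdot y + 02$ [13]. This operation is performed by multiplying each column of the state A by a circulant matrix $M_E$:

$$A_j \leftarrow M_E \cdot A_j = \begin{pmatrix} 02 & 03 & 01 & 01 \\ 01 & 02 & 03 & 01 \\ 01 & 01 & 02 & 03 \\ 03 & 01 & 01 & 02 \end{pmatrix} \cdot \begin{pmatrix} a_{0,j} \\ a_{1,j} \\ a_{2,j} \\ a_{3,j} \end{pmatrix},$$

where $0 \le j \le 3$. During the inverse operation, called InvMixColumns, each column of the state is multiplied by $d(y) = 0B \cdot y^3 + 0D \cdot y^2 + 09 \cdot y + 0E$, which is the multiplicative inverse of $c(y)$ modulo $y^4 + 01$:

$$c(y) \cdot d(y) \equiv 01 \pmod{y^4 + 01}.$$

Here again, the modular multiplication by a constant polynomial can be defined by a matrix multiplication:

$$A_j \leftarrow M_D \cdot A_j = \begin{pmatrix} 0E & 0B & 0D & 09 \\ 09 & 0E & 0B & 0D \\ 0D & 09 & 0E & 0B \\ 0B & 0D & 09 & 0E \end{pmatrix} \cdot \begin{pmatrix} a_{0,j} \\ a_{1,j} \\ a_{2,j} \\ a_{3,j} \end{pmatrix}.$$

- The AddRoundKey step combines the state A with a 128-bit round key. Let r denote the round index. Each byte $k_{i,4r+j}$ of the round key and its corresponding byte $a_{i,j}$ are added in $\mathbb{F}_{2^8}$ by a simple bitwise XOR operation. AddRoundKey is therefore its own inverse.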
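The field arithmetic and the two circulant matrices involved in MixColumns and InvMixColumns can be checked numerically. The following Python sketch is illustrative only: the helper names `gf_mul`, `circulant`, and `mat_vec` are hypothetical, the matrix coefficients are the standard AES ones, and the test column is a commonly used worked example rather than data from this work.

```python
AES_MODULUS = 0x11B  # m(x) = x^8 + x^4 + x^3 + x + 1


def gf_mul(a: int, b: int) -> int:
    """Multiply two elements of F_{2^8} (shift-and-add, reduced modulo m(x))."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= AES_MODULUS
        b >>= 1
    return result


def circulant(first_row):
    """4 x 4 circulant matrix: row i is first_row rotated right by i positions."""
    return [[first_row[(k - i) % 4] for k in range(4)] for i in range(4)]


M_E = circulant([0x02, 0x03, 0x01, 0x01])  # MixColumns
M_D = circulant([0x0E, 0x0B, 0x0D, 0x09])  # InvMixColumns


def mat_vec(matrix, column):
    """Multiply a 4 x 4 matrix by a column of four bytes over F_{2^8}."""
    out = []
    for row in matrix:
        acc = 0
        for coeff, byte in zip(row, column):
            acc ^= gf_mul(coeff, byte)
        out.append(acc)
    return out


column = [0xDB, 0x13, 0x53, 0x45]   # illustrative test column
mixed = mat_vec(M_E, column)        # MixColumns
restored = mat_vec(M_D, mixed)      # InvMixColumns recovers the input
```

Applying `M_D` after `M_E` returns the original column, which reflects the fact that the two constant polynomials are inverses of each other modulo $y^4 + 01$.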
Key expansion
The round keys involved in the AddRoundKey steps are derived from the cipher key as follows. Let us consider an array consisting of 4 rows and $4 \cdot (N_r + 1)$ columns. Each column $K_j$ contains four elements of $\mathbb{F}_{2^8}$ denoted by $k_{0,j}$, $k_{1,j}$, $k_{2,j}$, and $k_{3,j}$. The round key of the jth round of the AES encryption algorithm is given by columns $K_{4j}$ to $K_{4j+3}$ (Fig. 1). The cipher key is copied in the first $N_k$ columns of the array, and the next columns are defined recursively. The process, summarized by Algorithms 1 and 2, involves a round constant $RC \in \mathbb{F}_{2^8}$ and a permutation matrix P defining a cyclic rotation of the bytes within a column:

$$P = \begin{pmatrix} 00 & 01 & 00 & 00 \\ 00 & 00 & 01 & 00 \\ 00 & 00 & 00 & 01 \\ 01 & 00 & 00 & 00 \end{pmatrix}.$$
We denote the identity matrix by I. This matrix notation will be useful to pinpoint a unified 8-bit datapath for key expansion, encryption, and decryption in Sect. 4.1.
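As an illustration of the recursion described above, the following Python sketch follows the standard AES key expansion in the column notation of this section. It is a software model under stated assumptions, not the listing of Algorithms 1 and 2: the function names (`gf_mul`, `sub_byte`, `expand_key`) are hypothetical, and the S-box is recomputed from its definition (inversion followed by the affine map) instead of being stored as a table.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two elements of F_{2^8} modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def sub_byte(a: int) -> int:
    """AES S-box: inversion in F_{2^8} (00 maps to 00), then the affine map."""
    inv = 0
    if a:
        inv = 1
        for _ in range(254):            # a^254 is the inverse of a
            inv = gf_mul(inv, a)
    out = 0x63
    for shift in range(5):              # inv XOR its left rotations by 1..4
        out ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
    return out


S_RD = [sub_byte(a) for a in range(256)]


def expand_key(cipher_key: bytes):
    """Return the 4 * (N_r + 1) columns K_j of the round key array."""
    n_k = len(cipher_key) // 4          # 4, 6, or 8
    n_r = n_k + 6
    columns = [list(cipher_key[4 * j:4 * j + 4]) for j in range(n_k)]
    rc = 0x01
    for j in range(n_k, 4 * (n_r + 1)):
        temp = list(columns[j - 1])
        if j % n_k == 0:
            temp = temp[1:] + temp[:1]              # multiplication by P
            temp = [S_RD[b] for b in temp]
            temp[0] ^= rc
            rc = gf_mul(rc, 0x02)                   # RC <- x * RC
        elif n_k > 6 and j % n_k == 4:
            temp = [S_RD[b] for b in temp]          # extra SubBytes, 256-bit keys
        columns.append([p ^ q for p, q in zip(columns[j - n_k], temp)])
    return columns
```

With the 128-bit example key of the AES specification, `expand_key` returns 44 columns; the round key of round j is then formed by columns 4j to 4j + 3.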
[Algorithm 1: AES key expansion. Surviving steps of the listing: RC ← x · RC; 7: else.]
Encryption
After an initial AddRoundKey step, an AES encryption involves $N_r − 1$ repetitions of a round composed of the four byte-oriented transformations described above. Eventually, a final encryption round, in which the MixColumns step is omitted, produces the ciphertext (Algorithm 3). Since ShiftRows and SubBytes commute [13], we obtain the datapath depicted in Fig. 2.
[Algorithm 2: AES key expansion. Surviving steps of the listing: RC ← x · RC; 7: else if j mod N_k = 4 then; 9: else.]
Algorithm 3 updates the AES state column by column. Since the ShiftRows transformation performs cyclic left shifts of the three bottom rows of the state, we have to be careful not to overwrite bytes that are still involved in the forthcoming MixColumns steps ($a_{1,0}$ is for instance needed to update the fourth column of the AES state, and should not be overwritten when updating the first column). Thus, the encryption algorithm requires an internal 4 × 4 array of bytes B.
Decryption
We consider here the equivalent decryption algorithm described in [13, Section 3.7 .3] (Algorithm 4). Its main advantage over the straightforward decryption process is that encryption and decryption rounds share the same datapath (Fig. 2) . Nevertheless, the round keys are introduced in reverse order for decryption. 
[Algorithm 3: AES encryption. Surviving steps of the listing: 7: end for; 8: A ← B; 9: end for; 10: for j = 0 to 3 do; 12: end for; 13: Return B.]
The hash function ECHO
The ECHO family of hash functions [5] is built around the round function of the AES. This design strategy allows one to easily exploit advances in the implementation of the AES, such as the new AES instruction set of Intel Westmere processors [6]. ECHO is a family of four hash functions, namely ECHO-224, ECHO-256, ECHO-384, and ECHO-512 (Table 2). The main differences lie in the length of the chaining variable and in the number of rounds.

[Algorithm 4: AES decryption. Surviving steps of the listing: 9: end for; 10: for j = 0 to 3 do; 12: end for; 13: Return B.]
In this work, we assume that our coprocessor is provided with a padded message M. We refer the reader to [5, Section 2.2] for a description of the padding step. A hardware wrapper interface for ECHO (and several other 
where AESROUND denotes one round of the AES encryption flow. As explained in Sect. 2.2, an internal 4 × 4 array of bytes B ( j) is needed to solve data dependency issues (Algorithm 5, lines 11 and 12). The key schedule for the derivation of the two 128-bit subkeys k 1 and k 2 is much simpler than the one of the AES. k 1 is related to the number of unpadded message bits C i hashed at the end of the current iteration. An internal 64-bit counter κ is initialized with the value of C i , and k 1 is defined as follows:
κ is incremented at the end of each AES round involving $k_1$. If the size of the message exceeds $2^{64} − 1$, one has the flexibility to use a 128-bit counter $C_i$. $k_2$ is equal to the 128-bit salt value that enables ECHO to support randomized hashing.
- The BIG.ShiftRows step is the analogue of the ShiftRows step of the AES. The first row of the internal state is left unchanged. Each 128-bit word of the second, third, and fourth rows is left-rotated by one, two, and three positions, respectively. At the byte level, this transformation is given by:

[Algorithm 5: ECHO compression function. Surviving steps of the listing: 16: for i = 0 to 3 do; 17: for j = 0 to 3 do; 29: end for; 30: else; 31: for i = 0 to 7 do.]
-The BIG.MixColumns step operates on the ECHO state column by column. We build a polynomial over F 2 8 by picking the (i + 4 j)th byte of each AES state in the kth column, and apply to it the MixColumns transformation: 
where 0 ≤ i, j, k ≤ 3. We combine the BIG.ShiftRows and BIG.MixColumns steps (Algorithm 5, line 18), and avoid data dependency issues thanks to intermediate variables
calls to the compression function, the BIG.Final step generates the new value of the chaining variable V i from V i−1 , M i , and the internal state. Note that this step depends on the selected level of security (Algorithm 5, lines 26 to 34).
A compact unified coprocessor for the AES and the ECHO family of hash functions

A unified arithmetic and logic unit
Since our objective is to develop a low-area coprocessor for the AES and the ECHO family of hash functions, it seems natural to consider an 8-bit datapath (Fig. 5) . Above all, note that the ShiftRows and InvShiftRows steps are implemented by accordingly addressing the register file organized into bytes. As a result, these operations are virtually for free and do not require dedicated hardware in the Arithmetic and Logic Unit (ALU). We can now describe key expansion (Algorithms 1 and 2), encryption (Algorithm 3), and decryption (Algorithm 4) using a single instruction:
where
- $R_i$, $R_j$, and $R_k$ are vectors of four bytes;
- f is a function applied to each byte of $R_i$;
- A and B are 4 × 4 matrices of bytes.
The values of these parameters for the different steps of Algorithms 1, 2, 3, and 4 are summarized in Table 3 . The hash function ECHO benefits from the same instruction: the BIG.SubWords consists of AES rounds, and the BIG.MixColumns step involves the circulant matrix M E . Only the key schedule and the BIG.Final step require a small additional amount of hardware.
The SubBytes and InvSubBytes steps
The SubBytes and InvSubBytes steps are often considered as the most critical part of the AES and several architectures for $S_{RD}$ and $S_{RD}^{-1}$ have already been described in the literature (see for instance [20] for a comprehensive bibliography). On Xilinx Virtex-5 and Virtex-6 FPGAs, the best design strategy consists in implementing the AES S-boxes as 8-input tables [12]. Two control bits $\mathrm{ctrl}_{1:0}$ allow us to perform SubBytes, InvSubBytes, or to bypass this stage when f is the identity function.
Matrix multiplication
A quick look at Table 3 indicates that matrix A in Eq. (1) can be any of the four matrices introduced in Sect. 2. Two control bits ctrl 3:2 are therefore necessary to select the desired operation. Since we emphasize reducing the usage of FPGA resources, we adopt the multiply-and-accumulate approach proposed by Hämäläinen et al. [24] , and need 4 clock cycles to multiply one column of the state or the round key array by a 4 × 4 circulant matrix (Fig. 6 ). Let us consider the product M E · A j . We compute a first partial product by multiplying each coefficient of the fixed polynomial 01 + 01 · y + 03 · y 2 + 02 · y 3 by a 0, j , and store the result in registers r 0 , r 1 , r 2 , and r 3 . Then, at each clock cycle, the intermediate result is rotated and accumulated with a new partial product. This process involves a control signal to distinguish between the first step and the subsequent ones. Such a signal can be generated by computing the bitwise OR of the two bits of a modulo-4 counter.
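The multiply-and-accumulate schedule can be modelled in a few lines of Python. This is a behavioural sketch under stated assumptions, not a description of the circuit of Fig. 6: the function name `mac_mix_column` is hypothetical, and the rotation direction (register $r_i$ receives the content of $r_{i+1}$) is our choice, made so that the result ends up in natural order with the fixed coefficients 01, 01, 03, 02.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two elements of F_{2^8} modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


# Coefficients of the fixed polynomial 01 + 01*y + 03*y^2 + 02*y^3,
# feeding registers r0, r1, r2, r3 respectively.
FIXED_COEFFS = (0x01, 0x01, 0x03, 0x02)


def mac_mix_column(column):
    """Compute M_E * column in 4 cycles, consuming one input byte per cycle."""
    r = [0, 0, 0, 0]
    for cycle, byte in enumerate(column):        # cycle is the modulo-4 counter
        partial = [gf_mul(c, byte) for c in FIXED_COEFFS]
        accumulate = ((cycle >> 1) | cycle) & 1  # OR of the two counter bits
        if not accumulate:
            r = partial                          # first step: load partial product
        else:
            rotated = r[1:] + r[:1]              # assumed rotation: r_i <- r_{i+1}
            r = [x ^ p for x, p in zip(rotated, partial)]
    return r


result = mac_mix_column([0xDB, 0x13, 0x53, 0x45])
```

After the fourth cycle, register $r_i$ holds row i of the product $M_E \cdot A_j$, so the four bytes can be forwarded to the next stage without reordering.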
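A standard way to implement the AES consists in taking advantage of the well-known relation between the MixColumns polynomial $c(y)$ and the InvMixColumns polynomial, denoted here by $d(y)$ [13, p. 55]:

$$d(y) = (04 \cdot y^2 + 05) \cdot c(y) \bmod (y^4 + 01).$$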
However, multiplication by 04y 2 + 05 would incur extra clock cycles for decryption (i.e. a different instruction flow for encryption and decryption). In order to keep the instruction memory of our coprocessor as small as possible, it is crucial to use the same code for encryption and decryption. A status register indicates which algorithm is currently executed, and the control unit generates the control bits ctrl 3:0 accordingly. Our algorithm for multiplication by M D is based on the following observation [29] : 
according to the 2 control bits ctrl 3:2 . Since the computation of each digit of 02 · a(x) and 03 · a(x) requires at most 3 coefficients of a(x) (Table 4b), this operation can be implemented by means of 8 LUT6_2 primitives. Figure 7 describes how we implement multiplication by I, M E , M D , and P by combining the outputs of those tables. One easily checks that this circuit is equivalent to the one illustrated in Fig. 6 . In particular, note that the content of registers r 2 and r 3 is given by: 
and
if ctrl 3:2 = 11.
Our matrix multiplication unit involves 16 LUT6_2 primitives and 32 LUT6 primitives, resulting in a total requirement of 12 slices on a Virtex-5 FPGA. Compared to the MixColumns operator of the ECHO-256 coprocessor described in [8], where only multiplication by $M_E$ is needed, the hardware overhead amounts to 4 Virtex-5 slices. Matrix B is either the identity matrix I or the InvMixColumns matrix (Table 3). We followed a similar strategy to implement multiplication by B.

Addition over $\mathbb{F}_{2^8}$

Figure 8 describes the component we designed to perform the AddRoundKey step. Since our matrix multiplication units output 4 bytes, we perform 4 additions over $\mathbb{F}_{2^8}$ in parallel and store the result in a shift register. This approach allows us to write data byte by byte in the register file. Here again, a simple modulo-4 counter controls the process: a new result is loaded during the first clock cycle, and then shifted in the three subsequent clock cycles. The same component performs the additions involved in the round key derivation. However, additional hardware resources are needed to handle the round constant RC. A multiplexer controlled by $\mathrm{ctrl}_6$ selects the operand loaded in the register when the clock enable signal $\mathrm{ctrl}_7$ is equal to 1: the initial value 01 or $x \cdot RC$. When $i \bmod N_k = 0$, the control unit sets $\mathrm{ctrl}_8$ to 1 so that RC is added to $k_{0,i}$.
Recall that the BIG.MixColumns step does not involve any round key addition (see Algorithm 5, line 18 and Table 3 ).
In order to use the same datapath for this operation, we add the constant 00 stored in the key memory of the coprocessor.
During the BIG.Final step, two bytes are read from the register file at each clock cycle, and accumulated thanks to the feedback mechanism controlled by ctrl 4 and ctrl 5 (here again, the signal sk 0 is obtained by reading the constant 00 from the key memory). Thus, the computation of each byte of V (i) involves four and two clock cycles for ECHO-224/256 and ECHO-384/512, respectively. All other operations require six clock cycles (Fig. 5) . Therefore, special attention must be paid to the design of the control unit in order to take the latency of each operation into account.
ECHO key schedule
The choice of an 8-bit datapath makes it possible to increment the internal 64-bit counter κ in 8 clock cycles, thus keeping the critical path of the adder as short as possible. Figure 5 describes the pipelined adder implementing the ECHO key schedule. $k_1$ is stored in the key memory and is read byte by byte. During the first clock cycle, we add the constant 1 to the least significant byte of $k_1$ and store the output carry in a flip-flop. This carry bit is then added to the second byte of $k_1$, and the content of the flip-flop is updated accordingly. We repeat this process until the 64 least significant bits of $k_1$ are updated. Since the 8 most significant bytes of $k_1$ are not modified, we simply add the constant 0 in the remaining clock cycles. A modulo-16
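The byte-serial update of $k_1$ can be sketched as follows. This Python model rests on assumptions that are not stated in the text: the function name `increment_k1` is hypothetical, $k_1$ is represented least significant byte first, and the carry is discarded after the eighth byte so that the 8 most significant bytes stay unchanged.

```python
def increment_k1(k1: bytes) -> bytes:
    """Increment the 64 least significant bits of k1, one byte per clock cycle.

    k1 holds 16 bytes, least significant byte first (assumed byte order).
    """
    if len(k1) != 16:
        raise ValueError("k1 must be 16 bytes long")
    out = bytearray(16)
    carry = 0                            # models the carry flip-flop
    for cycle in range(16):              # one iteration per clock cycle
        if cycle == 0:
            addend = 1                   # constant 1 in the first cycle
        elif cycle < 8:
            addend = carry               # propagate the stored carry
        else:
            addend = 0                   # upper 8 bytes: add the constant 0
        total = k1[cycle] + addend
        out[cycle] = total & 0xFF
        carry = total >> 8
    return bytes(out)


example = increment_k1(bytes([0xFF, 0x00] + [0x00] * 14))
```

Only an 8-bit adder and a single flip-flop are involved at each step, which is the point of the byte-serial organization.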
Memory organization
Since we consider an 8-bit datapath, the memory of our coprocessor is organized into bytes. We will show below that 10 address bits are needed to access message blocks and intermediate data, thus allowing us to implement the register file and the key memory by means of a single Virtex-5 or Virtex-6 block RAM configured as two independent 18 Kb RAMs (Fig. 5) .
Register file. Recall that an ECHO state is an array of 256 bytes $a^{(k)}_{i,j}$, where $0 \le i, j \le 3$ and $0 \le k \le 15$ (Fig. 3). Let us define the 8-bit address of $a^{(k)}_{i,j}$ as $16k + 4j + i$ (i.e. the 4 most significant bits encode the index k, and the 4 least significant bits define the location of the byte in the AES state $A^{(k)}$).
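The address mapping can be written down directly. In the Python sketch below, the function names `byte_address` and `decode_address` are hypothetical and only illustrate the formula $16k + 4j + i$.

```python
def byte_address(k: int, i: int, j: int) -> int:
    """8-bit address of byte (row i, column j) of AES state k in a 256-byte block."""
    return 16 * k + 4 * j + i


def decode_address(address: int):
    """Recover (k, i, j) from an 8-bit address."""
    k = address >> 4          # 4 most significant bits: index of the AES state
    j = (address >> 2) & 0x3  # column inside the AES state
    i = address & 0x3         # row inside the AES state
    return k, i, j


all_addresses = sorted(
    byte_address(k, i, j) for k in range(16) for i in range(4) for j in range(4)
)
```

Each of the 256 bytes of an ECHO state receives a distinct address, and the bytes of a column occupy four consecutive locations.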
We decided to organize the register file into four blocks of 256 bytes selected by two additional address bits (Fig. 9) . In order to implement ECHO according to Algorithm 5, we need a first 4 × 4 array of AES states to store the chaining variable and the message block. The compression function involves two additional arrays (ECHO states A and B in Algorithm 5). We use the 128 least significant bits of A and (Fig. 1) .
Key memory. Besides a copy of the AES round keys, the key memory contains $k_1$, $k_2$, and a block whose bytes are all set to zero, which provides us with the constant 00 needed for the BIG.MixColumns and BIG.Final steps (Sect. 4.1.3). Thus, no dedicated hardware is needed to force $sk_0$, $sk_1$, $sk_2$, and $sk_3$ to 00.
In the following, we show that our careful organization of the data in the register file and in the key memory allows one to design a control unit based on a 4-bit counter, an 8-bit counter, and a simple Finite State Machine (FSM).
Control unit
The control bits of our unified ALU, the read and write addresses of the register file and the key memory, and the write enable signals are computed by a control unit that mainly consists of an address generator and an instruction memory. An FSM, four internal registers, and a stack allow us to select and execute the algorithm specified by the user.
Address generation
The address generation process is the most challenging task in the design of a low-area unified coprocessor for the AES and the hash function ECHO: at first glance, it seems that each task (AES key expansion, AES encryption, AES decryption, BIG.MixColumns, etc.) requires a different addressing scheme. However, we described a way to generate the eight least significant bits of all read and write addresses of ECHO-256 by means of a counter by 5 modulo 16 and a modulo-256 counter [8]. We show here that our address generator can be slightly modified in order to support ECHO-512 and the AES (Figs. 10 and 11). Note that our control unit generates at each clock cycle a read address and its corresponding write address. Since our coprocessor embeds several pipeline stages (Fig. 5), it is necessary to delay write addresses and write enable signals accordingly. Shift registers allow us to synchronize signals in our coprocessor. On Xilinx devices, they are efficiently implemented by means of SRL16 primitives, whose depth is dynamically adjusted according to the algorithm being executed (Fig. 10c): the latency of the BIG.Final step is equal to six and four clock cycles for ECHO-224/256 and ECHO-384/512, respectively. In all other cases, the datapath includes eight pipeline stages.

Figure 10 describes the generation of the write enable signals and the two most significant bits of read and write addresses. The architecture is fairly simple in the case of the key memory: two control bits $\mathrm{ctrl}_{6:5}$ select one of the four blocks of 256 bytes. For a given algorithm, read and write operations always occur in the same block and share the same two most significant address bits. Since the BIG.Final step does not modify the key memory, an 8-stage FIFO suffices to synchronize the write address and the write enable signal.
The register file needs more careful attention. Recall that 128-bit plaintext or ciphertext blocks, chaining variables and message blocks are stored in the first block of 256 bytes of the register file (Fig. 9). The first intermediate variables are written in the second block. Thus, the two most significant bits of read and write addresses must be set to 00 and 01, respectively. This task is performed thanks to two multiplexers controlled by $\mathrm{ctrl}_{10:9}$ and $\mathrm{ctrl}_{8:7}$. Then, read and write operations alternate between the second and the third blocks of 256 bytes. It suffices to flip the bits of the write address. In the case of the read address, we wish to generate the sequence 00 → 01 → 10 → 01 → .... Let $a_{1:0}$ denote the two most significant bits of the current read address. We easily check that we obtain the next read address $b_{1:0}$ by computing $b_0 \leftarrow \bar{a}_0 \vee a_1$ and $b_1 \leftarrow a_0$. Of course it would have been possible to add a fourth input to the multiplexer controlled by $\mathrm{ctrl}_{10:9}$ in order to set the read address to 01. It would then suffice to flip the address bits to switch between the second and the third memory block. However, this approach would imply two distinct instructions to switch from the first to the second block, and between the second and the third blocks, thus increasing the size of the instruction memory.

Figure 11 describes how we generate the eight least significant bits of read and write addresses (i.e. the location of a byte in a block of 256 bytes).

AES key schedule. Figure 12 illustrates the scheduling of the AES-128 key expansion algorithm. Since $N_r \le 14$, the round key array contains at most 240 bytes, and we can use the modulo-256 counter to process it byte by byte (Algorithm 1): a new byte $k_{j,i}$ of the array is computed from $k_{j,i-N_k}$ and $k_{j,i-1}$. Recall that the address of $k_{j,i-N_k}$ is given by $j + 4i − 4N_k$ and assume that it is provided by the modulo-256 counter. It suffices to increment the counter by $4 \cdot (N_k − 1)$ and $4N_k$ to obtain the addresses of $k_{j,i-1}$ and $k_{j,i}$, respectively. Our address generator is provided with $N_k − 1$, and a 6-bit adder allows us to increment the current value of the modulo-256 counter by $4 \cdot (N_k − 1)$ (Fig. 11). Note that $N_k = 2 \cdot (((N_k − 1) \operatorname{div} 2) + 1)$ for $N_k \in \{4, 6, 8\}$.
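The block-select logic of the register file and the three key schedule addresses described above can be modelled as follows. This Python sketch is illustrative: the function names are hypothetical, block numbers 0 to 3 stand for the two most significant address bits, and the three addresses are returned together although the hardware produces them in successive clock cycles.

```python
def next_read_block(a: int) -> int:
    """Next value of the two MSBs of the read address: b0 = not(a0) or a1, b1 = a0."""
    a1, a0 = (a >> 1) & 1, a & 1
    b0 = (1 - a0) | a1
    b1 = a0
    return (b1 << 1) | b0


def next_write_block(w: int) -> int:
    """Next value of the two MSBs of the write address: flip both bits."""
    return w ^ 0b11


def key_schedule_addresses(counter: int, n_k: int):
    """Addresses of k_{j,i-N_k}, k_{j,i-1}, and k_{j,i} from the modulo-256 counter."""
    older = counter % 256
    previous = (counter + 4 * (n_k - 1)) % 256
    new = (counter + 4 * n_k) % 256
    return older, previous, new


read_blocks, write_blocks = [0b00], [0b01]
for _ in range(4):
    read_blocks.append(next_read_block(read_blocks[-1]))
    write_blocks.append(next_write_block(write_blocks[-1]))
```

At every step after the first one, the read block is the block written during the previous step, so a single instruction covers both the initial transition and the subsequent alternation.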
AES encryption.
Recall that the ShiftRows step is implemented by accordingly addressing the register file (Sect. 4.1) and that the order in which bytes are processed during the first AddRoundKey step does not matter. In order to update a column of the AES state, we have to read a 0, j , a 1,( j+1) mod 4 , a 2,( j+2) mod 4 , and a 3,( j+3) mod 4 , where 0 ≤ j ≤ 3 (Algorithm 3). During an encryption round, the control unit performs the following tasks ( Fig. 13): -Read a byte of the AES state from the register file. Starting from 0 (i.e. the address of a 0,0 ), we generate all read addresses thanks to a counter by 5 modulo 16. -Read a byte of the round key from the key memory. The modulo-256 counter allows us to process the round key array column by column. -Update one byte of the AES state. Since the AES state is updated column by column, the address is given by the 4 least significant bits of the modulo-256 counter.
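A short Python check shows why a counter by 5 modulo 16 produces the read addresses of ShiftRows. The sketch assumes the byte address $4j + i$ for $a_{i,j}$ inside an AES state; the function names are hypothetical.

```python
def strided_addresses(step: int, start: int = 0):
    """Sixteen successive values of a modulo-16 counter incremented by `step`."""
    address, sequence = start, []
    for _ in range(16):
        sequence.append(address)
        address = (address + step) % 16
    return sequence


def shift_rows_reads():
    """Addresses of a_{i,(j+i) mod 4}, listed column by column (j) and row by row (i)."""
    return [4 * ((j + i) % 4) + i for j in range(4) for i in range(4)]


encryption_reads = strided_addresses(5)
```

The first four values 0, 5, 10, 15 are the addresses of $a_{0,0}$, $a_{1,1}$, $a_{2,2}$, and $a_{3,3}$, i.e. the bytes needed to update the first column.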
In order to update the value of $a_{3,3}$, we have to provide our ALU with $a_{0,3}$, $a_{1,0}$, $a_{2,1}$, and $a_{3,2}$. Our control unit will generate the address of $a_{3,2}$ (read operation) and $a_{3,3}$ (write operation) at time t. Since our coprocessor includes D = 8 pipeline stages, we will write the new value of $a_{3,3}$ in the register file at time t + D (Fig. 14). Therefore, we have to wait D − 3 = 5 clock cycles before starting the next encryption round. Then, we read $a_{0,0}$ at time t + D − 2, $a_{1,1}$ at time t + D − 1, $a_{2,2}$ at time t + D, and $a_{3,3}$ at time t + D + 1, thus satisfying the constraints implied by data dependencies. Each encryption round requires 16 + D − 3 = 21 clock cycles. It is possible to relax this constraint by interleaving two (or more) AES encryptions. However, this approach works only in the case of a chaining mode without output feedback or during the BIG.SubWords step of ECHO, where we process 16 AES states.
AES decryption. Two simple modifications of the AES encryption addressing scheme allow us to decrypt a ciphertext block (Fig. 15): -In order to perform InvShiftRows instead of ShiftRows, it suffices to increment the modulo-16 counter by 13 instead of 5. Therefore, only the most significant bit of the offset depends on the algorithm. -The 128-bit round keys must be introduced in reverse order: the jth step of decryption involves the (N r − j)th round key (0 ≤ j ≤ N r ). Since the 16 bytes of round key j are stored from address 16 j to 16 j + 15 ( Fig. 1) , we have to modify the four most significant bits of the address in order to perform decryption. Furthermore, N r is always even, and the least significant bit of N r − j has the same value as the one of j. Thus, we can compute the three most significant bits of N r − j by means of three look-up tables addressed by j.
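Both modifications can be checked with a few lines of Python. The sketch assumes the byte address $4j + i$ for $a_{i,j}$ and hypothetical function names; it models the address arithmetic only, not the look-up tables of the control unit.

```python
def strided_addresses(step: int, start: int = 0):
    """Sixteen successive values of a modulo-16 counter incremented by `step`."""
    address, sequence = start, []
    for _ in range(16):
        sequence.append(address)
        address = (address + step) % 16
    return sequence


def inv_shift_rows_reads():
    """Addresses of a_{i,(j-i) mod 4}, listed column by column (j) and row by row (i)."""
    return [4 * ((j - i) % 4) + i for j in range(4) for i in range(4)]


def decryption_key_base(step: int, n_r: int) -> int:
    """Base address of the round key used at decryption step `step` (0 <= step <= n_r)."""
    return 16 * (n_r - step)


decryption_reads = strided_addresses(13)
offsets_differ_in_msb_only = (13 ^ 5) == 0b1000
```

Since 13 and 5 differ only in their most significant bit, switching between ShiftRows and InvShiftRows costs a single control bit, and the parity argument on $N_r − j$ removes one look-up table from the key address computation.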
The control unit embeds an internal register that indicates which algorithm is executed. The most significant bit of the offset as well as the control signals of the multiplexers selecting the read address of the key memory depend only on the content of this register. Thanks to this design strategy, encryption and decryption share the same code.

ECHO. Figure 16 describes the address generation process of ECHO. The only difference between BIG.SubWords and AES encryption is that we now have to process 16 AES states. The four most significant bits of the address are therefore given by the four most significant bits of the modulo-256 counter. The BIG.Final step requires careful attention: in order to speed up this operation, we read a byte of the chaining variable or of the message block on the first port of the register file, and a byte of the internal state (i.e. the output of the last round) on the second one. We describe this process in Fig. 16 in the case of ECHO-224/256. Modifying the scheduling for ECHO-384/512 is straightforward.
Instruction memory
We implemented two mechanisms in our control unit in order to keep the size of the instruction memory as small as possible:
- Nested loops. Consider for instance AES encryption: since the number of rounds $N_r$ depends on the desired level of security, we need a loop instruction in order to share the same code between AES-128, AES-192, and AES-256. When encryption starts, the value of $N_r$ is loaded in one of the four internal registers of the control unit. The loop instruction will therefore include the address of the register. A nested loop is then needed to process all the columns of the AES state. The number of iterations is the same, regardless of the chosen security level, and can be specified in the loop instruction. Therefore, we implemented two addressing modes (absolute and register indirect). Each time a loop instruction is executed, the return address and the number of iterations are pushed onto a stack.

Thanks to these mechanisms, the instruction memory contains only 3 algorithms: AES key expansion (58 instructions), AES encryption/decryption (26 instructions), and ECHO (36 instructions).
Results and comparisons
We captured our architecture in the VHDL language and prototyped our coprocessor on Virtex-5 and Virtex-6 FPGAs with average speed grade. Tables 5 and 6 summarize the place-and-route results measured with ISE 12.3 and the throughput of each algorithm implemented, respectively. It is of course possible to reduce the number of slices by implementing a subset of the functionalities (e.g. a single level of security, AES without key expansion, etc.).
- Feldhofer et al. [18] have introduced a protocol based on the AES for authenticating an RFID tag to a reader device. The challenge was to propose a low-power AES-128 encryption core suitable for RFID tags. In order to keep the number of registers as small as possible, round keys are computed just in time by using the S-box and the XOR functionality of the datapath. The coprocessor needs 1016 clock cycles for the encryption of a 128-bit plaintext block (including key expansion). Our approach involves a smaller number of clock cycles; however, it would be unfair to make a comparison between an architecture optimized for RFID tags (0.35 µm CMOS process) and a coprocessor taking advantage of the features of today's FPGA technology.
- Good and Benaissa [22, 23] have proposed an 8-bit Application Specific Instruction Processor (ASIP) for AES-128. They defined a minimal set of instructions to perform the operations required by the AES, and the control unit mainly consists of a program ROM, an instruction decoder, and a program counter. Their coprocessor needs 122 Spartan-II slices and is therefore more compact than our architecture. The average throughput for encryption and decryption (including the key schedule that is performed on-the-fly) is equal to 2.18 Mbps (3691 clock cycles are needed to encrypt a 128-bit plaintext block). On a Virtex-5 FPGA, the same design would achieve much better performance: the clock frequency would be higher (Xilinx produces the Virtex-5 family in a 65 nm CMOS process, whereas the Spartan-II family was based on a 0.18 µm CMOS technology) and the number of slices would be roughly divided by two (a Virtex-5 slice contains four function generators configurable as 6-input LUTs or dual-output 5-input LUTs, whereas a Spartan-II slice includes only two 4-input LUTs). Therefore, the 8-bit ASIP should have a slightly better area-time tradeoff than our coprocessor for short messages (according to Table 6, our coprocessor requires 596 clock cycles to perform the key expansion step and encrypt a 128-bit plaintext block). For long messages, our architecture should be a better choice.
- Hämäläinen et al. [24] have designed several AES-128 cores implementing encryption and key expansion. The throughput varies between 121 Mbps and 232 Mbps according to the optimization criterion (area, power, or speed). Since they have synthesized their core to gate level using a 0.13 µm standard-cell CMOS technology, it is again difficult to make a comparison between their work and our architecture.
- Helion Technology [26] is selling a tiny AES core that implements encryption, decryption, and key expansion at all levels of security. The coprocessor occupies only 97 Virtex-5 slices and achieves a throughput of 78 Mbps in the case of AES-128. The slice count is reduced to 88 on a Virtex-6 device, and the throughput of AES-128 is equal to 83 Mbps. Our coprocessor is twice as big, but we achieve a better encryption/decryption rate and improve the area-time product compared to the tiny AES core designed by Helion Technology. Thus, combining the hash function ECHO with the AES does not impact the overall performance of the latter.
Low-resource SHA-1 and SHA-2 cores

Table 7 summarizes the results reported by Helion Technology [25] for their family of compact SHA-1 and SHA-2 cores on Virtex-6 FPGAs. The unified core for SHA-1, SHA-224/256, and SHA-384/512 turns out to be larger and slightly slower than our coprocessor. Furthermore, the Helion commercial core must be supplemented with an AES core to provide the same functionalities as our architecture. If we assume that the security of ECHO is at least as good as the one of SHA-2, ECHO is a clear winner for resource-constrained devices.
Round two SHA-3 candidates
A few researchers have proposed compact implementations of a subset of round two SHA-3 candidates. Table 8 provides the reader with a comparison of coprocessors optimized for Virtex-5 devices (note that BLAKE and Keccak have been selected as finalists in December 2010). Compared to the ECHO-256 coprocessor we described in [8] , our new architecture also provides the user with ECHO-512 and the AES (encryption, decryption, and key expansion at all levels of security) at the cost of 66 slices. Thanks to a better pipelining, we also managed to achieve a slightly higher clock frequency in this work.
Shabal [11] ranks first in terms of throughput and area-time trade-off. Detrey et al. [14] noted that only a small fraction of the internal state of Shabal is used at any step of the algorithm. They exploited this fact and minimized the area of the circuit by taking advantage of the dedicated shift register resources available in recent Xilinx devices (SRL16 primitive). Combined with a tiny AES core, Shabal is an excellent candidate for low-area implementations on Xilinx devices. However, porting this coprocessor to FPGAs that do not embed SRL16-like primitives might have an important impact on the overall performance. The architecture described in this work includes only a small number of SRL16 primitives in order to synchronize control signals. Therefore, it should be more portable than the Shabal coprocessor designed by Detrey et al. [14].
Several researchers provided the scientific community with comparisons of parallel architectures for the 14 round two SHA-3 candidates (see for instance [27]). The main criticism leveled at ECHO is its poor throughput to area ratio when compared to most of the round two SHA-3 candidates. Our results contradict previous studies: as far as compact implementations are concerned, ECHO offers for instance a better area-time trade-off than Keccak or BMW. When the coprocessor must offer several digest sizes and AES encryption/decryption, ECHO should also perform better than BLAKE.
Conclusion
We described a low-area coprocessor for the AES (encryption, decryption, and key expansion) and the cryptographic hash function ECHO at all levels of security. Our architecture is built around an 8-bit datapath and the ALU performs a single instruction that allows for implementing both algorithms.
Thanks to a careful organization of AES and ECHO internal states in the register file, the control unit remains simple, despite the various addressing schemes required for the different steps of the AES and ECHO: all read and write addresses are generated by means of a modulo-16 counter and a modulo-256 counter. Our results show that:
- At the cost of 66 slices, one can modify the ECHO-256 coprocessor we described in [8] in order to include ECHO-512 and the AES (encryption, decryption, and key expansion at all levels of security). Thanks to a better pipelining, the throughput of our novel architecture is even slightly improved.
- Our coprocessor improves the area-time product compared to the tiny AES core designed by Helion Technology [26]. Combining ECHO with the AES does not impact the overall performance of the latter.
- Assuming that the security guarantees of ECHO are at least as good as the ones of the SHA-3 finalists BLAKE and Keccak, ECHO is a better candidate for low-area cryptographic coprocessors.
Furthermore, we believe that the design strategy we proposed in this work can be applied to the SHA-3 finalist Grøstl [21] . We expect to obtain a much more compact unified coprocessor (AES and Grøstl) than the one described by Järvinen [28] .
