Abstract-Nowadays programmable devices (microprocessors and DSPs) are based on complex architectures optimized for maximum speed, but their performance degrades when the implemented application is mostly based on operations on single bits or subsets of bits. This kind of data processing and bit manipulation can be accelerated by using a Reconfigurable Functional Unit (RFU). In this paper the benefits of using the ADAPTO RFU (Adder-Based Dynamic Architecture for Processing Tailored Operators) [1] [2] to speed up the Advanced Encryption Standard (AES) algorithm are investigated. The paper shows how the ADAPTO architecture accelerates the AES algorithm through an efficient implementation of its most complex operations. A comparison in terms of number of assembly instructions is given.
I. INTRODUCTION
The spreading diffusion of digital techniques for data and signal processing is pushing toward microprocessor and DSP architectures of increasing efficiency.
An important limitation of such new structures is their low efficiency in computing operations based on a reduced data parallelism. Examples of these operations are bit permutations, where input bits are rearranged individually or on the basis of short subwords [4], and polynomial multiplication in a Galois Field (GF), where only some input bits are multiplied (XORed) by the polynomial coefficients [10].
Of course, these nonstandard operations can be performed by conventional processors, but their implementation requires several standard instructions. Despite its large use, this approach is not efficient and reduces the processor speed performance.
Several solutions have been proposed in the literature to overcome this drawback, either in software [4] or hardware. Among the hardware solutions the most interesting ones, in terms of flexibility and performance, are based on Reconfigurable Functional Units (RFUs). Proposed RFUs are similar to small FPGAs (arrays of LUTs and pass-transistors for the programmable interconnect) and are connected in parallel to the ALU, sharing the Register File (RF) and working as a hardware Instruction Set (IS) expansion [6].
RFUs are similar to conventional coprocessors in that both expand the core instruction set, but coprocessors are not integrated in the datapath unit and must use the system bus to exchange data.
In this scenario, great efforts are devoted to developing efficient processors for embedded systems applications. In this case, apart from evaluations based on computational performance, very critical constraints are represented by power consumption and cost factors, which are strongly related to the complexity of the architecture and consequently to the silicon area.
To face these constraints, in [1], [2] the authors proposed a new architecture named ADAPTO in which the LUTs [5] used for the implementation of general purpose logic have been replaced by Full-Adders (FAs). The resulting architecture is less expensive than those proposed in the literature, but this favorable characteristic is counterbalanced by a reduced flexibility.
In order to evaluate the trade-off between complexity and flexibility, the authors defined a set of experiments based on typical embedded systems applications. Those applications have been executed by using ADAPTO and a general purpose microprocessor (by using an architecture emulator) in order to verify the speed-up factor [3] .
In this paper we show that ADAPTO can also be used to accelerate the Rijndael AES algorithm [10], which is based on operations on GF(2^n). The paper is organized as follows. In Section II ADAPTO is briefly described, while Section III illustrates the AES algorithm. In Section IV the ADAPTO implementation of the AES algorithm is presented with the evaluation of the obtained speed-up. Finally, in Section V the conclusions are drawn.
II. THE ADAPTO ARCHITECTURE
The ADAPTO RFU architecture is based on three alternating stripes of Logic Blocks (LBs) and interconnect with a parallelism of 32 bits (both for inputs and outputs). The architecture has been conceived to be connected to the main processor Register File (RF). LBs are based on FAs that perform both logical and arithmetical operations, while the interconnect is based on pass transistors (as shown in Fig. 1). Multicontext operation is implemented by context memories (LBs and interconnect programming).
FAs can be configured to execute one-bit addition, NOT and PASS, 2-input AND, 2-input OR, 2- and 3-input XOR, and 3-input majority.
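The operations listed above follow from the two outputs of a full adder: the sum bit is a 3-input XOR and the carry bit is a 3-input majority. The sketch below illustrates this in Python. The function names and the choice of which input is tied to a constant are assumptions made for illustration; they are not the actual ADAPTO configuration encoding.

```python
def full_adder(a, b, c):
    """One-bit full adder: returns (sum, carry)."""
    s = a ^ b ^ c                          # sum   = 3-input XOR
    carry = (a & b) | (a & c) | (b & c)    # carry = 3-input majority
    return s, carry

# Hypothetical configurations: each function fixes some FA inputs to a
# constant and selects one of the two FA outputs.
def fa_pass(a):        return full_adder(a, 0, 0)[0]
def fa_not(a):         return full_adder(a, 1, 0)[0]
def fa_xor2(a, b):     return full_adder(a, b, 0)[0]
def fa_xor3(a, b, c):  return full_adder(a, b, c)[0]
def fa_and2(a, b):     return full_adder(a, b, 0)[1]
def fa_or2(a, b):      return full_adder(a, b, 1)[1]
def fa_maj3(a, b, c):  return full_adder(a, b, c)[1]
```

Tying the third input to 0 or 1 turns the majority output into AND or OR respectively, which is why a single FA can cover the whole list without a LUT.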
The structure of the interconnect is shown in Fig. 2, and is based on a multicontext approach (for a high reconfiguration speed). Each LB output can be linked with any input of the LBs of the bottom row. In addition to the 32 inputs coming from the upper LBs, two additional lines directly connected 0 
III. THE AES ALGORITHM
In the AES algorithm (which is based on byte operations) the encryption of a data block is composed of: an XOR step, several round transformations (in the following simply rounds), and a final round (different from the previous ones). In the case of 128-bit blocks, the data are arranged as 16 bytes (the state) and are organized in a matrix as follows
The encryption of a 128-bit block requires 9 rounds, each one composed of four processing steps:
1) SubBytes: a non-linear substitution step where each byte is replaced by addressing a Look Up Table (LUT).
2) ShiftRows: a transposition step where each row of the state is cyclically shifted by a number of steps.
3) MixColumns: a mixing operation on the columns of the state, combining the four bytes in each column.
4) AddRoundKey: each byte of the state is combined with the round key.
In the final round, the MixColumns transformation is not performed. For the decryption, the inverse transformations InvSubBytes, InvShiftRows, InvMixColumns and a slightly different AddRoundKey are used.
In a software implementation of the AES encryption/decryption, SubBytes and InvSubBytes operations are efficiently implemented by using 256-byte LUTs, while ShiftRows and InvShiftRows are byte reorderings and have a simple software implementation. Moreover, ShiftRows and InvShiftRows can be merged with SubBytes and InvSubBytes and therefore their impact on the performance of the AES algorithm is negligible. AddRoundKey, a two-input 32-bit XOR operation, can also be efficiently implemented in standard software.
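To make the "simple software implementation" claim concrete, the following Python sketch shows ShiftRows as a pure byte reordering and AddRoundKey as one 32-bit XOR per column. It assumes the column-major state layout of the AES standard (byte r + 4c holds row r, column c); the function names are illustrative and are not taken from the benchmark code used in this paper.

```python
def shift_rows(state):
    """ShiftRows on a 16-byte, column-major state.

    Row r is rotated left by r positions, so the output byte at
    (row r, column c) is the input byte at (row r, column (c + r) mod 4).
    """
    return [state[r + 4 * ((c + r) % 4)] for c in range(4) for r in range(4)]


def add_round_key(columns, round_key):
    """AddRoundKey with each column packed in one 32-bit word."""
    return [word ^ key for word, key in zip(columns, round_key)]
```

Because ShiftRows only changes indices, it can be folded into the table lookups of SubBytes, which is the merging mentioned above.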
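The operations that can take advantage of an RFU are MixColumns/InvMixColumns, because of their implementation complexity: on an ARM926EJ-S RISC they require about 50 assembly instructions. The MixColumns transformation is based on a matrix multiplication in GF(2^8), where each column of I is multiplied by each row of

\[
M = \begin{bmatrix}
\mathtt{0x02} & \mathtt{0x03} & \mathtt{0x01} & \mathtt{0x01} \\
\mathtt{0x01} & \mathtt{0x02} & \mathtt{0x03} & \mathtt{0x01} \\
\mathtt{0x01} & \mathtt{0x01} & \mathtt{0x02} & \mathtt{0x03} \\
\mathtt{0x03} & \mathtt{0x01} & \mathtt{0x01} & \mathtt{0x02}
\end{bmatrix}
\]

obtaining 16 results. Each element of the matrix (hexadecimal notation) represents a polynomial of degree at most 7 whose coefficients correspond to its binary representation (e.g. 0x0A = 0b00001010 = x^3 + x). We remark that constant multiplications are performed on GF(2^8) by using the AES generator polynomial

\[
x^8 + x^4 + x^3 + x + 1
\]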
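For the InvMixColumns transformation each column of I is multiplied by each row of the inverse of M

\[
M^{-1} = \begin{bmatrix}
\mathtt{0x0E} & \mathtt{0x0B} & \mathtt{0x0D} & \mathtt{0x09} \\
\mathtt{0x09} & \mathtt{0x0E} & \mathtt{0x0B} & \mathtt{0x0D} \\
\mathtt{0x0D} & \mathtt{0x09} & \mathtt{0x0E} & \mathtt{0x0B} \\
\mathtt{0x0B} & \mathtt{0x0D} & \mathtt{0x09} & \mathtt{0x0E}
\end{bmatrix}
\]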
The multiplications involved in InvMixColumns are more complex than those involved in MixColumns and represent the most expensive task in the AES algorithm. In the next section we show how these operations are implemented by using ADAPTO.
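As a software reference for the two transformations, the Python sketch below multiplies one state column by a constant matrix in GF(2^8). It is a plain bit-serial model, not the ARM code measured in this paper. The matrix M is the standard AES MixColumns matrix and AES_POLY is the AES generator polynomial x^8 + x^4 + x^3 + x + 1; the names gf_mul and mat_col are illustrative.

```python
AES_POLY = 0x11B  # x^8 + x^4 + x^3 + x + 1

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8): shift-and-add with reduction."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a          # addition in GF(2^8) is XOR
        b >>= 1
        a <<= 1
        if a & 0x100:            # degree 8 reached: subtract the generator
            a ^= AES_POLY
    return result

M = [[0x02, 0x03, 0x01, 0x01],
     [0x01, 0x02, 0x03, 0x01],
     [0x01, 0x01, 0x02, 0x03],
     [0x03, 0x01, 0x01, 0x02]]

M_INV = [[0x0E, 0x0B, 0x0D, 0x09],
         [0x09, 0x0E, 0x0B, 0x0D],
         [0x0D, 0x09, 0x0E, 0x0B],
         [0x0B, 0x0D, 0x09, 0x0E]]

def mat_col(matrix, column):
    """Multiply each row of `matrix` by one 4-byte state column."""
    out = []
    for row in matrix:
        acc = 0
        for coeff, byte in zip(row, column):
            acc ^= gf_mul(coeff, byte)
        out.append(acc)
    return out
```

Each call performs four row-by-column products, so a full MixColumns on the 4x4 state gives the 16 results mentioned above; applying M_INV to the output of M returns the original column.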
IV. ADAPTO IMPLEMENTATION

A. Data allocation strategy
The allocation of data has been performed considering the RF of a 32-bit microprocessor. The implementation of the MixColumns and InvMixColumns operations in ADAPTO requires choosing an appropriate organization to store the 16 bytes of I in the RF [1]. In our approach, I is stored by column (Table I). In this way ADAPTO can load an entire column at each clock cycle.

In this subsection, GF(2^8) constant multiplication is illustrated (GF multiplication corresponds to a conventional polynomial multiplication followed by a division by the generator polynomial, of which the remainder is kept). In order to illustrate the use of ADAPTO, we consider the following example. Take the multiplication of a generic 8-bit polynomial

\[
P = p_7 x^7 + p_6 x^6 + \dots + p_1 x + p_0
\]

by the constant polynomial (x + 1), corresponding to the hexadecimal number 0x03. The operation 0x03·P in the GF gives

\[
\mathtt{0x03} \cdot P = (x + 1) \cdot P \bmod (x^8 + x^4 + x^3 + x + 1).
\]

In short, we can write

\[
\mathtt{0x03} \cdot P = P + (P \ll 1) + p_7 \cdot (x^4 + x^3 + x + 1)
\]

where P << 1 is truncated to 8 bits. Figure 3 shows the ADAPTO implementation of the constant multiplication by 0x03. Shift operations, such as the left shift P << 1 required in the above multiplication, are performed directly by the ADAPTO interconnection network. The term p_7·(x^4 + x^3 + x + 1) can also be implemented by the interconnection network. In fact, let us call Z = z_7 z_6 z_5 z_4 z_3 z_2 z_1 z_0 the byte representing the result of p_7·(x^4 + x^3 + x + 1). We have z_7 = z_6 = z_5 = z_2 = 0 and z_4 = z_3 = z_1 = z_0 = p_7. Therefore the interconnect can compute Z by forcing some LB inputs to zero and connecting p_7 to the remaining LBs. The stripe following the interconnection is configured as a three-input XOR, performing the required addition of the three terms on GF(2^8).
Fig. 3. Implementation of 0x03·B in ADAPTO
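The mapping described above (routing done by the interconnect, then one three-input XOR per bit position) can be modelled bit by bit. The Python sketch below is an illustration only: the list-of-bits representation and the name mul03_bitwise are assumptions and do not reflect the actual ADAPTO context format.

```python
def mul03_bitwise(p):
    """0x03 * P in GF(2^8), built as wiring plus one 3-input XOR per bit."""
    bits = [(p >> i) & 1 for i in range(8)]     # bits[i] = p_i
    p7 = bits[7]

    # Interconnect: three 8-bit operands obtained by routing only.
    operand_p = bits                            # P
    operand_shift = [0] + bits[:7]              # P << 1, p_7 falls off
    operand_z = [p7, p7, 0, p7, p7, 0, 0, 0]    # Z: z_0 = z_1 = z_3 = z_4 = p_7

    # LB stripe: eight LBs, each configured as a three-input XOR.
    out_bits = [operand_p[i] ^ operand_shift[i] ^ operand_z[i]
                for i in range(8)]
    return sum(bit << i for i, bit in enumerate(out_bits))
```

No logic is spent on the shift or on Z: both are obtained by choosing which wire (or the constant 0) feeds each LB input, so the whole product costs a single LB stripe.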
Any constant multiplication that can be translated into a bit rearrangement and a sum of three terms can be performed with the above method. In our work any constant multiplication is computed using the set of basic multiplications shown in Table II.

TABLE II
BASIC CONSTANT MULTIPLICATIONS

| C | Implementation of C·P |
|---|---|
| 0x02 | (P << 1) + p_7·(x^4 + x^3 + x + 1) |
| 0x03 | P + (P << 1) + p_7·(x^4 + x^3 + x + 1) |
| 0x04 | (P << 2) + p_7·(x^5 + x^4 + x^2 + x) + p_6·(x^4 + x^3 + x + 1) |
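The three-term forms of Table II can be checked exhaustively against a generic GF(2^8) multiplier. The Python sketch below does so; the masks 0x1B and 0x36 are the byte encodings of x^4 + x^3 + x + 1 and x^5 + x^4 + x^2 + x, and the function names are illustrative. Shifts are truncated to 8 bits, as they would be when performed by routing.

```python
def gf_mul(a, b):
    """Reference multiplier in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return result

def mul02(p):
    p7 = (p >> 7) & 1
    return ((p << 1) & 0xFF) ^ (0x1B * p7)

def mul03(p):
    p7 = (p >> 7) & 1
    return p ^ ((p << 1) & 0xFF) ^ (0x1B * p7)

def mul04(p):
    p7, p6 = (p >> 7) & 1, (p >> 6) & 1
    return ((p << 2) & 0xFF) ^ (0x36 * p7) ^ (0x1B * p6)

def table_ii_holds():
    """True if every Table II form matches the reference for all bytes."""
    return all(mul02(p) == gf_mul(0x02, p) and
               mul03(p) == gf_mul(0x03, p) and
               mul04(p) == gf_mul(0x04, p)
               for p in range(256))
```

Each form is a sum of at most three routed operands, which is what allows it to fit in one interconnect level followed by one stripe of three-input XORs.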
C. ADAPTO implementation of MixColumns
Using the above results, in this subsection we describe the implementation on ADAPTO of the multiplication of a row of M by a column of I. We suppose that each column of I is stored in the RF, according to the data allocation strategy discussed above. D1, D2, D3 are the ADAPTO input operands coming from the RF, and R is the final result that will be returned to the RF. Moreover α, β, and γ are the partial results present at the output of the three LB stripes of ADAPTO.
As an example, we consider the multiplication of the first row of M by the first column of I. Algorithm 1 describes the implementation of this multiplication. The first stripe of ADAPTO is configured as a pass-through and connects A, B, C, D (output α) directly to the first level of interconnect. The constant multiplications 0x02·A and 0x03·B, together with the sum C + D, are computed by the interconnect and the second LB stripe (output β). Note that these mixed operations require 24 LBs configured as three-input XORs, while the eight rightmost LBs are unused. In the algorithm we represent these unused outputs as don't care '-'. Starting from β, the last LB stripe, configured as a three-input XOR, provides the final GF(2^8) sum γ.
The product of the same column by a different row requires the reconfiguration of ADAPTO. Therefore the entire MixColumns operation requires four different ADAPTO contexts (of the 16 available in the current architecture). However, the product of different columns by the same row is computed by using the same context. By using Algorithm 1 each row-by-column multiplication requires 1 ADAPTO instruction. Consequently the whole MixColumns operation is computed in 16 assembly instructions (corresponding to 16 clock cycles, since one clock cycle is required for each ADAPTO context [1], [2]). This implementation has been compared with a MixColumns software implementation present in the benchmark suite described in [11]. This function, compiled on an ARM926EJ-S RISC architecture, requires about 200 assembly instructions. Thus the speed-up obtained with ADAPTO is about 12.5x.
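The data flow just described can be modelled stripe by stripe. The Python sketch below is a behavioural approximation under stated assumptions: A is taken to occupy the most significant byte of the 32-bit operand D1, β is modelled as three bytes (the 24 used LBs), and only the result byte is returned. The function names are hypothetical and the sketch is not a transcription of Algorithm 1.

```python
def routed_double(p):
    """Operands the interconnect provides for 0x02*P: (P << 1, Z)."""
    p7 = (p >> 7) & 1
    return (p << 1) & 0xFF, 0x1B * p7

def mixcolumn_row1(d1):
    """First row of M times one column packed in the 32-bit word d1."""
    # Stripe 1 (pass-through): alpha carries A, B, C, D unchanged.
    a, b, c, d = [(d1 >> shift) & 0xFF for shift in (24, 16, 8, 0)]

    # Interconnect 1 + stripe 2: three bytes, each a three-input XOR.
    a_shift, a_z = routed_double(a)
    b_shift, b_z = routed_double(b)
    beta = (a_shift ^ a_z ^ 0,      # 0x02 * A (third input tied to 0)
            b ^ b_shift ^ b_z,      # 0x03 * B
            c ^ d ^ 0)              # C + D    (third input tied to 0)

    # Interconnect 2 + stripe 3: one three-input XOR gives gamma.
    gamma = beta[0] ^ beta[1] ^ beta[2]
    return gamma
```

For the column (0xDB, 0x13, 0x53, 0x45), packed as 0xDB135345, the model returns 0x8E, which is the first byte of the MixColumns output for that column.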
D. ADAPTO implementation of InvMixColumns
The InvMixColumns operation I × M^-1 is more complex due to the structure of the entries of the matrix M^-1 (here I is different from that used in MixColumns). To simplify the computation, the constant coefficients of the multiplications are expressed in terms of the elementary constants of Table II. Table III shows the decomposition of the multiplications by complex constants. The decomposition of Table III can require more than one context for the computation of a constant multiplication. For example, the multiplication by 0x0E is performed in two phases (corresponding to 2 ADAPTO contexts). For the multiplication of the first row of M^-1 by the first column of the data matrix I we decompose the computation into two terms.
The result R is evaluated in two phases. In the first phase, the first partial results are computed by ADAPTO as constant multiplications of A, B, C, D by 0x04 (corresponding to α), followed by four multiplications by 0x03 or 0x02 (γ computation). These operations are implemented in the first ADAPTO context. In the second phase the input D1 contains A, B, C, D, and the inputs D2 and D3 store the results of the previous phase. We use the ADAPTO inputs D1 and D2 to compute 0x0C·C + C and 0x08·D + D, while D3 is used to feed the previous results to the input of the second stripe of LBs. In this way we compute the three terms 0x0C·A + 0x02·A, 0x03·B, and 0x08·B + 0x0D·C + 0x09·D. The third stripe of ADAPTO sums (XORs) these three terms. The two phases are shown in Algorithm 2.
The two constants of Algorithm 2 remain valid as long as the row of M^-1 is unchanged. When we move to another row the constants change and two other contexts must be used. Consequently, the computation of the whole InvMixColumns in principle requires 8 contexts. Some simplification can be carried out by observing the properties of the entries of M^-1. For example, PHASE 1 can be shared between the first and third rows, and between the second and fourth rows. In fact the first and the third rows have the common term (0x0C·A + 0x08·B + 0x0C·C + 0x08·D), while the second and the fourth rows have (0x08·A + 0x0C·B + 0x08·C + 0x0C·D). This property reduces the number of contexts and the number of times the contexts must be reconfigured. Therefore the computation of

In particular, we show that MixColumns and InvMixColumns can be performed by ADAPTO in 16 and 24 clock cycles respectively, exploiting the 32-bit parallelism to process multiple bytes in parallel and obtaining a speed-up in the range 8.3-12.5x with respect to the software implementation on an ARM926EJ-S RISC architecture.
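The two-phase evaluation can be checked in software. The Python sketch below is a simplified model under stated assumptions: it follows the grouping of terms given in the text for the first row of M^-1, but it does not model the packing of D1, D2, D3 or the real context contents. The function names are illustrative.

```python
def mul02(p):
    return ((p << 1) & 0xFF) ^ (0x1B if p & 0x80 else 0)

def mul03(p):
    return p ^ mul02(p)

def mul04(p):
    return mul02(mul02(p))

def inv_row1_direct(a, b, c, d):
    """Reference: 0x0E*A + 0x0B*B + 0x0D*C + 0x09*D in GF(2^8)."""
    m8 = lambda p: mul02(mul04(p))
    return ((m8(a) ^ mul04(a) ^ mul02(a)) ^ (m8(b) ^ mul02(b) ^ b) ^
            (m8(c) ^ mul04(c) ^ c) ^ (m8(d) ^ d))

def inv_row1_two_phase(a, b, c, d):
    # Phase 1 (first context): multiply by 0x04, then by 0x03 or 0x02.
    t_a = mul03(mul04(a))    # 0x0C * A
    t_b = mul02(mul04(b))    # 0x08 * B
    t_c = mul03(mul04(c))    # 0x0C * C
    t_d = mul02(mul04(d))    # 0x08 * D

    # Phase 2 (second context): three terms, then a three-input XOR.
    term1 = t_a ^ mul02(a)               # 0x0C*A + 0x02*A = 0x0E*A
    term2 = mul03(b)                     # 0x03*B
    term3 = t_b ^ (t_c ^ c) ^ (t_d ^ d)  # 0x08*B + 0x0D*C + 0x09*D
    return term1 ^ term2 ^ term3
```

The phase 1 values depend only on the pattern 0x0C, 0x08, 0x0C, 0x08, not on the remaining low-order coefficients, which is what makes them reusable across rows that share that pattern.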
