Traditional advanced encryption standard (AES) implementations based on four 1 KB lookup tables (4-T) achieve high encryption performance but are vulnerable to access-driven cache attacks. In this paper, we present an AES implementation based on a single 512 B lookup table with an optimised structure, named 1-T, to improve resistance to access-driven cache attacks. Furthermore, we optimise the implementation of the round function of 1-T to mitigate the speed penalty caused by the smaller lookup table. The experimental results show that the attack resistance of 1-T is much higher than that of 4-T under the same cache setting; the encryption time of 1-T is 43.5% and 106.3% higher than that of 4-T on the ARM and x86 platforms respectively, while its storage overhead is only 28% of that of 4-T.
Introduction
Advanced encryption standard (AES) (Daemen and Rijmen, 1998) is a widely used symmetric block cipher in the information security field. Although AES offers high security and flexibility, its numerous GF($2^8$) multiplications make it hard to use in systems with strict real-time requirements or constrained resources. By replacing some high-complexity operations with table lookups, software implementations based on lookup tables (LUTs) can achieve very high encryption speed. Gladman (2002) proposed an implementation based on four LUTs (4-T), which replaces the round function with 16 LUT accesses and 16 XOR operations. This method significantly reduces the time consumption of the round function; it is the fastest non-parallelised software implementation and has been adopted by many security systems such as OpenSSL (Viega et al., 2002). Moreover, LUT-based AES software implementations are more widely applicable than those based on hardware accelerators (Rahimunnisa et al., 2014; Abdellatif et al., 2014; Chang et al., 2013; Swankoski et al., 2005) or on instruction set extensions (Rott, 2012; Lee and Chen, 2010; Yumbul et al., 2014) because they are hardware-independent.
However, LUT-based AES software implementations often run on processors with a cache. A cache access has two possible outcomes, HIT and MISS, which take different amounts of time and can therefore be used to infer the accessed address. A cache attack (Osvik et al., 2006) exploits this difference in access time to analyse which LUT indices are accessed during the encryption procedure, and then recovers the key easily.
To counter the threat of cache attacks, many cache-attack-resistant AES designs have been proposed, all based on the idea of eliminating the timing differences of cache accesses. Disabling the cache is a direct way to eliminate the difference, but it increases memory access time and decreases encryption performance. Avoiding the use of LUTs (Bertoni et al., 2003; Atasu et al., 2004; Käsper and Schwabe, 2009) eliminates the cache access timing difference effectively, but the encryption performance is much worse than that of the LUT-based methods. Pre-loading the LUT into the cache or into specific registers before access avoids cache misses, but this requires extra hardware and additional time for pre-loading. Another approach is to add random loops or sleeps to the encryption, which disturbs the attacker's timing analysis, but this method also seriously decreases encryption performance (Jayasinghe et al., 2014). Cache dividing prevents pollution of the cache blocks that store the LUT and fixes the access time, but it requires extra hardware and lacks applicability. Since the pipeline blocking caused by a cache miss is due to data dependency, Jayasinghe et al. (2014) eliminate the pipeline blocking by inserting enough data-independent instructions between the data-dependent instructions. This method fixes the access time for both cache HIT and MISS; however, it treats every access as a cache miss and seriously degrades encryption performance, especially when a cache MISS takes more than ten times longer than a cache HIT.
Cache attacks on AES can be divided into three types: time-driven (Bernstein, 2005; Bonneau and Mironov, 2006; Acıiçmez et al., 2007; Wang et al., 2012), trace-driven (Bertoni et al., 2005; Acıiçmez and Koç, 2006; Gallais et al., 2011) and access-driven (Osvik et al., 2006; Tromer et al., 2010; Zhao et al., 2011). The access-driven attack is the fastest and most effective of the three.
Besides cache attacks, embedded systems such as wireless sensor networks (WSNs) face the problem of how to realise high-performance AES on resource-constrained platforms (Zhu et al., 2015), especially TDMA-based WSNs (Wang et al., 2013), which have real-time requirements. This paper proposes a novel single-LUT implementation of AES, named 1-T. Compared with 4-T, 1-T is based on a 512 B LUT with a novel structure and has much higher resistance to cache attacks, along with good performance under low storage and power consumption.
The remainder of this paper is organised as follows: Section 2 briefly describes the AES mechanism and the traditional implementation based on four 1 KB LUTs; the access-driven cache attack is described in Section 3; Section 4 describes the novel 512 B LUT and the optimised round function of 1-T; Section 5 verifies the attack resistance and the resource consumption of 1-T; and the conclusion is drawn in Section 6.
AES encryption

AES mechanism
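AES is a block cipher that operates on 128-bit blocks, supports key sizes of 128/192/256 bits and requires N rounds (N = 10/12/14) of encryption. Figure 1 shows the architecture of AES encryption. The encryption operates on the plaintext block expressed as a 4 × 4 state array A. KE generates the round key KR for each round. AK combines A with KR using bit-wise XOR; SB replaces each byte of A with another according to a LUT called the S-box; SR cyclically shifts the last three rows of A by a certain number of steps; MC combines the four bytes in each column of A by GF($2^8$) multiplication with the mixing array. Assume that $A = (a_0, \dots, a_{15})$ is the input state of a round, $KR = (kr_0, \dots, kr_{15})$ is the round key and $B = (b_0, \dots, b_{15})$ is the output state, all stored column by column; then B can be calculated by the round function as formula (1).

$$
\begin{bmatrix}
b_0 & b_4 & b_8 & b_{12} \\
b_1 & b_5 & b_9 & b_{13} \\
b_2 & b_6 & b_{10} & b_{14} \\
b_3 & b_7 & b_{11} & b_{15}
\end{bmatrix}
=
\begin{bmatrix}
2 & 3 & 1 & 1 \\
1 & 2 & 3 & 1 \\
1 & 1 & 2 & 3 \\
3 & 1 & 1 & 2
\end{bmatrix}
\otimes
\begin{bmatrix}
S(a_0) & S(a_4) & S(a_8) & S(a_{12}) \\
S(a_5) & S(a_9) & S(a_{13}) & S(a_1) \\
S(a_{10}) & S(a_{14}) & S(a_2) & S(a_6) \\
S(a_{15}) & S(a_3) & S(a_7) & S(a_{11})
\end{bmatrix}
\oplus
\begin{bmatrix}
kr_0 & kr_4 & kr_8 & kr_{12} \\
kr_1 & kr_5 & kr_9 & kr_{13} \\
kr_2 & kr_6 & kr_{10} & kr_{14} \\
kr_3 & kr_7 & kr_{11} & kr_{15}
\end{bmatrix}
\quad (1)
$$

where S(a) represents the result of SB, that is, the byte looked up for a in the S-box, ⊗ represents GF($2^8$) multiplication, and ⊕ represents XOR.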
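To make the cost of the MC step concrete, the following Python sketch computes one output column from four already substituted and shifted bytes. It is a minimal illustration assuming the standard AES reduction polynomial $x^8 + x^4 + x^3 + x + 1$; the names `xtime`, `gf_mul` and `mix_column` are illustrative and do not come from the paper.

```python
# AES mixing array (MixColumns), one row per output byte.
MIX = [
    [2, 3, 1, 1],
    [1, 2, 3, 1],
    [1, 1, 2, 3],
    [3, 1, 1, 2],
]


def xtime(x: int) -> int:
    """Multiply x by 2 in GF(2^8), reducing by x^8 + x^4 + x^3 + x + 1."""
    x <<= 1
    if x & 0x100:
        x ^= 0x11B
    return x & 0xFF


def gf_mul(c: int, x: int) -> int:
    """Multiply x by c in GF(2^8) using shift-and-add."""
    result = 0
    while c:
        if c & 1:
            result ^= x
        x = xtime(x)
        c >>= 1
    return result


def mix_column(column, round_key_column):
    """Mix one column of S-box outputs and XOR in the round key bytes."""
    out = []
    for row, kr in zip(MIX, round_key_column):
        acc = kr
        for c, s in zip(row, column):
            acc ^= gf_mul(c, s)  # 4 multiplications per output byte
        out.append(acc)
    return out
```

Each column needs 16 multiplications, so a full 4 × 4 state needs 64 per round, which is the count that the LUT-based implementations aim to avoid.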
AES implementation based on LUT
By formula (1), the array multiplication of each round function requires 64 GF($2^8$) multiplications, so the entire encryption requires 64*(N - 1) of them. Moreover, GF($2^8$) multiplication is not supported by most CPUs and requires a software implementation. Our experimental results show that MC contains all the GF($2^8$) multiplications of the entire encryption and consumes 95% of its cycles. Since the mixing array contains only the three numbers 1, 2 and 3, the GF($2^8$) multiplications can be replaced by lookups in three tables that store the products of each value from 0 to 255 with the three numbers respectively, so that the cycles taken by MC can be reduced significantly. According to formula (1), each column of B can be computed as formula (2), shown here for the first column.

$$
\begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \end{bmatrix}
= C_{2113} \otimes S(a_0) \oplus C_{3211} \otimes S(a_5) \oplus C_{1321} \otimes S(a_{10}) \oplus C_{1132} \otimes S(a_{15}) \oplus
\begin{bmatrix} kr_0 \\ kr_1 \\ kr_2 \\ kr_3 \end{bmatrix}
\quad (2)
$$

where $C_{2113}$, $C_{3211}$, $C_{1321}$ and $C_{1132}$ represent the four columns of the mixing array respectively. According to formula (2), the round function is equivalent to multiplying each element of A by some constants and then XORing the results with KR.
Access-driven cache attack
Cache access
A cache is employed to bridge the gap between the high speed of the CPU and the low speed of memory. The size of a cache can be computed as Size = B · W · S, where B is the size of a cache block, W is the number of blocks in each cache set, and S is the number of cache sets. When the CPU accesses a memory datum x, it first looks up the cache set corresponding to x's address <x>, which is calculated as ⌊<x> / B⌋ mod S. If the datum is present (cache HIT), the cache returns it directly; otherwise (cache MISS), the cache reads data block ⌊<x> / B⌋ from memory, loads it into a cache block of set ⌊<x> / B⌋ mod S, and then returns the datum. The number of cycles taken by an access differs between cache HIT and MISS: generally 1 to 2 cycles for a HIT, but tens or even hundreds of cycles for a MISS due to the memory speed.
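The mapping from a table index to a cache set can be sketched in a few lines of Python. The base address, cache geometry and function names below are hypothetical values chosen for illustration only; the sketch simply applies ⌊<x> / B⌋ mod S to every entry of a table.

```python
def cache_set_index(address: int, block_size: int, num_sets: int) -> int:
    """Cache set selected by an address: floor(address / B) mod S."""
    return (address // block_size) % num_sets


def table_entry_sets(base_address, entry_size, num_entries, block_size, num_sets):
    """Cache set of every table entry, in index order."""
    return [
        cache_set_index(base_address + i * entry_size, block_size, num_sets)
        for i in range(num_entries)
    ]


# Hypothetical geometry: 64 B cache blocks, 128 sets, block-aligned tables.
sets_4t = table_entry_sets(0x2000, 4, 256, 64, 128)  # one 1 KB table of 4-T
sets_1t = table_entry_sets(0x2000, 2, 256, 64, 128)  # the 512 B table of 1-T
```

With these assumed parameters a 1 KB table spans 16 cache blocks with 16 entries each, while a 512 B table spans 8 blocks with 32 entries each, so observing which block was touched reveals fewer bits of the index for the smaller table.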
Attack mechanism
According to the AES encryption procedure and formula (2), the first round encryption needs to access the LUTs for each element $a_i \oplus k_i$ of A ⊕ Key, where $a_i$ and $k_i$ are the elements of A and Key, respectively. Thus, the accessed indices of the table leak information about $a_i \oplus k_i$. Since $a_i$ is known, $k_i$ can be obtained through a simple XOR operation. Based on this observation, the cache attack on the first round encryption of the 4-T method is as follows: (4), thus the LUT accesses for the four elements in the second round encryption leak information about $k_i$. Osvik et al. (2006) attack the second round encryption to determine the low $\log_2 \delta$ bits of each octet of the key by eliminating the candidates in $\hat{k}_i$ whose corresponding accesses fall in $\{\neg \mathrm{Block}_i\}$ according to formula (4). After a certain number of tests with various plaintext samples, the low $\log_2 \delta$ bits of each octet of the key can be determined.
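The elimination idea behind the first round attack can be illustrated with a small simulation. This is a simplified sketch, not the attack procedure of Osvik et al. (2006): it assumes a single table shared by all 16 state bytes, treats δ as the number of table entries per cache block (`entries_per_block`), and models accesses from later rounds as uniformly random indices (`extra_accesses`). All names are hypothetical.

```python
import random


def first_round_attack(key, entries_per_block, num_samples, extra_accesses=0, seed=0):
    """Eliminate key-byte candidates using cache blocks that were NOT accessed."""
    rng = random.Random(seed)
    candidates = [set(range(256)) for _ in key]
    for _ in range(num_samples):
        plaintext = [rng.randrange(256) for _ in key]
        # Victim side: blocks touched by the first-round indices a_i XOR k_i.
        accessed = {(a ^ k) // entries_per_block for a, k in zip(plaintext, key)}
        # Assumed noise: later rounds modelled as uniformly random accesses.
        accessed |= {
            rng.randrange(256) // entries_per_block for _ in range(extra_accesses)
        }
        # Attacker side: a candidate that would have touched an unaccessed
        # block cannot be the real key byte.
        for i, a in enumerate(plaintext):
            candidates[i] = {
                k for k in candidates[i] if (a ^ k) // entries_per_block in accessed
            }
    return candidates


KEY = [0x2B, 0x7E, 0x15, 0x16, 0x28, 0xAE, 0xD2, 0xA6,
       0xAB, 0xF7, 0x15, 0x88, 0x09, 0xCF, 0x4F, 0x3C]
remaining = first_round_attack(KEY, entries_per_block=16, num_samples=200)
```

In this model the correct key byte is never eliminated, and candidates that share its high bits always map to the same block, so at most the high $\log_2(256/\delta)$ bits of each octet can be recovered from the first round.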
Attack complexity
For a key attack, the more samples and tests are needed, the longer the attack takes and the more secure the AES implementation is. Assuming that the LUTs are independent, the proportion of $\neg \mathrm{Block}_i$ appearing in an encryption can be expressed as follows:
where m is the number of LUT accesses in each round of encryption. Since N corresponds to the key length and is fixed, thus
According to formula (6), $N_{\hat{k}_i}$ approaches 1 (the correct $k_i$ is obtained) as $N_s$ increases. Thus, the complexity of the cache attack can be measured by $N_s$, the number of samples required to reduce the number of candidates for each octet of the key to 1.
According to formula (7), N s is inversely proportional to log(1 -
, that is, the greater δ and m are, the more samples are required to determine the key. Since δ and m are related to the cache block size B and the structure of the LUT respectively, for a given cache setting, a LUT structure optimised to increase δ and m increases the cache attack complexity and thus improves the security of the AES implementation.
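The qualitative dependence on δ and m can be explored with a back-of-the-envelope model. The sketch below is an assumption made for illustration and may differ from formulas (5) to (7): it supposes that every LUT access hits a uniformly random entry, so that a given cache block stays untouched during an encryption with probability $(1 - \delta/256)^{mN}$, and it estimates the number of samples needed until fewer than `target` wrong candidates are expected to survive for one key octet. All function and parameter names are hypothetical.

```python
import math


def p_block_unaccessed(entries_per_block, accesses_per_round, rounds, table_entries=256):
    """Probability that one given cache block is never touched in an encryption,
    assuming independent, uniformly distributed table accesses."""
    hit_probability = entries_per_block / table_entries
    return (1.0 - hit_probability) ** (accesses_per_round * rounds)


def samples_needed(entries_per_block, accesses_per_round, rounds=10,
                   target=0.5, table_entries=256):
    """Samples after which the expected number of surviving wrong candidates
    for one key octet falls below `target` (under the model above)."""
    p = p_block_unaccessed(entries_per_block, accesses_per_round, rounds, table_entries)
    if p == 0.0:
        return math.inf  # every block is always accessed: nothing can be eliminated
    wrong = table_entries - entries_per_block
    # A wrong candidate survives one sample with probability 1 - p.
    return math.ceil(math.log(target / wrong) / math.log1p(-p))


# Assumed parameters for a 64 B cache block:
# 4-T: 4-byte entries -> 16 entries per block, 4 accesses per table per round.
# 1-T: 2-byte entries -> 32 entries per block, 16 accesses to the table per round.
n_4t = samples_needed(entries_per_block=16, accesses_per_round=4)
n_1t = samples_needed(entries_per_block=32, accesses_per_round=16)
```

Under this model the estimate grows with both δ and m, in line with the trend described above; the absolute numbers depend entirely on the modelling assumptions and should not be read as the measured results of this paper.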
1-T mechanism
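This section first optimises the structure of the 1*512 B LUT. With $S_1(a_i)$ and $S_2(a_i)$ taken from the two bytes of the LUT item $S_{12}(a_i)$, and

$$
S_3(a_i) = S_1(a_i) \oplus S_2(a_i), \quad i = 0, 5, 10, 15
$$

the first column is computed as

$$
\begin{bmatrix} b_3 \\ b_2 \\ b_1 \\ b_0 \end{bmatrix}
=
\begin{bmatrix}
S_3(a_0) \oplus S_1(a_5) \oplus S_1(a_{10}) \oplus S_2(a_{15}) \\
S_1(a_0) \oplus S_1(a_5) \oplus S_2(a_{10}) \oplus S_3(a_{15}) \\
S_1(a_0) \oplus S_2(a_5) \oplus S_3(a_{10}) \oplus S_1(a_{15}) \\
S_2(a_0) \oplus S_3(a_5) \oplus S_1(a_{10}) \oplus S_1(a_{15})
\end{bmatrix}
\oplus
\begin{bmatrix} kr_3 \\ kr_2 \\ kr_1 \\ kr_0 \end{bmatrix}
$$

where $S_{12}(a_i)$ is the half-word item in the 1*512 B LUT. The above formulas show only the computation of the first column $[b_3\ b_2\ b_1\ b_0]^T$; the other three columns can be calculated in a similar way. Considering shift operations only, this method requires 64 more shift operations than 4-T, which adds considerable computing overhead.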
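As a sanity check of the half-word idea, the following Python sketch builds a 256-item table of 16-bit items (512 B in total) and recomputes one output column from it. It is only an illustration under stated assumptions: the packing ($S_1$ in the high byte, $S_2$ in the low byte) and the names `build_sbox`, `T12` and `column_from_halfwords` are hypothetical, and the S-box is regenerated from its standard definition rather than taken from the paper.

```python
def xtime(x: int) -> int:
    """Multiply x by 2 in GF(2^8), reducing by x^8 + x^4 + x^3 + x + 1."""
    x <<= 1
    if x & 0x100:
        x ^= 0x11B
    return x & 0xFF


def build_sbox():
    """Generate the AES S-box: field inverse followed by the affine transform."""
    exp, log = [0] * 255, [0] * 256
    x = 1
    for i in range(255):
        exp[i], log[x] = x, i
        x ^= xtime(x)  # multiply by the generator 3
    sbox = []
    for a in range(256):
        inv = 0 if a == 0 else exp[(255 - log[a]) % 255]
        s = inv
        for shift in range(1, 5):
            s ^= ((inv << shift) | (inv >> (8 - shift))) & 0xFF
        sbox.append(s ^ 0x63)
    return sbox


SBOX = build_sbox()
# One half-word per input byte: S1 = 1*S(a) in the high byte, S2 = 2*S(a) in
# the low byte (assumed packing). 256 items * 2 bytes = 512 bytes.
T12 = [(s << 8) | xtime(s) for s in SBOX]


def column_from_halfwords(a0, a5, a10, a15, kr):
    """First output column [b0, b1, b2, b3] using only lookups in T12."""
    parts = []
    for a in (a0, a5, a10, a15):
        item = T12[a]
        s1, s2 = item >> 8, item & 0xFF  # split the half-word into two bytes
        parts.append((s1, s2, s1 ^ s2))  # S3 is generated as S1 XOR S2
    p0, p5, p10, p15 = parts
    b0 = p0[1] ^ p5[2] ^ p10[0] ^ p15[0] ^ kr[0]
    b1 = p0[0] ^ p5[1] ^ p10[2] ^ p15[0] ^ kr[1]
    b2 = p0[0] ^ p5[0] ^ p10[1] ^ p15[2] ^ kr[2]
    b3 = p0[2] ^ p5[0] ^ p10[0] ^ p15[1] ^ kr[3]
    return [b0, b1, b2, b3]
```

The split and recombination steps in `column_from_halfwords` correspond to the extra shift and XOR work that the smaller table trades for its reduced storage.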
In the 1*512 B table, there are two possible arrangements of the items: $S_1(a)S_2(a)$ and $S_2(a)S_1(a)$. Whichever arrangement is applied, each byte a requires one table access in a round. In order to generate the four words $S_{2113}(a)$, $S_{3211}(a)$, $S_{1321}(a)$ and $S_{1132}(a)$ of 4-T, we need to access the table to get $S_1(a)S_2(a)$ or $S_2(a)S_1(a)$, split it into $S_1(a)$ and $S_2(a)$, and calculate $S_3(a)$. Replacing $S_3(a)$ with $S_1(a) \oplus S_2(a)$, we find that the four words contain only three adjacent occurrences of $S_1(a)S_2(a)$, as shown by the shaded ellipses in Figure 2, but six adjacent occurrences of $S_2(a)S_1(a)$, as shown by the shaded ellipses in Figure 3. For both half-words $S_1(a)S_2(a)$ and $S_2(a)S_1(a)$, only one shift operation is needed to move the half-word into a word. That is to say, applying the arrangement $S_2(a)S_1(a)$ in the 1*512 B table saves more cycles when combining the word results. Thus, we use $S_2(a)S_1(a)$ as $S_{21}(a)$ and create the LUT $T_{21}$ to store $S_{21}(a)$. The optimised round function of 1-T can be expressed as formula (12).
From Table 2, we learn that the round function of the 4-T method contains only LUT and Add operations since it stores the word results. 1-T needs the additional operations Gen-S3, Gen-Byte and Gen-Word to construct the words of 4-T; the total numbers of operations of the 1-T round function before and after optimisation are 6.5 times and four times that of 4-T, respectively.
Verification
Security performance of defending cache attack
Figure 4 shows the procedure of the access-driven cache attack. The cache access information in a real attack can be collected by the methods described in Osvik et al. (2006) and Tromer et al. (2010). Since this verification focuses on the impact of the LUT size, the cache access information is generated by directly mapping the indices of each lookup table accessed during the AES encryption to the cache sets.
Assume that the cache is big enough to store the whole LUT and that no cache pollution happens during the encryption. Figures 5(a) to 5(d) compare the sample counts with the eliminated candidate counts during the first round attack on 1-T and 4-T under different cache block sizes. The curves of the real tests (named test) are close to those computed by formula (7) (named theoretic), and grow in steps. This is because one or several $\neg \mathrm{Block}_i$ can be found after a certain number of sample tests, and each $\neg \mathrm{Block}_i$ eliminates n*δ candidates. The results show that both the test and the theoretical performance of 1-T are better than those of 4-T; namely, more sample tests are required to eliminate the same number of candidates for 1-T than for 4-T. In addition, it can be observed that the test curves stop rising after a certain count of eliminated candidates (128*16 - 16 · δ). This is because the first round attack can determine at most the high $\log_2(256/\delta)$ bits of each octet of the key, leaving δ candidates for each octet.
In particular, Figure 6 shows the comparison of the second round attack on 4-T and 1-T for different cache block sizes. From the figure, we learn that the sample counts computed by formula (7) (Theoretic) are ten times greater than the real test values (Test) for both 4-T and 1-T. This is because formula (7) considers only the first round attack, which eliminates the candidates of the key octet by octet, whereas the second round attack eliminates the candidates of the 16-octet key one by one and does not fit formula (7). In addition, to speed up the simulation, the samples of the first round attack that caused $\neg \mathrm{Block}_i$ to appear are reused in the second round. The results show that the sample counts for attacking 1-T are at least 100 times greater than those for attacking 4-T at the various cache block sizes.
Storage consumption
The storage consumption of 1-T is tested on an ARM platform with the ARM Cortex-M3 processor STM32F103RE working at a frequency of 72 MHz, an MCU commonly used in resource-constrained WSN applications. Four AES implementations are compared: direct GF computing without LUT (GF), hardware acceleration using the AT86RF231 (HW), and the two LUT-based methods (4-T and 1-T). As shown in Figure 7, the code storage and the data storage of HW are the smallest, since most of the computation of HW is performed in hardware and most of its code comes from the SPI communication between the MCU and the accelerator. The code storage of the four methods differs little; that of 1-T is the highest due to its complex round function, but it is only 29.3% (168 bytes) higher than that of the lowest one, HW. For data storage, 4-T's is the highest due to the four 1 KB LUTs; of the remaining software implementations, GF's is the lowest since it stores only a 256 B S-box; 1-T stores a 512 B LUT, and its data storage is 57.6% higher than GF's and only 15.4% of 4-T's.
Time consumption
Figure 8 shows the time taken by the four implementations to encrypt a plaintext block on the ARM platform and the x86 platform (AMD Phenom II X4 B40, 3 GHz). On the ARM platform, although HW performs AES in hardware, its computing time is more than double that of 4-T and 1-T, due to the extra time taken by the SPI communication between the radio chip and the CPU. Since GF computes the numerous and complex GF($2^8$) multiplications directly, it takes much more time than the other three. 4-T takes the least time thanks to its simple round function; 1-T requires additional operations to extract the byte results and generate the word results, which takes more time. Compared with 4-T, 1-T adds only 43.5% to the time consumption on the ARM platform and 106.25% on the x86 platform.
The difference between the two platforms arises because memory access instructions take different numbers of cycles on different platforms; in addition, a shift operation on ARM can be appended to another instruction and does not consume any extra cycle. Since the round function of 1-T contains plenty of shift operations, a great number of cycles are saved on the ARM platform.
The ratio of the time consumption between 1-T and 4-T differs from the ratio of the operation statistics in Table 2, mainly because Table 2 considers the operations of the round function only, which take 66.74% and 89.47% of the entire encryption time on the ARM platform for 4-T and 1-T respectively. The operations taken to read the state and the round key, to write back the new state and to add the input key, as well as the operations of the last round of encryption, are not considered.
Energy consumption
Energy is a critical resource for battery-powered embedded systems. This part evaluates the energy consumed to encrypt a plaintext block on the ARM platform. 1-T, 4-T and GF perform all the operations on the CPU, and their energy can be computed by formula (13).

$$
E = U \cdot I_{CPU} \cdot t_{CPU} \quad (13)
$$

where $I_{CPU}$ is the current when only the CPU is working; U is the supply voltage; $t_{CPU}$ is the time taken. The energy consumption of HW includes the CPU part and the radio chip part, and can be divided into the energy consumed when data is transmitted between the CPU and the radio through SPI, and the energy consumed when the AES hardware is working and the CPU is waiting. The energy can be computed by formula (14).

$$
E_{HW} = U \cdot \left( I_{CPU+SPI+Radio} \cdot t_{CPU+SPI+Radio} + I_{CPU+Radio} \cdot t_{CPU+Radio} \right) \quad (14)
$$

where $I_{CPU+SPI+Radio}$ and $t_{CPU+SPI+Radio}$ are the current and time when the CPU, SPI and radio chip are all working; $I_{CPU+Radio}$ and $t_{CPU+Radio}$ are the current and time when the CPU and radio chip are both working. The supply voltage is 3.3 V and the currents of the different working modes are shown in Table 3. The computed energy is shown in Figure 9. According to Figure 9, GF consumes much more energy than the other three owing to the time-consuming GF($2^8$) multiplications; 4-T consumes the least energy owing to the shortest encryption time; HW consumes the second most energy; 1-T's energy consumption is proportional to its time consumption and is 43.65% higher than 4-T's but only 39.14% of HW's.
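The energy bookkeeping can be sketched as follows, reading formulas (13) and (14) as voltage × current × time summed over the working modes, which is an assumption about their exact form. Only the 3.3 V supply voltage is taken from the text; the currents and times are hypothetical placeholders, not the values of Table 3, and the function names are illustrative.

```python
def energy_cpu(u_volts: float, i_cpu_amps: float, t_cpu_seconds: float) -> float:
    """Energy in joules when only the CPU is working."""
    return u_volts * i_cpu_amps * t_cpu_seconds


def energy_hw(u_volts, i_transfer_amps, t_transfer_seconds, i_wait_amps, t_wait_seconds):
    """Energy in joules for the hardware path: SPI transfer phase plus the
    phase in which the AES hardware works and the CPU waits."""
    return u_volts * (
        i_transfer_amps * t_transfer_seconds + i_wait_amps * t_wait_seconds
    )


U = 3.3           # supply voltage from the text, in volts
I_CPU = 0.030     # hypothetical CPU current, in amperes
T_4T = 100e-6     # hypothetical 4-T encryption time, in seconds
T_1T = 143.5e-6   # hypothetical 1-T time, 43.5% longer than T_4T

ratio = energy_cpu(U, I_CPU, T_1T) / energy_cpu(U, I_CPU, T_4T)
```

Because the software implementations share the same voltage and CPU current, their energy ratio equals their time ratio in this reading, which is why the energy overhead of 1-T over 4-T tracks its time overhead.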
Conclusions
To address the access-driven cache attack problem of traditional AES implementations based on four LUTs, this paper designs a novel 512 B LUT with high resistance to access-driven cache attacks, and then proposes 1-T, a fast AES implementation with low storage consumption based on this 512 B LUT. On platforms with a cache, the attack resistance of 1-T is over 100 times higher than that of 4-T and increases with the cache block size. On resource-constrained platforms without a cache, 1-T can be used as an AES implementation in embedded systems with a good trade-off between storage, power consumption and performance. The experimental results show that the storage consumption of 1-T is only 28.0% of that of 4-T and 2.5 times that of HW; the time and energy consumption of 1-T are both 43.5% higher than those of 4-T, but just 45.1% and 39.1% of those of HW respectively.
