The necessity to secure the communication between hardware components in embedded systems becomes increasingly important with regard to the secrecy of data and particularly its commercial use. We suggest a low-cost (i.e. small logic-area) solution for flexible security levels and short key lifetimes. The basis is an approach for symmetric key exchange using the synchronisation of Tree Parity Machines. Fast successive key generation enables a key exchange within a few milliseconds, given realistic communication channels with a limited bandwidth. For demonstration we evaluate characteristics of a standard-cell ASIC design realisation as IP-core in 0.18µ-technology.
Introduction
For embedded systems like handheld devices, smartcards, mobiles or other wireless communication devices security concepts need to be developed, in order to keep privacy and still (commercially) exploit the merits of such devices in widespread and everyday use [2] . This is, for example, of particular interest for the smartcard-or RFID-industry, where the secrecy of data is directly linked to the commercial prosperity of a product. Also, the economic importance to secure information technology applications in the automotive area is becoming eminent along with the protection of firmware, access control, anti-theft protection, up to scenarios like the hacking of vital vehicle functions such as an antilock braking system (see e.g. [17] ).
Yet, the often relatively small size and severe power consumption constraints of these devices limit the available size for additional cryptographic hardware components [4, 27, 28] . This holds in particular for sensor networks, RFIDsystems and near field communication devices. Secure hardware is thus especially demanded for ubiquitous and pervasive computing, and the need and research efforts manifest in first conferences on security in pervasive computing [26] .
In the following, we introduce the neural network structure and a learning algorithm (section 2), also in order to already point out advantageous properties for a hardware realization. The synchronization effect leading to the key exchange property is explained. Algorithmic security implications on the realization of our architecture are described in section 3. Section 4 comprises the architectural design of the proposed hardware component with its rekeying functionality. Here, we also refer to design decisions prepared in the previous section. In section 5, we present results from an FPGA and an ASIC implementation on silicon area, possible clock and key exchange rates (throughput). We conclude the paper in section 6 with a short summary and an outlook on possible further extensions also referring to current research activities.
Tree Parity Machines for Key Exchange
In [9] , Kinzel et. al. proposed a symmetric key exchange method based on the fast synchronisation of two identically structured Tree Parity Machines (TPMs). The particular tree structure has non-overlapping binary inputs, discrete weights and a single binary output as depicted in Fig. 1a . Studying interacting neural networks in general (cf. [14, 11, 3] ), the authors focused the phenomenom of fast synchronization by mutual learning TPMs and its potential for a cryptographic approach, not involving large numbers and methods from number theory. Their exchange protocol is realized implicitly by a mutual adaptation process between the two parties A and B, not involving large numbers and methods from number theory.
Structure of a Tree Parity Machine
In the following, we describe the implemented parallel-weights version using hebbian learning (cf. [9] ). Weights are identical in both TPMs after synchronisation. The anti-parallel-weights version, using anti-hebbian learning and leading to inverted weights at the other party, is omitted for brevity. The notation A/B denotes equivalent operations for the parties A and B. A single A or B denotes an operation which is specific to one of the parties.
The TPM (see Fig. 1a ) consists of K hidden units (1 ≤ k ≤ K) in a single hidden-layer with non-overlapping inputs (the tree structure) and a single unit in the output-layer.
Each hidden unit receives different N inputs (1 ≤ j ≤ N ), leading to an input field of size K · N . The vector-components are random variables with zero mean and unit variance. The output O A/B (t) ∈ {−1, 1}, given bounded weights w
(from input unit j to hidden unit k) and common random inputs x kj (t) ∈ {−1, 1}, is calculated by a parity function of the signs of summations:
The common random inputs can also be kept secret between the parties, yielding authentication (see Section 2.2). σ is a party-specific modified sign-function, that defines an agreement between the two parties on an opposite sign in case of a sum α
From the communicated output, the outputs of the hidden units cannot be uniquely determined. There are multiple combinations for a signed or unsigned output, depending on the number of hidden units K.
Key Exchange by Mutual Learning and Synchronisation
The so-called bit package variant was chosen for implementation (cf. [9] ). Due to an reduction of (physical) output exchanges by an order of magnitude, it is advantageous for practical communication channels with a certain protocol overhead. Parties A and B start with an individual randomly generated initial weight vector w A hebbian learning rule is applied to adapt the weights, using the b outputs and b sequences of hidden states. They are changed only on an agreement on the parties' outputs. Furthermore, only weights of those hidden units are changed, that agree with this output:
, otherwise.
Updated weights are bound to stay in the maximum range [−L, L] ⊆ Z by reflection onto the boundary values
In iterating the above procedure, each component of the weight vectors performs a random walk with reflecting boundaries. This implies a trajectory in a weight space of (2L + 1) KN points. Two corresponding components in w A kj (t) and w B kj (t) receive the same random component of the common input vector x kj (t). After each bounding operation (Eq. 4), the distance between the components is successively reduced to zero. Synchrony is achieved when both parties have learned to produce each others outputs. They remain synchronised (see learning rule Eq. (3)) and continue to produce the same outputs on every commonly given input. This effect in particular leads to common weight-vectors in both TPMs in each of the following iterations. These weights have never been communicated between the two parties and can be used as a common time-dependent key for encryption and decryption respectively. Such secret key agreement based on interaction over a public insecure channel is also discussed under information theoretic aspects by Maurer [13] , especially with regard to unconditional security. Furthermore, synchrony is achieved only for common inputs. Thus, keeping the common inputs secret between A and B can be used to have an authenticated key exchange. There are 2 KN − 1 possible inputs in each iteration, yielding as many possible initialisations for a pseudo random number generator. Shamir et al. conferred to such a synchronization over multiple rounds as a gradual type of Diffie-Hellman key exchange [12] , because DiffieHellman has a single round that transmits several bits. Obviously, a test for synchrony cannot practically be defined by checking whether weights in both nets have become identical. One therefore tests on successive equal outputs in a sufficiently large number of iterations t min , such that equal outputs by chance are excluded:
The number of outputs (bits) required to achieve synchronisation is lower than the size of the key [16] . Synchronisation time is finite for discrete weights. It is almost independent on N and scales with ln N for very large N . Furthermore, it is proportional to L 2 [16] . Our investigations/experiments confirmed that the average synchronisation time is distributed and peaked around 400 for the parameters given in [9] .
Security and Rekeying Functionality
The symmetric key-exchange protocol can generate long keys by fast calculations and building the secure channel is of linear complexity. It scales with the size K · N of the TPM structure [9] , which defines the size K · N · L of the key. In order to still allow comparisons with the literature we chose L = 4 for our implementation. The time to synchronise roughly doubles in comparison to L = 3, while for the attacker the same time increases by orders of magnitude (see [21] ).
The security of the key exchange manifests in algorithm-specific properties and can be fully exploited by appropriate hardware design. Next to other general (algorithmic) aspects on the security of the exchange method as described in [9, 12] , the tracking of the weights is hard in comparison to synchronisationpractically even harder when implemented in hardware.
The key exchange protocol has been attacked by several eavesdropping approaches, which always require full knowledge of the TPM structure in use. We will describe their basic properties in order to clarify the security aspect. Due to the nature of the key exchange and its attacks, only probabilistic definitions of a 'successful attack' can be provided. One can distinguish between two classes of attacks. The first class comprises attacks, that can be defeated by appropriately increasing the parameter L. The consequence is, that the learning time of an attacker is significantly longer than the synchronisation time. The security increases proportional to L 2 while the probability of a successful attack decreases exponentially with L [16] . Among these so defeatable attacks, which try to synchronise faster than the two parties [8] , are the Naive Attack, that uses a single or an ensemble of several identically structured TPMs. The Genetic Attack even comprises a population of thousands of TPMs, whose internal representations are optimised by a genetic algorithm [12] . A successful attack is defined here as synchronising faster than the parties A and B and could be realized for K = 2 in 50% of all cases. But, already for L = 3, this attack has shown to be less effective than the Flipping Attack. The complexity of such attacks (especially in hardware), with hundreds or thousands of TPMs plus an additional (genetic) algorithm, is obviously high. The Flipping Attack defines a successful attack as having 98% overlap with the weights of one party, when parties A and B are already synchronous. For L > 2, an ensemble of 10000 Flipping Attackers was found less effective than a single attacker, which revokes its practical use [8] .
All of the previously sketched attacks can be made arbitrarily costly and thus practically defeated by increasing L, which significantly decreases the probability of a successful attack. The approach thus remains computationally secure for sufficiently large L [21, 25] .
The only attack, which does not seem to be affected by an increase of L (but still by an increase of K) is the socalled Majority Flipping Attack. It uses a hundred of coordinated and communicating TPMs [25] . Yet, the given definition of a successful attack is problematic: When A and B have fully synchronised, the attacker has 98% average overlap (i.e. a fraction) with the weights, in 50% of all cases. For 99% average overlap, the probability reduces to 25%. This indicates that the difficulty lies in achieving the last percents. The authors chose this definition, because of the strong fluctuations they observed in the success probability. But the definition of overlap is an average overlap over all hidden units. Thus an attacker does not know, which of the K · N components of the weights (the key) are correct in a real attack scenario. In currently used symmetric encryption algorithms, the flipping of a single bit only already leads to a complete failure in decryption. Thus practically, one still has to perform a subsequent brute-force attack, which would then only be successful in 50% of all cases. Keep in mind, that A and B already have one key (while the attacker has 98% with a probability of 0.5) and start encryption and data transmission. The attacker needs to perform his brute force attack plus the attack on the encrypted data in parallel. Furthermore, the rekeying principle and the achievable short key lifetimes (cf. Section 3.1) aim at an online usage of the exchanged keys for secure transmission.
All formulated attacks can hardly be performed online. Only an offline attack on the previously recorded exchanged information seems realistic. Last but not least, note that all of the existing attacks are based on knowing the common inputs and thus refer to a non-authenticated key exchange, in which man-in-the-middle attacks are possible as well.
Feasible Immediate Rekeying
We propose to minimise the key lifetime as much as possible employing immediate rekeying, that allows to exploit the speed of key exchange and features of our hardware component. Such a rekeying process normally is to be avoided due to the computational cost of a new key exchange. Strategies are developed to increase the key lifetime without affecting the security (see e.g. [1] ). Yet, using the TPM principle allows for efficient rekeying in the kHz-range (see also Section 5).
Next to several other propositions (cf. [9] ) concerning the en-/decryption, one particular proposition (cf. [11] ) is to take each (common) weight vector after synchronisation for en-/decryption. On the one hand, a new potential key is present in each step, which can then be used block-wise. On the other hand, an opponent then also has the chance to synchronize using the ongoing communication (cf. Section 2.2) and get a key.
We suggest to permanently generate (i.e. synchronise) new keys in parallel or multiplexed to the encryption-transmission-decryption of data, using the most recent key exchanged. In this case, the key is only used to encrypt a certain small subset of the plaintext. As soon as a new key has been exchanged, it is used for encryption. This especially allows to realize short key lifetimes, enabling a certain security level by many smaller keys instead of one large key.
Consequently, in our hardware design, we allow an external unit to demand a key exchange service. Our TPMRAs will continuously synchronise new keys, as long as data needs to be exchanged. Once a crypt-unit uses the first key, synchronisation is triggered again to always provide a next key. In this way, the related hardware resources are consequently used and keys are exchanged at a maximum rate subject to hardware constraints, average synchronisation time and available channel bandwidth. Furthermore, it allows to implement services like periodic or even adaptive rekeying. Security is thus increased in our hardware implementation through feasible immediate rekeying, the mere speed of key exchange and the achievable short key lifetimes.
Tree Parity Machine Rekeying Architectures
It is important to note that, with respect to a hardware implementation, only signs and bounded integers are processed within the algorithm. The result of the outer product in Eq. (1) can be realized without multiplication. The product within the sum is only changing the sign of the weight. Thus, the most complex structure to be implemented is an adder. The complexity of such a unit is thus even less than the complexity of a linear filter, which requires a full multiplyaccumulate structure. Yet, the inherent parallelism can be exploited here as well. The branches in Eq. (2) are only based on a test for the sign or a test on equality to zero, also easily done in hardware.
Furthermore, only sign-operations and additions are present in the learning rule (Eq. (3), well suited for a hardware implementation. The bit package exchange can either be realized serially or via a parallel bus, depending on the users requirements and the intended application. The amount of registers needed for storage increases in the bit package variant, finally imposing a tradeoff area vs. speed.
Equal (pseudo-)random inputs are realized by equally initialised Linear Feedback Shift Registers (LFSR) or a Cyclic Redundancy Code (CRC). Different (secret) initial weights can either be fixed (device-specific), or they can be provided by an additional application-specific device or by a thermal noise device. The synchronisation criterion (Eq. (5)) basically comprises a counter.
The proposed Tree Parity Machine Rekeying Architectures (TPMRAs) are functionally separated into two main structures. One structure essentially comprises the Key Handshake and Bit Package Control. The other structure contains the TPM Unit and its control state machine.
Key Handshake and Bit Package Control
As described in Section 2.2, we implemented the bit package generalisation of the protocol (cf. [9] ). The overall structure of the TPMRAs is shown in Fig. 2 . It consists of three functional blocks: a Key Handshake and Bit Package Control, the TPM unit and a Watchdog timer.
The Watchdog timer supervises the number of interactions needed for a keyexchange between two parties (Eq. (5)). If there is no synchronisation within a specific time (remember that the synchronisation time is distributed), a signal (sync error) indicates a synchronisation error. It is programmable for variable average synchronisation times subject to the chosen TPM structure. The Key Handshake and Bit Package Control handles the key transmission with an encryption unit and the bit package exchange process with the other party. It accomplishes the bit packaging by partitioning the parity bits from the TPM unit in tighter bit slices. Due to different computation cycles between two key exchange parties, the rekeying procedure employs a key request (req key), a key changed (key cha) and a key commit (key com) handshake protocol (see Fig. 2 ). A key is handed over via the internal bus (Key) to an encryption unit when the synchronisation process is finished. For our application domain in embedded system environments, we choose a fixed bit package length of 32 bit for physically parallel exchange and synchronisation over a 32 bit wide bus (Bit Package). The bit package exchange process uses a simple request/acknowledge handshake protocol (BP ack, BP req).
Tree Parity Machine Unit
The TPM unit comprises the logic for the TPM structure, such as the logic for calculating the parity bits as explained in Section 2. It consists of the TPM control, a Cyclic Redundancy Code (CRC) generator, a Parity Computation unit and a Weight Adjustment unit. A register bank holds the data for the hidden unit and the weights of the network as shown in Fig. 3 .
The TPM control is realized as simple finite state machine (FSM) which executes the initialisation of the TPM and the learning process with the bit package from the other party. The Parity Computation unit calculates the summation and the parity bit (Eq. (1) and 2) . The weight adjustment unit accomplishes the learning rule (Eq. (3) and (4)).
The CRC random generator generates the pseudo random bits for the inputs of the TPM. It is initialised by a vector which is equal for both parties. For the purpose of authentication, the initial value would have to be kept secret. 
Implementation and Performance
We designed and simulated parameterisable, serial and semi-parallel TPM Rekeying Architectures, using VHDL to implement an FPGA-and an ASIC-realisation. For both architectures we appoint the integer range L to 4, as explained in Section 3. In the serial architecture, the synaptic summation is performed by Time Devision Multiple Access (TDMA) of an L-bit adder, while the semiparallel form uses TDMA of six L-bit adders in parallel. The details of the TPMRA implementations (key length K · N · L, serial or semi/fully-parallel realisation) must be chosen with respect to the target environment, including the used parameters, the timing, the available channel capacity and the available chip-area, of course.
We realized a two party prototype system based on two XCV300E-8 (Virtex E) FPGAs from Xilinx in order to investigate and demonstrate the functionality of our architectures. Standard cell ASIC prototype realisations were built to verify the suitability of the TPMRAs for typical embedded system components. We chose K = 3 and varied N up to 49 for a resulting key size of 3 · 49 · 4 = 588 bit. This choice for N already allows a remarkable key length and still keeps the average synchronisation time low (cf. [9] ). The underlying process was a 0.18µ six-layer CMOS process with 1.8V supply voltage based on the UMC library. The design was synthesised using the Synopsys Design-Compiler and was mapped using Cadence Silicon Ensemble.
The area (Fig. 4a ) of the TPMRA realisations scale approximately linear (around one square-millimetre) due to the linear complexity of the adders. The serial TPM realisation consumes less area (i.e. less hardware resources). Note, that most of the area is consumed by the bit packaging, because of the necessary storage of the inputs for the learning (cf. Section 2.2).
Obviously, the achievable clock speed (Fig. 4b) in the serial variant is significantly higher than in the semi-parallel version. This is due to the necessity of a longer clock tree for the additional registers to store partial results.
Additionally, we established the throughput (i.e. keys per second) subject is log-scaled and shows average key exchange rates for a 588 bit key and a selection of typical channels with their capacities. All data refers to a UMC 0.18 micron six-layer standard cell process.
to the average synchronisation time of 400 iterations for different key lengths in Fig. 4c . We assumed the maximally achievable clock frequency with regard to each key length, which can be achieved by Digital Phase Lock Loop (DPLL), regardless of the systems clock frequency. Furthermore, we appointed the average synchronisation time of 400 iterations for all key lengths, although it is really always less than the size of the key (a worst-case scenario). This data refers to an idealised infinite channel bandwidth, neglecting the transmission delay. For key lengths smaller than approximately 180 bit, the serial TPMRA has a higher throughput (in the range of 2.5 · 10 4 to more than 4 · 10 4 keys per second) due to the higher clock frequency (Fig. 4b) . Beyond this point, the semi-parallel version achieves a higher throughput, exploiting the parallel computation. Figure 4d shows the same information, but for three real communication channels and their bandwidths, given a key length of 588 bit. The chosen logscale allows to see the small difference regarding the throughput (up to around 1000 keys per second) for an I 2 C and CAN -bus. Only for buses of higher bandwidth such as the P CI-bus, the two architectures show a significantly different throughput (reaching the kHz-range). In the case of an 32 bit P CIbus in burst mode, the theoretical maximum throughput (as in Fig. 4c ) can be achieved. We also considered other bus systems (e.g. packet based systems like WLAN). The results are similar, due to their small bandwidth in comparison to the P CI-bus. Obviously, the bottleneck is the underlying communication-bus, as it is also typical in other domains (processor-bus-bottleneck). Given a highspeed communication channel, the proposed key exchange and rekeying in the kHz-range allows us to use rather weak encryption algorithms (cf. Section 3), as the security may rely on fast rekeying. Of course, any other more sophisticated encryption algorithm like AES or 3-DES can also be used.
The achievable average key-exchange rates of the TPMRAs in the kHz-range, allow to increase the security through a feasible frequent key exchange. Short key lifetimes can be realized efficiently. Also, any successful online attack must at least achieve the same performance, requiring significant hardware expenses. This does not appear to be feasable. Using different keys for encryption and transmission of different blocks of data, increases the difficulty for an attack on the encrypted data.
Due to the small area in the range of one square-millimetre, we regard the field of application principally as an IP-core in embedded system environments. A particular focus can be smartcards or transponder-based applications such as RFID-systems and devices in ad-hoc networks [19] , in which a small area for cryptographic components is mandatory.
Summary and Outlook
We presented a solution for secure communication in embedded system environments via Tree Parity Machine Rekeying Architectures. Our investigations confirm the results as presented in [9] and stress the advantages of a hardware implementation. The silicon area lies within a square-millimetre and allows to exchange keys of practical size within about a millisecond. The proposed exchange in parallel to encryption-transmission-decryption also allows for efficient rekeying schemes and short key lifetimes.
Next to algorithmic extensions to further increase the security [21, 22, 15, 23] , architectural improvements or variants include a fully serial realisation with TDMA usage of a single TPM unit. This further decreases the area consumption but at the cost of an increase in necessary cycles for one output bit. A stream cipher variant, using output bits directly via the Blum-Blum-Shub bit generator, was suggested already in [9] and its implementation in hardware is particularly suited for streaming applications. The relatively small size of the TPMRAs allows an implementation in embedded systems with only small overhead. They are especially suited for devices of limited resources and even more in moderate security scenarios. Consequently, the integration of our architectures into such a system and its practical evaluation is subject to future work.
