Abstract: Physical Unclonable Functions (PUFs) are circuits designed to extract physical randomness from the underlying circuit. This randomness depends on the manufacturing process and differs for each device, enabling chip-level authentication and key generation [1] applications. We present a protocol utilizing a PUF for secure data transmission. Each party has a PUF used for encryption and decryption; this is facilitated by constraining the PUF to be commutative. This framework is evaluated with a primitive permutation network, a barrel shifter [2]. The barrel shifter (BS) PUF derives its physical randomness from the delay of different shift paths. This delay is entangled with message bits before they are sent across an insecure channel. BS-PUF is implemented using transmission gates; their characteristics ensure same-chip reproducibility, a necessary property of PUFs. Post-layout simulations of a common centroid layout [3] 8-level barrel shifter in 0.13 µm technology assess uniqueness, stability and randomness properties. BS-PUFs pass all selected NIST statistical randomness tests [4]. Stability similar to Ring Oscillator (RO) PUFs under environment variation is shown. Logistic regression on 100,000 plaintext-ciphertext pairs (PCPs) failed to model BS-PUF behavior.
I. INTRODUCTION
Encryption/decryption algorithms form the backbone of modern public key infrastructure which supports a broad set of activities such as e-commerce and digital currency. Mathematical cryptosystems such as RSA [5] can take millions of clock cycles. Even symmetric encryption/decryption through AES takes 10-20 clock cycles. Moreover, even though their security is predicated on a hard mathematical problem such as prime number factoring, a mathematical model exists for an adversary [6] . Physical unclonable functions (PUFs) source physical randomness of a silicon foundry with a potential appeal of unmodelable, physical functions. They have been used to generate unique physical identities, and to seed key generation. Such PUFs offer both inter-chip variability and same-chip reproducibility. The variability ensures that distinct devices produce different outputs given the same input. Reproducibility, on the other hand, is valuable for predictability and determinism in the device authentication behavior. As a result, PUFs based on complex physical systems provide significantly higher physical security over the traditional systems which rely on storing secrets in nonvolatile memory. In addition, special manufacturing processes are not required to produce PUF devices. This advantage makes PUF devices a cost-effective and reliable alternative to mathematical randomness sources.
So far, the use of PUFs in cryptography is somewhat limited, the most common uses being key generation and random number generation. Chen used analog circuits to support cryptography with some elements of PUF-like randomness [7]. Choi et al. deployed a variant of the arbiter PUF to replace symmetric encryption in the RFID domain as an authentication mechanism [8]. This was based on the earlier work of Suh et al. that deployed PUFs for anti-counterfeiting in RFIDs [9]. Che et al. described another authentication protocol based on PUFs [10]. The authors of [11] developed an IoT communication protocol based on PUFs, and the authors of [12] developed a code encryption engine based on PUFs for supporting a secure execution environment similar to AEGIS [13]. The key difference between a processor secure execution environment and general encryption is that in the former scenario the processor platform is both the source and the destination of the communication. In a processor secure execution environment, both the sender and receiver have access to the same physical PUF on the same platform. For general encryption, this assumption is violated: the sender and receiver possess distinct PUFs. We show a general communication protocol based on commutative PUFs. At the end of the protocol, Alice obtains the message m. Message confidentiality is maintained by entangling message bits with physical randomness. The entangling process must be commutative so that the order of $f_{Alice}$ and $f_{Bob}$ can be changed. Decryption of entangled messages requires reversibility. The entangled message must exhibit a nonlinear relationship with the original message m; this makes it hard for an eavesdropper to learn m by examining intermediate messages.
II. COMMUNICATION PROTOCOL
The circuit design and encryption protocol together provide the commutative, invertible, and non-linear properties required of messages. Section III describes a mechanism for BS-PUF-based encryption. The BS circuit design is detailed in Sections IV and V.
III. ENCRYPTION PROTOCOL
Encryption must entangle the physical randomness of BS-PUF with the message. Physical randomness is extracted by measuring the delay of message bits along a shift path. An XOR of the message bits and delay accomplishes entanglement; this allows for commutativity and reversibility.
A. Encrypting Large Messages
A BS-PUF uses an n-bit key as the shift amount. This allows for a $2^n$-bit BS-PUF challenge (message) resulting in a $2^n$-bit BS-PUF response. Alternately, one could view (n-bit key, $2^n$-bit message) as a challenge. We take the former $2^n$-bit challenge view in this paper. For a barrel shifter, practical values for n are limited to the range of 7 to 10 bits, leading to a message block size of 128 to 1024 bits. This means that a method of entanglement/encryption for plaintexts greater than $2^n$ bits is needed. Entanglement could occur by serializing the blocks of plaintext at the BS-PUF input and concatenating the generated ciphertexts. However, this approach reveals patterns in the plaintext; the same plaintext block will always encrypt to the same ciphertext block. This leaks information by allowing an adversary to identify plaintext patterns.
The technique of cipher block chaining (CBC) is typically applied in block ciphers such as AES [14]. Like AES, BS-PUF encrypts a fixed number of plaintext bits. Thus, it can be viewed as a block cipher. A practical barrel-shifter or permutation network implementation could consist of 128-1024 bit blocks. In CBC, each plaintext block $p_i$ is XORed with the previous ciphertext block before encryption, $c_i = f(p_i \oplus c_{i-1})$, where $c_0$ is an initialization vector (IV). This IV must be updated with each message; otherwise the same plaintext will encrypt to the same ciphertext. This would again allow an eavesdropper to identify patterns. Unlike traditional CBC algorithms, the IV for BS-PUF-based encryption does not need to be public because the ciphertext will be sent back to the sender for decryption. It could be generated with any PUF, e.g. SRAM PUFs [15].

Alice hoping to recover the message. Unfortunately, $f^{-1}$ does not subtract delay from the correct bit in (5), (7); the correct message is not received by Alice. This scheme fails to be commutative.
Decryption utilizes BS-PUF's inverse; $p_i$ is recovered by the reverse process. Ciphertext $c_i$ is given to the inverse BS-PUF operation, and the XOR of the output with $c_{i-1}$ is then taken. Thus, decryption of the ith block is $p_i = f^{-1}(c_i) \oplus c_{i-1}$.
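The chaining described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: `toy_f` is a hypothetical invertible stand-in (rotate, then XOR a fixed pad) and not the BS-PUF itself, and the 8-bit block size is chosen only to keep the example readable.

```python
from typing import List

BLOCK_BITS = 8                      # toy size; the text uses 128-1024 bit blocks
MASK = (1 << BLOCK_BITS) - 1

def toy_f(block: int, shift: int, pad: int) -> int:
    """Hypothetical invertible block function: rotate left, then XOR a pad."""
    rotated = ((block << shift) | (block >> (BLOCK_BITS - shift))) & MASK
    return rotated ^ pad

def toy_f_inv(block: int, shift: int, pad: int) -> int:
    """Inverse of toy_f: remove the pad, then rotate right."""
    unpadded = block ^ pad
    return ((unpadded >> shift) | (unpadded << (BLOCK_BITS - shift))) & MASK

def cbc_encrypt(plain: List[int], iv: int, shift: int, pad: int) -> List[int]:
    cipher, prev = [], iv
    for p in plain:
        prev = toy_f(p ^ prev, shift, pad)             # c_i = f(p_i XOR c_{i-1})
        cipher.append(prev)
    return cipher

def cbc_decrypt(cipher: List[int], iv: int, shift: int, pad: int) -> List[int]:
    plain, prev = [], iv
    for c in cipher:
        plain.append(toy_f_inv(c, shift, pad) ^ prev)  # p_i = f^-1(c_i) XOR c_{i-1}
        prev = c
    return plain
```

Encrypting three identical plaintext blocks with this sketch yields three different ciphertext blocks, which is the pattern-hiding property the chaining is meant to provide.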
Message encryption requires a secret key. The key determines the bit shift path; it is used as shift amount. The BS-PUF response depends both on the challenge (plaintext) and the key. The key does not change as frequently as the plaintext does.
Some of the desirable characteristics of BS-PUF are as follows. BS-PUF is fast. Encryption takes multiple rounds with a traditional block cipher. BS-PUF makes only one pass through the shifter or permutation hardware.
B. Single Block Encryption
In this subsection, several permutation schemes are discussed for single block encryption.

Fig. 4. Sharing a key allows both parties to perform the same permutation. This ensures the delay is subtracted from the correct bit when performing the inverse $f^{-1}_{PUF_l}$ for l = 1, 2. Entropy is added into the public message by bit shifting.
1) Asymmetric Key Encryption:
Encrypting without a shared key is ideal.
Section II dictates invertibility and commutativity as communication protocol requirements.
PUF f must be a one-to-one function to achieve encryption and invertibility for decryption. Many classical PUFs, such as RO-PUFs [16], [17], [18], [19] and arbiter PUFs [20], [21], cluster the challenges into equivalence classes on a set of attributes, resulting in the same response per challenge equivalence class. The arbiter PUF uses relative bit arrival time as the clustering attribute. The RO PUF uses relative oscillator frequencies. As a result, these PUFs are not invertible, since the mapping is many-to-one.
Further note that physical invertibility is distinct from logical invertibility. A mathematical one-to-one function has logical invertibility, but may not be physically invertible. Physical invertibility is applicable to the PUF physical attribute measurement process. In the forward computation, inputs traverse the computation paths to the output; physical measurements may take place at various points along these paths. In the inverse computation, output bits travel to the inputs through the identical computation paths in reverse. The physical measurements of the same physical attribute occur in the inverse computation. These forward and inverse physical measurements need to be reproducible at all measurement points from input to output.
Permutation functions provide the necessary one-to-one relationship. Permutations create a non-linear relationship from input bits to output bits. Due to this property, an adversary cannot create a useful mathematical model describing the input-output relationship. For n-bit data, there exist N = n! permutations denoted by $\pi_0, \pi_1, \ldots, \pi_{N-1}$. Each $\pi_i$ captures some permutation $(i_0, i_1, \ldots, i_{n-1})$, where bit $k \to i_k$. In other words, the bit at position 0 is routed to bit position $i_0$ in the output. A key K is used to select this mapping. We call this a keyed PUF.

Fig. 5. Invertible and Commutative PUF protocol: $PUF_1$ ($f_{Bob}$) and $PUF_2$ ($f_{Alice}$) illustrate the PUF composition and how the barrel shifter PUF is used for the encryption and decryption processes. Assume both $PUF_1$ and $PUF_2$ are two-stage BS-PUFs, $key_1$ ($PUF_1$) is (1, 0), and $key_2$ ($PUF_2$) is (0, 1). For $PUF_1$, bit $x_0$ ($x_1$) goes to output bit position $y_1$ ($y_2$). $D(i, i')_m$ is the mth least significant bit of the delay from input bit i to output bit i'. A permutator is added after each PUF to shift each bit back to its original position after encryption.
The PUF response is derived from the shift path delay.
The protocol requires the entanglement procedure to be commutative. Entanglement adds a bit from the delay of each path to the plaintext. Thus, entanglement is expressed as $f(K_{Bob}, P_i) = P_i \oplus D_{Bob}$. This is commutative because $\oplus$ is commutative. Note that the entanglement between the physical delay attribute and logical bits can occur at multiple points during the flight of message bits from input to output; each measurement point is also an entanglement point.
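As a minimal sketch of why XOR-based entanglement is both commutative and reversible, the snippet below treats each party's delay bits as a plain bit vector. The vectors are made-up illustrative values, not measured delays, and `entangle` is a hypothetical helper name.

```python
def entangle(bits, delay_bits):
    """XOR each message bit with the delay bit measured on its path."""
    return [b ^ d for b, d in zip(bits, delay_bits)]

plaintext = [1, 0, 1, 1]
d_bob = [0, 1, 1, 0]      # illustrative delay bits for Bob's PUF
d_alice = [1, 1, 0, 0]    # illustrative delay bits for Alice's PUF

bob_then_alice = entangle(entangle(plaintext, d_bob), d_alice)
alice_then_bob = entangle(entangle(plaintext, d_alice), d_bob)
```

Both orders give the same vector, and applying `entangle` a second time with the same delay bits returns the input, provided the delay bits land on the same message bits both times.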
Our first version of encryption protocol is based on invertible and commutative PUFs. Invertibility requires using a raw physical property like delay. The reversible computation principle states that any information loss makes a process irreversible [22] . Many PUFs derive their response through the comparison of physical properties. Arbiter PUF uses a race between two paths. RO-PUF uses a frequency comparison. These comparisons provide reproducibility by including a wide margin of noise before comparison output changes, but information is lost.
The proposed PUF is based on a barrel shifter. Constructing it with precisely sized transmission gates makes its delay independent of the bit state (0 or 1). Bit propagation delay for the forward path and the inverse path is stable and consistent regardless of bit state. This is due to the symmetric physical structure of the MOSFET's source and drain. As we discuss in the following, physical commutativity and invertibility in our protocol are only achieved if the physical delay on the paths is bit state independent. In Step 5 of Fig. 1, when Bob computes $f^{-1}_{Bob}$, he is dealing with a different bit pattern at the output of his PUF than what was computed there in Step 1. This is because the Step 5 bit pattern has an additional permutation applied to it by Alice, which is not known to Bob. An alternative implementation could have used pass transistors. However, it is hard to equalize the delay for 0 and 1 through a pass transistor. Thus, transmission gates are used to make the delay plaintext independent.
The asymmetric key encryption protocol in Section II is based on invertible and commutative BS-PUFs, which are defined as follows.

Invertible PUF: An invertible keyed PUF f on input x and key K satisfies $f^{-1}(K, f(K, x)) = x$, where $f^{-1}$ is computed on the same PUF in the reverse direction. Note that the PUF function f entangles a logical component and a physical component, and both need to be invertible. PUFs designed to be used directly for encryption need two input sequences: (1) a key for response function selection, as in a permutation selector, and (2) the plaintext to be encrypted.

Commutative PUF: Assume there is a composition of two commutative PUFs, $PUF_1$ and $PUF_2$. This means $PUF_2(PUF_1(x)) = PUF_1(PUF_2(x))$. Note that both logical and physical commutativity are needed for such a commutative PUF. For BS-PUF, the entanglement function must be commutative for physical commutativity, in addition to the physical measurements being the same in $PUF_2(PUF_1(x))$ and $PUF_1(PUF_2(x))$; this requires the physical measurements to be invariant of the bit state. The physical measurements are completely defined by the key K for a given PUF.
2) Protocol Without Permutation: In the first version of the design, each PUF, $f_{PUF1}$ and $f_{PUF2}$, is a permutation network keyed by $key_1$ and $key_2$ respectively. Key $key_1$ selects a permutation $\pi_{key1}$ from a large set of possible permutations; the Keccak permutation [23], [24] could be used, for instance. The implementation, however, needs to be physically and logically reversible, consisting of transmission gates. We assume that for a permutation $\pi_{key1}$ which maps the ith input bit to the i'th output bit and the jth input bit to the j'th output bit, we capture the exact delays for each input-output path. Let $D(i, i')$ denote the delay of the path from input i to output i' for $\pi_{key1}$ in $f_{PUF1}$. Let $D(j, j')$ be defined likewise. We will describe how we can capture these delays by using timer capture and edge detector functions in Section V.
For each PUF, the output bit $y_j$ can be expressed as an entanglement function

$$y_j = e\big(x_{\pi^{-1}_{key}(j)},\; D(\pi^{-1}_{key}(j), j)\big).$$

Here e is an entanglement function between the bit routed to output j, $x_{\pi^{-1}_{key}(j)}$, and the delay of its path, $D(\pi^{-1}_{key}(j), j)$. If all k bits of the delay $D(\pi^{-1}_{key}(j), j)$ are used to do encryption at the jth output bit, we expand the n-bit input to an nk-bit output. Assuming we want to retain the same output resolution of n bits, one option would be to perform an XOR ($\oplus$) of the mth bit of $D(\pi^{-1}_{key}(j), j)$ with the routed bit:

$$y_j = x_{\pi^{-1}_{key}(j)} \oplus D(\pi^{-1}_{key}(j), j)_m.$$

Let us assume that the delays of the permutation function $\pi_{key1}$ in $f_{PUF1}$ are denoted by $D(\pi^{-1}_{key1}(j), j)$ for a path from input $\pi^{-1}_{key1}(j)$ to output j, and the delays of the permutation function $\pi_{key2}$ in $f_{PUF2}$ are denoted by $d(j, \pi_{key2}(j))$ for a path from input j to output $\pi_{key2}(j)$.

The mth least significant bit of $PUF_2$'s delay captured by the d function is XORed with $f_{PUF1}$'s output:

$$f_{PUF2}(f_{PUF1}(x))_{\pi_{key2}(j)} = x_{\pi^{-1}_{key1}(j)} \oplus D(\pi^{-1}_{key1}(j), j)_m \oplus d(j, \pi_{key2}(j))_m.$$

Clearly, the RHS of this expression is commutative due to the commutativity of the operator $\oplus$; it does not matter whether $f_{PUF1}$ or $f_{PUF2}$ is applied first. However, this commutativity statement is only correct for a specific bit routing, and is incorrect for encrypted data.
Consider a 4-bit example. Going over the communication protocol in Fig. 1 step by step, a defect becomes apparent. The complete verification process is shown in Fig. 3.

In the following analysis, permutations are abbreviated according to output positions for simplicity; e.g., $(0 \to 1, 1 \to 2, 2 \to 3, 3 \to 0)$ is abbreviated to (1, 2, 3, 0). Assume $\pi_{PUF1} = (1, 2, 3, 0)$ and $\pi_{PUF2} = (2, 3, 0, 1)$.
• Step 1: ing in
This logical result is correct in routing x i back to the ith bit position, but the physical delay terms are completely mixed up and do not cancel each other.
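The mismatch can be reproduced with a small symbolic sketch. It assumes the four-pass order of Fig. 1 (Bob encrypts, Alice encrypts, Bob decrypts, Alice decrypts) and models each bit as a set of symbols, so that XOR becomes set symmetric difference and a term cancels only when the very same delay is applied twice. The helper names and the tags `D` (Bob) and `d` (Alice) are illustrative, not taken from the paper.

```python
N = 4
PI_BOB = (1, 2, 3, 0)     # pi_PUF1 from the example above
PI_ALICE = (2, 3, 0, 1)   # pi_PUF2 from the example above

def term(name):
    """One symbolic bit; XOR of symbolic bits is set symmetric difference (^)."""
    return frozenset([name])

def forward(x, perm, tag):
    """Route input bit i to output perm[i] and XOR in that path's delay bit."""
    y = [None] * N
    for i, bit in enumerate(x):
        y[perm[i]] = bit ^ term(f"{tag}({i},{perm[i]})")
    return y

def reverse(y, perm, tag):
    """Drive the same paths backwards: output perm[i] returns to input i."""
    return [y[perm[i]] ^ term(f"{tag}({i},{perm[i]})") for i in range(N)]

def exchange_without_permutator():
    x = [term(f"x{i}") for i in range(N)]
    m1 = forward(x, PI_BOB, "D")        # Bob encrypts
    m2 = forward(m1, PI_ALICE, "d")     # Alice encrypts
    m3 = reverse(m2, PI_BOB, "D")       # Bob tries to remove his delay
    return reverse(m3, PI_ALICE, "d")   # Alice tries to remove hers

result = exchange_without_permutator()
```

In this model, `result[0]` still contains `x0`, so the routing is right, but it also carries four delay terms that never met their counterparts.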
3) Protocol With Permutation: In order to ensure the correct routing and commutativity, we modify the original permutation protocol by adding a permutation after each PUF. The primary function of this permutation is routing $x_i$ back to the ith position from position $\pi_{key1}(i)$ before sending the message at the end of Step 1. A key complementary to $key_1$ selects the permutation that restores the original message bit order; this is the only function of this permutation. No delay is added.
An example of this protocol is shown in Fig. 5 with the following detailed description.
Before sending it to Alice, Bob's complementary permutation, called permutator in Fig. 5 is applied to generate
In this new permutation protocol, the logical permutation does not add to the confusion at all, unlike in the AES or Keccak protocols. Confusion is achieved from the permuted physical delay properties of the PUF. Which path delay bits are combined with each input bit is still hidden from the adversary (through confusion) by the key-driven $\pi$.
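Decryption follows a similar process. However, the direction of message transmission is reversed and the inverse permutations are used. This is where physical invertibility helps recover the original forward delay vector in the reverse direction. Thus, the result of applying (1, 2, 3, 0) and (2, 3, 0, 1) to $(x_0, x_1, x_2, x_3)$ is rearranged by Bob's permutator first. This is $(x_3 \oplus D(3,0)_m \oplus d(3,1)_m,\; x_0 \oplus D(0,1)_m \oplus d(0,2)_m,\; x_1 \oplus D(1,2)_m \oplus d(1,3)_m,\; x_2 \oplus D(2,3)_m \oplus d(2,0)_m)$. These bits are given to $PUF_1$ in the reverse direction, which cancels Bob's delay bits and leaves $(x_0 \oplus d(0,2)_m,\; x_1 \oplus d(1,3)_m,\; x_2 \oplus d(2,0)_m,\; x_3 \oplus d(3,1)_m)$. Then $f^{-1}_{Alice}$ is applied. First, Alice's permutator will rotate the bits giving $(x_2 \oplus d(2,0)_m,\; x_3 \oplus d(3,1)_m,\; x_0 \oplus d(0,2)_m,\; x_1 \oplus d(1,3)_m)$. Rotated bits are then given to $PUF_2$ in the reverse direction, resulting in $(x_0, x_1, x_2, x_3)$.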
The delay terms cancel. Alice receives the original message $(x_0, x_1, x_2, x_3)$ sent by Bob. However, every public message in this protocol carries the bits in their original positions, so XOR-based relations among the messages can leak information. In order to eliminate this problem, BS-PUF must permute bits in public messages, which we could not do while preserving commutativity and invertibility. One possible solution that allows permuted public messages while preserving commutativity and invertibility is to let Bob and Alice share the same key. The corresponding protocol is shown in Fig. 4.
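The permutator-based exchange can be sketched with concrete bits. This is an illustrative model, not the circuit: the delay-bit tables are random placeholders standing in for $D(i, j)_m$ and $d(i, j)_m$, and the function names are hypothetical.

```python
import random

N = 4
PI_BOB = (1, 2, 3, 0)
PI_ALICE = (2, 3, 0, 1)

def delay_bits(seed):
    """Placeholder table of delay bits: table[i][j] stands for D(i, j)_m."""
    rng = random.Random(seed)
    return [[rng.randint(0, 1) for _ in range(N)] for _ in range(N)]

def encrypt(x, perm, table):
    """PUF pass (bit i picks up the delay of path i -> perm[i]), then the
    permutator returns every bit to its original position."""
    routed = [0] * N
    for i in range(N):
        routed[perm[i]] = x[i] ^ table[i][perm[i]]
    return [routed[perm[i]] for i in range(N)]

def decrypt(y, perm, table):
    """Permutator first (bit i moves to perm[i]), then the PUF in reverse."""
    routed = [0] * N
    for i in range(N):
        routed[perm[i]] = y[i]
    return [routed[perm[i]] ^ table[i][perm[i]] for i in range(N)]

def run_exchange(x, seed_bob=1, seed_alice=2):
    d_bob, d_alice = delay_bits(seed_bob), delay_bits(seed_alice)
    m1 = encrypt(x, PI_BOB, d_bob)         # Bob -> Alice
    m2 = encrypt(m1, PI_ALICE, d_alice)    # Alice -> Bob
    m3 = decrypt(m2, PI_BOB, d_bob)        # Bob -> Alice
    return m1, m2, m3, decrypt(m3, PI_ALICE, d_alice)
```

In this model the final output always equals the input. It also shows the linear weakness of unpermuted public messages: the bitwise XOR of the three transmitted messages equals the plaintext, whatever the delay bits are.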
In the shared key protocol, Bob permutes the input message with $\pi_K$, entangling it with his delay. Alice reverses the permutation using $\pi^{-1}_K$, entangling it with her delay. Note that the shared key is K. The bits are in their original positions in the message sent to Bob for decryption. Note that the entanglement with both PUFs' delays protects this message. The delay will be un-entangled from the correct bits in the subsequent decryption steps. The bit order is different in the message from Bob to Alice versus in the message from Alice to Bob. This avoids linear leakage of information in XOR-based equations on these two messages.
Details of the shared key scheme presented in Fig. 4 are as follows.
• Step 1: Bob permutes $x_0, x_1, x_2, x_3$ with $\pi = (1, 2, 3, 0)$ and gets $(x_3 \oplus D(3,0)_m,\; x_0 \oplus D(0,1)_m,\; x_1 \oplus D(1,2)_m,\; x_2 \oplus D(2,3)_m)$. It is sent to Alice without any further bit-level routing; this achieves bit-level confusion of the public message.
• Step 3: $f_{Alice}$ performs the reverse permutation $\pi^{-1} = (3, 0, 1, 2)$ of $f_{Bob}$ and simultaneously applies Alice's delay. After $f_{Alice}$ is applied, all bits are rotated back to their original positions, but each bit is encrypted with two physical delay values. In this example, after applying $f_{Alice}$ we get $(x_0 \oplus D(0,1)_m \oplus d(1,0)_m,\; x_1 \oplus D(1,2)_m \oplus d(2,1)_m,\; x_2 \oplus D(2,3)_m \oplus d(3,2)_m,\; x_3 \oplus D(3,0)_m \oplus d(0,3)_m)$.
• Step 5: $f_{Bob}$ is applied. Permutation $\pi$ is applied again and the delay added in Step 1 is cancelled by XOR. The message sent to Alice is then $(x_3 \oplus d(0,3)_m,\; x_0 \oplus d(1,0)_m,\; x_1 \oplus d(2,1)_m,\; x_2 \oplus d(3,2)_m)$.
• Finally, $f_{Alice}$ is applied, bit positions are rotated back again, and the delay added in Step 3 is cancelled by XOR. The message from the previous step is converted to $(x_0, x_1, x_2, x_3)$, which equals the original message.

Evaluating all messages crossing the insecure channel, no linear relationships exist among any pairs of messages that yield information to a man-in-the-middle. No duplicate delays appear at any bit position. There is no way to retrieve the original message from the in-flight messages without the shared key and access to Bob's and Alice's PUFs.

Fig. 10. The path delay capture unit tests for and stores the path delay. The edge detector detects an output transition; S equal to output will not be detected. Consequently, the transmission path receives $\bar{S}$ and S successively; a transition at output is guaranteed.
All messages are protected while traversing the insecure channel. The permutation applied by Bob protects the first message as it travels to Alice. Entanglement with both Alice and Bob's delay protects Alice's response. The permutation then protects the final message from Bob to Alice.
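A symbolic sketch of the shared key exchange makes the two claims checkable: the delays cancel, and no delay term repeats at the same bit position across the public messages. Bits are modeled as sets of symbols (XOR is set symmetric difference); the tags `D` and `d` stand for Bob's and Alice's delay bits, and all function names are illustrative assumptions.

```python
N = 4
PI = (1, 2, 3, 0)                               # shared permutation pi_K
PI_INV = tuple(PI.index(j) for j in range(N))   # (3, 0, 1, 2)

def term(name):
    """One symbolic bit; XOR of symbolic bits is set symmetric difference (^)."""
    return frozenset([name])

def puf_pass(bits, perm, tag):
    """Route bit i to perm[i] and XOR in the delay bit of that path."""
    out = [None] * N
    for i, b in enumerate(bits):
        out[perm[i]] = b ^ term(f"{tag}({i},{perm[i]})")
    return out

def shared_key_exchange():
    x = [term(f"x{i}") for i in range(N)]
    m1 = puf_pass(x, PI, "D")          # Step 1: Bob permutes and entangles
    m2 = puf_pass(m1, PI_INV, "d")     # Step 3: Alice un-permutes and entangles
    m3 = puf_pass(m2, PI, "D")         # Step 5: Bob's delay cancels
    out = puf_pass(m3, PI_INV, "d")    # last pass: Alice's delay cancels
    return x, (m1, m2, m3), out

def delay_terms(symbolic_bit):
    """Keep only the delay symbols of one symbolic bit."""
    return {t for t in symbolic_bit if t[0] in "Dd"}
```

Calling `shared_key_exchange()` returns the plaintext symbols, the three public messages, and the recovered message, so each position can be inspected term by term.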
IV. BARREL SHIFTER PUF DESIGN
We evaluate a barrel shifter as a potential invertible and commutative PUF. The block diagram of a barrel shifter is shown in Fig. 6 . For simplicity, only two shift levels are shown.
Output Logic is added to capture path delay $D(i, i')$. An Event Counter is initialized to 0. The RST signal simultaneously starts the Event Counter and releases the input message. The delay is captured by reading the Event Counter when the Output Logic detects a transition. Finally, the entanglement block in Output Logic entangles delay information (LSB or 2nd LSB of delay) with the output bit.
Each shift stage is logically similar to an arbiter PUF [25] stage.
Key bits determine the shift amount $s = \sum_{i=0}^{k} key_i \cdot 2^i$. Thus, $key_i$ is applied from LSB to MSB, from left to right. For example, in the diagram in Fig. 6, key = {0, 1} encodes a right shift by 2 in the second stage. Consequently, Input 0 traverses a different path, and therefore yields a different delay, with different keys.
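The logical side of the key-to-shift mapping can be sketched as follows. The sketch assumes that stage i rotates by $2^i$ when $key_i = 1$ (consistent with the sum above) and that a right shift by s moves input 0 to output s; both are modeling assumptions, and the function names are hypothetical.

```python
def shift_amount(key_bits):
    """s = sum(key_i * 2**i); key_bits[0] is the LSB and drives the first stage."""
    return sum(bit << i for i, bit in enumerate(key_bits))

def barrel_shift(bits, key_bits):
    """Apply one stage per key bit; stage i rotates right by 2**i when key_i is 1."""
    n = len(bits)
    for i, k in enumerate(key_bits):
        if k:
            step = (1 << i) % n
            bits = bits[-step:] + bits[:-step]   # input j moves to output j + step
    return bits
```

For key = {0, 1} this gives a shift of 2, so a 4-bit input `[a, b, c, d]` comes out as `[c, d, a, b]`. Only the routing is modeled here; the delay accumulated along each route is the physical part of the PUF.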
The delay variation is generated by transistor-level mismatch [26] and doping variability [27]. Variation accumulates over several stages until it is sufficiently large to be detected by the Output Logic.
BS-PUF must be invertible; this property facilitates decryption. Consequently, the physical delay measurements must not depend on the bit state; they should be a function only of the path.
V. CIRCUIT IMPLEMENTATION
A commutative PUF based on a barrel shifter is implemented in hardware. Transmission gates implement the shift paths. The circuit is subdivided into 3 components: input logic, shift unit and output logic.
A. Input logic
Input logic is used to trigger the delay test system. It is a 3-input, 1-output circuit that connects the input signal S or its inverse $\bar{S}$ to the output terminal (Fig. 7). Input logic consists of three transmission gates. RST (reset) is used to control the ON/OFF status of the first transmission gate. When RST is high, Input travels through the first gate and arrives at an intermediate node. Otherwise, it is blocked. REV (reverse) determines whether Input is inverted. Input will be inverted when REV = 1. The function definition for input logic is: output = RST • (REV ⊕ input).
B. Shift unit
Shift units implement the path selection and form shift stages. Shift unit size determines the magnitude of delay. We construct a barrel shifter with 8 shift stages for testing. Each layer contains 256 shift units. Each stage shifts by either $2^{7-n}$ or 0, where n is the stage index.

Each shift unit is a 4-input, 1-output circuit shown in Fig. 8. Either inputA or inputB is mapped to output. The mapping is determined by the key. A key value of 1 causes the upper transmission gate to open; output then becomes inputA. Otherwise, output becomes inputB.

The path delay value should vary depending on the shift path. Path delay primarily depends on the shift units' transmission gates. Adding load capacitance after each transmission gate, or accumulating variation over several stages of transmission gates, enlarges the delay; it becomes detectable by the path delay counter.

In BS-PUFs, uniqueness depends on how much delay variation the same path exhibits on different chips.

Modifying transistor area is the main method for increasing the inter-chip variation. Transistor delay variation is inversely proportional to transistor area [28]. Sizing transistors smaller results in increased delay variation. However, BS-PUF requires plaintext-independent path delay. Path delay for a 1-valued bit compared to a 0-valued bit differs for minimum transistor sizes. Hence larger transistors are used in shift units.
C. Output logic
Output logic measures/captures path delay. Output logic for each bit contains 3 parts: counter, edge detector trigger generator and entanglement logic (Fig. 9(d) ).
Counter takes CLK and RST as input producing a 10-bit output; it counts the number of rising edges of CLK. Setting RST high resets the counter to 0. The path delay is expressed as (input clock period) × (counter value).
Edge detector trigger generator generates a pulse in response to a transition at its input. It includes an edge detector (Fig. 9(b)) and a positive edge trigger generator (Fig. 9(c)). The edge detector converts a rising or falling edge into a rising edge at its output. The positive edge trigger generator converts the rising edge from the edge detector into a pulse.

The output logic works as follows. First, a rising/falling edge at the input produces a pulse at the edge detector trigger generator output. This pulse enables the transmission gate in Fig. 9(d) for a short time period (2 ns). During this time, the counter output is captured; it must not change while being captured. Thus, the enable time period must be shorter than the clock period (4 ns). Entanglement logic extracts the mth LSB of delay $D(i, i')$. Computing the XOR of this bit with the input signal $x_i$ results in the entangled output bit.

The output logic works by detecting a transition. Whether a transition occurs depends on the previous output value. Thus, the output logic is incapable of detecting unchanging output values. An output transition is forced by providing $\bar{x}_i$ before $x_i$ at the input.
D. Path Delay Testing
The input logic, shift unit and output logic work together to capture the path delay. The following five steps are necessary.
1) Set $\bar{x}_i$ as input and reset input logic.
2) Wait for $\bar{x}_i$ to arrive at output logic.
3) Reset input logic and clock counter, set $x_i$ as input.
4) Wait as $x_i$ travels the path determined by the key, triggering a transition at the output logic.
5) Encrypt using the captured counter value.
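The arithmetic behind the last step can be sketched as below. It assumes an idealized counter that holds the number of full 4 ns clock periods elapsed when the output transition is detected; real capture timing depends on clock phase and the enable pulse, which this sketch ignores. The function names are illustrative.

```python
CLOCK_PERIOD_NS = 4.0   # counter clock period from the text
COUNTER_BITS = 10       # counter width from the text

def counter_value(path_delay_ns):
    """Full clock periods elapsed before the transition, kept to 10 bits."""
    return int(path_delay_ns // CLOCK_PERIOD_NS) & ((1 << COUNTER_BITS) - 1)

def delay_bit(path_delay_ns, m):
    """m-th least significant bit of the captured counter (m = 1 is the LSB)."""
    return (counter_value(path_delay_ns) >> (m - 1)) & 1

def entangle(x_bit, path_delay_ns, m=2):
    """XOR the message bit with the chosen delay bit."""
    return x_bit ^ delay_bit(path_delay_ns, m)
```

Under this model a 120 ns path gives a counter value of 30, and delays of 85 ns and 145 ns give 21 and 36.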
VI. POST-LAYOUT SIMULATION RESULT
The entanglement logic utilizes a 1-bit result from the path delay. The path delay capture logic provides a multiple-bit delay counter. One bit must be chosen; it must be shown to have the requisite properties for BS-PUF: (1) inter-chip variability, (2) intra-chip reproducibility, (3) randomness, (4) commutativity. Cadence Spectre simulations are used to generate raw delay data. Delay variability assessment is conducted by 3σ Monte Carlo sampling over process parameters. This test uses the IBM 130 nm PDK. A common centroid layout is employed to reduce linear gradient errors [29].
We construct an 8-level barrel shifter accepting a 256-bit input with a 256-bit output. Output logic similar to input capture logic in [30] detects output voltage changes. Voltage transitions send a control signal to a counter. Path delay is captured at the resolution of the counter's clock period; a period of 4ns is used. Delays must be a reasonable multiple of the clock period to express variation.
In the following experiments, we primarily focus on raw data: (1) Monte Carlo sampling 200 times on the path from input 0 to output 16 (2) Monte Carlo sampling 200 times on all 256 paths with no shifting.
A. Inter-chip Variability
Shift path delay is a function of the silicon fabrication process; it potentially exhibits PUF properties. Each shift path terminates with entanglement logic requiring one bit. A bit from the delay counter must be selected. The chosen bit must exhibit sufficient variation.
Monte Carlo simulation captures single path delay variability as a proxy for inter-chip delay variability. As shown in Fig. 11, in 200 Monte Carlo samples over process parameters performed on path $x_0 \to y_{16}$, the delay ranges from 85 ns to 145 ns with an average around 120 ns. This is a ±25% (±30 ns) variation. With a 4 ns clock period, the counter output varies by about ±8. This indicates that roughly the least significant 3 bits of delay have significant entropy in inter-PUF measurements. Thus, the LSB, 2nd LSB, and 3rd LSB are candidates for entanglement.
B. Inter-chip Uniqueness
The chosen path delay bit must exhibit inter-chip uniqueness. This requires significant variance between responses on different chips. Pair-wise Hamming distance (HD) is a criterion that measures variability.

The HD of 200 path delay samples of 256-bit responses is computed. Table III shows the distribution of inter-chip HD for the LSB. Similar figures are given for the 2nd LSB in Table IV. For the LSB, the mean HD is 127.99 bits with a standard deviation of 8.04 bits. For the 2nd LSB, these values are 128.01 bits and 7.99 bits, respectively. An HD of 128 means roughly 50% of the response bits differ, which minimizes the likelihood that two BS-PUFs generate the same output.
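A minimal sketch of the pair-wise HD computation follows. The responses here are independent, unbiased random bits generated in software as an idealized stand-in, not the paper's simulation data; for such bits the HD between two 256-bit responses is binomial with mean 128 and standard deviation $\sqrt{256}/2 = 8$, which is the reference point for the reported figures.

```python
import random
from itertools import combinations
from statistics import mean, pstdev

def hamming_distance(a, b):
    """Number of positions at which two equal-length responses differ."""
    return sum(x != y for x, y in zip(a, b))

def pairwise_hd(responses):
    """HD for every unordered pair of responses."""
    return [hamming_distance(a, b) for a, b in combinations(responses, 2)]

def simulated_responses(chips, bits, seed=0):
    """Idealized stand-in data: independent, unbiased bits per chip."""
    rng = random.Random(seed)
    return [[rng.getrandbits(1) for _ in range(bits)] for _ in range(chips)]

if __name__ == "__main__":
    hds = pairwise_hd(simulated_responses(200, 256))
    print(f"mean HD = {mean(hds):.2f}, std = {pstdev(hds):.2f}")
```

With 200 responses there are 19,900 pairs, so the sample mean and deviation are stable enough to compare against the ideal 128 and 8.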
C. Intra-chip Reproducibility
The usefulness of a single PUF relies on it producing a consistent response to a challenge; the response should be independent of the environment. Tests are performed subjecting BS-PUF to: (1) temperature variation and (2) supply voltage variation. The frequency of response bit flips is quantified.

Bit flip rate is the frequency with which a bit changes from 0 → 1 or 1 → 0. It is computed relative to some baseline response. Gathering responses at common room temperature (25 °C) and supply voltage (5 V) establishes this baseline. The percentage of path delays where a bit flips is the bit flip rate. For example, the LSB flipping in 64/256 paths represents a 25% bit flip rate.
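The bit flip rate defined above reduces to a short comparison against the baseline response. The sketch below uses made-up bit vectors that mirror the 64/256 example; `bit_flip_rate` is a hypothetical helper name.

```python
def bit_flip_rate(baseline, observed):
    """Fraction of paths whose chosen delay bit differs from the baseline."""
    if len(baseline) != len(observed):
        raise ValueError("responses must cover the same paths")
    flips = sum(b != o for b, o in zip(baseline, observed))
    return flips / len(baseline)

baseline = [0] * 256                 # illustrative baseline response
observed = [1] * 64 + [0] * 192      # 64 of 256 paths flipped
rate = bit_flip_rate(baseline, observed)
```

Here `rate` is 0.25, matching the 25% example in the text.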
BS-PUF retains a bit flip rate smaller than 18% under environment variation. This is similar to the flip rate of traditional RO PUFs [31]. The counter increments once per 4 ns clock period, so a ±1 change in the captured path delay is expected. Knowing how temperature variation affects the chosen entanglement bit is ideal; bit flip rate quantifies this. It is computed in response to temperature variation, shown in Fig. 12. Vertical bars represent bit flips for LSB (blue) and 2nd LSB (green). The 2nd LSB flip rate is under 12% while the LSB flip rate is significantly higher. Thus, the 2nd LSB provides better reproducibility. Bit flip rate is also computed in response to voltage variation, shown in Fig. 13. The flip rate for the 2nd LSB is under 18% while LSB rates are significantly higher. The 2nd LSB again provides better reproducibility; it is the best candidate for the entanglement bit.
A higher order bit could be selected. It would have comparatively better flip rates, but reduced variability. Many mature techniques exist to compensate for temperature and voltage variation [32] , [33] . These techniques operate at the flip rates expressed by LSB and 2nd LSB. Thus, the advantage of choosing a higher order bit is minimal.
D. Randomness
Output of a good PUF should look like a pseudo-random number generator so that an attacker cannot model it easily. Assessing randomness performance of BS-PUF uses data from Monte Carlo sampling of path delays. Delay values are converted to binary responses by extracting the m th LSB bit from the delay. Each 256-bit response (one bit from each path) is examined using NIST statistical test suite. Table I and Table II give the detailed test results for LSB and 2nd LSB of the BS-PUFs output. The minimum pass rate for each statistical test is 193 for a sample size of 200 binary 
E. Commutativity
Encryption and decryption rely on function composition. Each party must decrypt a message that was encrypted both by itself and by the other party. The other party may have changed the bit values (0 or 1). Thus, delay variation must be independent of the bit value. An input of 1 must have the same path delay as an input of 0.
BS-PUF path delays depend only on the permutation key. Shift units are sized to achieve balanced pullup and pulldown resistance. Transmission gate NMOS sizing is
Two tests are performed to verify pullup and pulldown variability.
1) Testing rising/falling edge delay in four different process corners (FF, FS, SF, SS). The transmission time difference for 0 and 1 must be smaller than the counter period (4ns).
2) Performing Monte Carlo sampling of path delay for inputs 0 and 1. Delays are recorded for all paths without bit shifting. No bit flips should occur in the path delay.
The maximum transmission time difference for 0 and 1 is 2.34ns; this is much smaller than the 4ns counter period. Consequently, no path delay bits flip in Monte Carlo sampling.
VII. PERFORMANCE EVALUATION

A. Modeling Attack
According to [34] , all examined Strong PUFs under a given size can be modeled with machine learning with success rates above their stability in silicon. Consider the barrel shifter in our communication protocol to be a black box. Attackers know nothing about the key and physical delay of barrel shifter. An attacker should not be able to model the relationship from input bits to the output bits. Such a model provides an eavesdropper information about the plaintext given a ciphertext.
To investigate the resilience of BS-PUFs against modeling attacks, various ciphertexts are generated with different keys and plaintexts for training and cross-validation.
Logistic Regression (LR) [35] and Evolution Strategies (ES) [36] , [37] are commonly used to model PUF output. ES is specialized to modeling PUFs under noisy conditions [34] ; it does not apply when voltage supply and temperature are certain.
Thus, only LR modeling is performed. Since the error rate of machine learning prediction decreases with the size of training set, LR modeling is tested for LSB response and 2nd LSB response with a variety of training sets with different sizes.
Monte Carlo sampling [38] utilizes randomness to generate n challenge-response pairs (CRPs). n random keys, K = {K_0, K_1, ..., K_n}, are generated. Responses, R, are generated by entangling the plaintext, P, using these keys: R_i = BS-PUF(P, K_i). Note that the response is the shift path delay, which depends on the key only. Hence, the plaintext need not be modified. This random CRP sample is assumed to be representative of the distribution of all CRPs. Simulating BS-PUF(P, K_i) requires computationally expensive Cadence Spectre simulations, so an efficient method for computing R_i given K_i is needed. Thus, we apply Monte Carlo sampling to create a delay matrix, D, modeling the delay of all shift paths. The delay of each shift unit is recorded. Path delay is then computed by: (1) summing the delays of all shift units along a path, (2) dividing the sum by the 4ns capture logic resolution, and (3) extracting the LSB or 2nd LSB. Thus, D enables computation of path delays given K_i. For example, Eq. (1) is a sample delay matrix for a 4-input, 2-stage BS-PUF, where d_{i,j} represents the exact delay values of the top and bottom transmission gates in the shift unit at the ith row and jth column.
Plaintext-ciphertext pairs (PCPs) are computed using D. For the delay matrix in Eq. (1), using key = {1, 0}, which encodes a right shift in the first stage, the plaintext (i_0, i_1, i_2, i_3) generates the response in Eq. (2).
This process makes it feasible to compute any PCP without further circuit simulation.
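The three-step path delay computation described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the simulation flow used for the reported results: each entry D[i][j] is assumed to hold a (straight, shifted) pair of transmission gate delays, stage j is assumed to rotate by 2^j rows when its key bit is 1, and all delay values and function names are hypothetical.

```python
COUNTER_PERIOD_NS = 4.0  # capture logic resolution stated in the text


def path_delay(D, key, row):
    """Sum unit delays along the path taken by the bit entering at `row`.

    D[i][j] is an assumed (straight, shifted) pair of gate delays in ns for
    the shift unit at row i, stage j. key[j] == 1 rotates the bit by 2**j
    rows at stage j (the direction is an assumption of this sketch).
    Returns (total delay in ns, output row).
    """
    n_rows = len(D)
    total = 0.0
    for stage, shift in enumerate(key):
        straight, shifted = D[row][stage]
        if shift:
            total += shifted
            row = (row + (1 << stage)) % n_rows
        else:
            total += straight
    return total, row


def response_bit(delay_ns, m, period_ns=COUNTER_PERIOD_NS):
    """Quantize a delay by the counter period and extract the m-th LSB."""
    count = int(delay_ns // period_ns)
    return (count >> (m - 1)) & 1


def responses(D, key, m):
    """One response bit per input row for a given key."""
    return [response_bit(path_delay(D, key, r)[0], m) for r in range(len(D))]


# Hypothetical 4-input, 2-stage delay matrix (values in ns, made up).
D = [
    [(5.0, 6.0), (7.0, 9.0)],
    [(4.0, 8.0), (6.0, 5.0)],
    [(3.0, 7.0), (8.0, 4.0)],
    [(6.0, 5.0), (5.0, 7.0)],
]
key = (1, 0)  # shift in the first stage only
lsb_response = responses(D, key, m=1)
print(lsb_response)
```

Because each response is a table lookup and a sum, evaluating many keys costs far less than rerunning a circuit simulation per key.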
For a BS-PUF with a 256-bit input message, there are 2^256 possible input messages. There are 8 stages and thus 2^8 possible keys. It is infeasible to generate all 2^256 × 2^8 = 2^264 PCPs. Logistic Regression (LR) is performed with training sets of n = {10, 100, 1000} PCPs per key. To obtain a representative sample of PCPs, responses are computed with 100 keys and 10,000 plaintexts. PCPs not part of the training set are used for cross-validation. Scalability experiments are conducted on a 6-stage, 64-bit input BS-PUF; the delay matrix of this BS-PUF is the top-left 64 × 6 sub-matrix of the 8-stage delay matrix. Table V and Table VI show the prediction accuracy of LR on the LSB and 2nd LSB. LR is implemented as an iterative program written in Matlab. The regression coefficients' initial values are set to (0, 0) in all LR applications. The silicon stability of BS-PUFs is 75%. Thus, any model reaching a higher prediction rate should be considered a successful attack.
The LSB resists modeling better than the 2nd LSB. LR achieves a 79.5% prediction rate for the 6-stage BS-PUF 2nd LSB output, which exceeds the 75% stability threshold. If the 2nd LSB is used as the delay bit, then LR can successfully model a 6-stage BS-PUF given a sufficient number of PCPs. On the other hand, with the same modeling process, the LSB cannot be modeled even with a large number of training samples. This is expected, as the LSB is inherently more variable. Consequently, the choice between the LSB and the 2nd LSB for the delay bit presents a tradeoff between security and reproducibility; the LSB provides the former while the 2nd LSB provides the latter.
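To make the attack criterion concrete, the following is a minimal Python sketch of an iterative logistic regression with zero-initialized coefficients, followed by a comparison of its prediction accuracy against the 75% stability threshold. It is not the Matlab program used for the reported results; the learning rate, epoch count, function names, and the four-sample toy data set are assumptions chosen only to keep the example self-contained.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(X, y, learning_rate=0.5, epochs=500):
    """Batch gradient descent on the logistic loss, coefficients start at 0."""
    n_features = len(X[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(w * v for w, v in zip(weights, xi)) + bias) - yi
            for j, v in enumerate(xi):
                grad_w[j] += err * v
            grad_b += err
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


def predict(weights, bias, xi):
    """Predict 1 when the modeled probability exceeds 0.5, else 0."""
    z = sum(w * v for w, v in zip(weights, xi)) + bias
    return 1 if sigmoid(z) > 0.5 else 0


def accuracy(weights, bias, X, y):
    hits = sum(predict(weights, bias, xi) == yi for xi, yi in zip(X, y))
    return hits / len(X)


def attack_succeeds(prediction_rate, stability=0.75):
    """A model counts as a successful attack if it beats silicon stability."""
    return prediction_rate > stability


# Toy data (hypothetical): the label simply copies the first feature bit.
X_toy = [[0, 0], [0, 1], [1, 0], [1, 1]]
y_toy = [0, 0, 1, 1]
w, b = train_logistic_regression(X_toy, y_toy)
toy_accuracy = accuracy(w, b, X_toy, y_toy)
print(toy_accuracy, attack_succeeds(toy_accuracy))
```

In a real evaluation the accuracy would be measured on held-out PCPs rather than on the training set, as in the cross-validation described above.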
B. Speed Performance
One of the most important advantages of BS-PUF-based encryption is that it is faster than traditional symmetric encryption schemes such as AES. This section compares BS-PUF encryption with AES.
BS-PUF-based encryption outperforms conventional AES implementations, although some exceptions relying on high-speed crypto processors and architectures exist [39]. Performing AES encryption on an Intel Pentium Pro processor requires 18 clock cycles per byte. Decryption takes even more cycles, giving a conservative estimate of 36 clock cycles per byte for an encryption/decryption round trip. This time increases as the block size increases. Comparatively, BS-PUF-based encryption requires 1.6 clock cycles per byte per BS-PUF, resulting in 6.4 clock cycles per byte for both encryption and decryption. This is roughly a 5.6-fold reduction in cycle count. In addition, BS-PUF-based encryption scales better, because encryption delay is near-constant (log n delay for block size n) regardless of block length.
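The cycle counts quoted above can be checked with a few lines of Python. The constant names are hypothetical. The number of BS-PUF passes is not stated in this section; it is inferred here from the ratio of the two quoted figures. The stage count assumes one shifter stage per doubling of the block size.

```python
AES_ROUND_TRIP_CYCLES_PER_BYTE = 36.0    # conservative estimate from the text
BS_PUF_CYCLES_PER_BYTE_PER_PASS = 1.6    # per BS-PUF, from the text
BS_PUF_ROUND_TRIP_CYCLES_PER_BYTE = 6.4  # encryption plus decryption, from the text

# Number of BS-PUF passes implied by the two quoted figures (inferred).
implied_passes = BS_PUF_ROUND_TRIP_CYCLES_PER_BYTE / BS_PUF_CYCLES_PER_BYTE_PER_PASS

# Ratio of AES to BS-PUF round-trip cycle counts.
cycle_ratio = AES_ROUND_TRIP_CYCLES_PER_BYTE / BS_PUF_ROUND_TRIP_CYCLES_PER_BYTE


def shifter_stages(block_bits):
    """Stages in a barrel shifter for a power-of-two block size (log2 n)."""
    if block_bits < 1 or block_bits & (block_bits - 1):
        raise ValueError("block size must be a power of two")
    return block_bits.bit_length() - 1


print(implied_passes, round(cycle_ratio, 3))
print(shifter_stages(64), shifter_stages(256))
```

The stage counts agree with the 6-stage, 64-bit and 8-stage, 256-bit configurations used in the modeling experiments.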
This work proposes a protocol for data transmission using BS-PUF. It necessitates multiple-message round transaction between sender and receiver. This incurs transmission overhead.
The BS-PUF protocol has advantages over AES in encryption speed.
C. Area Needs
BS-PUF does very little mathematical computation; protection is provided by the physical properties of the encryption device. Little area is required due to this simplicity. In comparison, AES performs many more computations requiring greater area.
According to [40] , 32-bit FPGA-based AES encryption contains 8, 300 2-input NAND gate equivalents. A 32-bit BS-PUF requires 2, 400 transistors, which is 600 2-input NAND gate equivalents. This evaluation is not technology dependent.
VIII. FUTURE WORK
Much needs to be addressed to establish the practicality of commutative PUFs. An evaluation of PUFs based on more relevant permutation families, such as the Keccak sponge family [23], [24], is needed. The overhead of reversible implementations of these functions, which also offer invertibility, needs to be assessed. With invertibility, asymmetric encryption is also feasible, and we are exploring this direction. Another important research direction is quantifying the security offered by BS-PUF compared with AES.
The impact of PUF noise requires more discussion. The proposed design uses raw PUF responses; it will therefore be noisier than traditional PUFs. An error coding scheme using helper data and some form of fuzzy extraction is required.
Once we have designs for a realistic permutation family, similar evaluations of their robustness are needed. Path delay distributions across chips need to show variability and uniqueness; within the same PUF, different paths need to show variability and randomness; and delay variation caused by temperature and supply voltage needs to be small enough. In addition, the resource needs of these implementations need to be evaluated in terms of area, time and energy. The timer for input capture may impose an insignificant overhead. Its accuracy plays a central role in the feasibility of BS-PUFs.
IX. CONCLUSIONS
In this work, we explore a variety of encryption protocols based on commutative PUFs and propose a circuit implementation of the required commutative PUF (BS-PUF). Commutativity relies on symmetric delays in forward and backward paths regardless of the message bit state. Spectre Monte Carlo simulations indicate that plaintext bit state variation introduces a delay offset of less than 1 bit. This ensures the commutativity of the system. Simulation shows that inter-chip variability (up to ±25% chip-to-chip variation) is acceptable. These encryption PUFs have the potential to root encryption in hardware, hence increasing robustness beyond current software-only solutions.
Asymmetric encryption methods are valued for their ability to establish a secure communication channel in the absence of a priori shared secret. Such methods require complex computations resulting in low throughput compared to symmetric encryption. BS-PUF has the potential to provide an asymmetric encryption method with performance similar to AES (symmetric encryption).
Basing encryption in hardware limits the attack surface. An adversary cannot retrieve the message even when both the encryption key and the ciphertext are known, because information about the PUF behavior is not available to them. The behavior of the encryption function becomes a secret. Thus, more entropy is added to the system. In addition, BS-PUF-based encryption provides much better speed and area performance than AES.
