Protocols for secure computation enable mutually distrustful parties to jointly compute on their private inputs without revealing anything but the result. Over recent years, secure computation has become practical and considerable effort has been made to make it more and more efficient. A highly important tool in the design of two-party protocols is Yao's garbled circuit construction (Yao 1986), and multiple optimizations on this primitive have led to performance improvements of orders of magnitude over the last years. However, many of these improvements come at the price of making very strong assumptions on the underlying cryptographic primitives being used (e.g., that AES is secure for related keys, that it is circular secure, and even that it behaves like a random permutation when keyed with a public fixed key). The justification behind making these strong assumptions has been that otherwise it is not possible to achieve fast garbling and thus fast secure computation. In this paper, we take a step back and examine whether it is really the case that such strong assumptions are needed. We provide new methods for garbling that are secure solely under the assumption that the primitive used (e.g., AES) is a pseudorandom function. Our results show that in many cases, the penalty incurred is not significant, and so a more conservative approach to the assumptions being used can be adopted.
1. INTRODUCTION
Background
In the setting of secure computation, a set of parties with private inputs wish to compute a joint function of their inputs, without revealing anything but the output. Protocols for secure computation guarantee privacy (meaning that the protocol reveals nothing but the output), correctness (meaning that the correct function is computed), and independence of inputs (meaning that parties are not able to make their inputs depend on the other parties' inputs). These security guarantees are to be provided in the presence of adversarial behavior. There are two classic adversary models that are typically considered: semi-honest (where the adversary follows the protocol specification but may try to learn more than is allowed from the protocol transcript) and malicious (where the adversary can run any arbitrary polynomial-time strategy in its attempt to breach security).
Garbled circuits. One of the central tools in the construction of secure two-party protocols is Yao's garbled circuit [19, 23]. The basic idea behind Yao's protocol is to provide a method of computing a circuit so that values obtained on all wires other than circuit-output wires are never revealed. For every wire in the circuit, two random or garbled values are specified such that one value represents 0 and the other represents 1. For example, let $i$ be the label of some wire. Then, two values $k_i^0$ and $k_i^1$ are chosen, where $k_i^b$ represents the bit $b$. Of course, the difficulty with such an idea is that it seems to make computation of the circuit impossible. That is, let $g$ be a gate with incoming wires $i$ and $j$ and output wire $\ell$. Then, given two random values $k_i^b$ and $k_j^c$, it does not seem possible to compute the gate because $b$ and $c$ are unknown. We therefore need a method of computing the value of the output wire of a gate (also a random value $k_\ell^0$ or $k_\ell^1$) given the value of the two input wires to that gate. In short, this method involves providing "garbled computation tables" that map the random input values to random output values. However, this mapping should have the property that given two input values, it is only possible to learn the output value that corresponds to the output of the gate (the other output value must be kept secret). This is accomplished by viewing the four possible inputs to the gate, $k_i^0$, $k_i^1$, $k_j^0$, and $k_j^1$, as encryption keys. Then, the output values $k_\ell^0$ and $k_\ell^1$, which are also keys, are encrypted under the appropriate keys from the incoming wires. For example, let $g$ be an OR gate. Then, the key $k_\ell^1$ is encrypted under the pairs of keys associated with the input values (1, 1), (1, 0) and (0, 1). In contrast, the key $k_\ell^0$ is encrypted under the pair of keys associated with (0, 0).
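To make the table-based idea concrete, the following is a minimal Python sketch of garbling and evaluating a single OR gate. It is not the construction used in this paper. As assumptions made purely for illustration, a SHA-256 hash of the two input keys serves as a one-time pad in place of encryption, an all-zero tag marks the one entry that decrypts correctly, and the names `pad`, `garble_or_gate`, `evaluate_gate`, `fresh_wire` and `KEY_LEN` are hypothetical.

```python
import hashlib
import secrets

KEY_LEN = 16           # bytes per garbled value (assumed size)
TAG = bytes(KEY_LEN)   # all-zero padding that marks a valid decryption


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def pad(k_i: bytes, k_j: bytes, gate_id: int) -> bytes:
    """Derive a 32-byte pad from both input keys (SHA-256 is a stand-in)."""
    return hashlib.sha256(k_i + k_j + gate_id.to_bytes(4, "big")).digest()


def fresh_wire():
    """Two independent random keys: index 0 represents bit 0, index 1 bit 1."""
    return (secrets.token_bytes(KEY_LEN), secrets.token_bytes(KEY_LEN))


def garble_or_gate(gate_id, keys_i, keys_j, keys_out):
    """Encrypt the output key for each input pair and shuffle the entries."""
    table = []
    for b in (0, 1):
        for c in (0, 1):
            plaintext = keys_out[b | c] + TAG
            table.append(xor(pad(keys_i[b], keys_j[c], gate_id), plaintext))
    secrets.SystemRandom().shuffle(table)  # hide which row is which
    return table


def evaluate_gate(gate_id, table, k_i, k_j):
    """Try every entry; only the one made with (k_i, k_j) ends in TAG."""
    p = pad(k_i, k_j, gate_id)
    for ciphertext in table:
        plaintext = xor(p, ciphertext)
        if plaintext[KEY_LEN:] == TAG:
            return plaintext[:KEY_LEN]
    raise ValueError("no entry decrypted correctly")
```

The evaluator learns exactly one output key and cannot tell which bit it represents. The trial decryption over all four entries is what the point-and-permute optimization described later removes.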
Fast garbling and assumptions. Today, secure computation is fast enough to solve numerous problems in practice. This has been achieved through multiple significant efficiency improvements at the protocol level, and also through improvements to garbled circuits themselves. Many of the optimizations to garbled circuits (described below) come at the price of making strong assumptions on the security of the cryptographic primitives being used. For example, the free-XOR technique requires assuming circular security as well as a type of correlation robustness [7]; the use of fixed-key AES requires assuming that AES with a fixed key behaves like a public random permutation [5]; and reducing the number of encryption operations from 2 to 1 per entry in the garbled gate requires correlation robustness (when a hash function is used) and a related-key assumption (when AES is used).
Typically, the use of less standard cryptographic assumptions is accepted where necessary, especially in areas like secure computation where the costs are in general very high. However, in practice, solid cryptographic engineering practices dictate a more conservative approach to assumptions. New types of elliptic curve groups are not adopted quickly, people shy away from non-standard use of block ciphers, and more. This is based on sound principles, and on the understanding that deployed solutions are very hard to change if vulnerabilities are discovered. In the field of secure computation, the willingness to take any assumption that enables a faster implementation stands in stark contrast to standard cryptographic practice. In this paper, we propose to pause, take a step back, and ask how much non-standard assumptions really cost us and whether they are justified. We remark, as just one example, that practitioners have warned against assuming that AES is an ideal cipher, due to related-key weaknesses that have been found; see, e.g., [4, 6]. Furthermore, the security of AES with a known key was studied in [14], and the results show that the security margin for using AES in this way is arguably not as high as we would like. In particular, [14] presents an algorithm that distinguishes 7-round AES with a fixed key from a public random permutation, in time $2^{56}$ and with little memory. As in most situations, if the benefit is huge, then more flexibility with respect to the assumptions is justified, whereas if the gains are smaller then a more cautious approach is taken.
The focus of this paper is to study how much is really gained by relying on non-standard assumptions and to provide optimizations that require assuming nothing more than that AES behaves like a pseudorandom function.
Known Garbled Circuit Optimizations
Before proceeding to describe our work, we present an overview of the most important efficiency improvements to garbled circuits:
• Point and permute [21] : In order to prevent the garbled circuit evaluator from knowing what it is evaluating, the original construction randomly permuted the ciphertexts in each garbled gate. Then, when computing the garbled circuit, the evaluator tries each ciphertext in the gate until one correctly decrypts (this requires an additional mechanism to ensure that only one ciphertext decrypts to a valid value). On average, this means that 2.5 entries need to be decrypted per gate (where each costs 2 decryptions). The point and permute method assigns a random permutation or signal bit to each wire, that determines the order of the garbled gate. Then, the encryption of a garbled value includes the bit needed to enable direct access to the appropriate entry in the garbled table (given two garbled values and the two associated bits). This reduces the number of entries to decrypt to 1 (and thus 2 actual decryptions).
• Free XOR [16]: The garbled circuit construction involves carrying out encryptions at every gate in the circuit, and storing 4 ciphertexts. The free-XOR method enables the computation of XOR gates for free (the computation requires only computing 1-2 XORs, and no ciphertexts need be stored). This is achieved by choosing a fixed random mask $\Delta$ and making the garbled values on every wire have fixed difference $\Delta$ (i.e., for every $i$, the garbled values are $k_i^0$ and $k_i^1 = k_i^0 \oplus \Delta$).
In many circuits, the number of XOR gates is very large and so this significantly reduces the cost (e.g., in the AES circuit there are approximately 7,000 AND gates and 25,000 XOR gates; in a naive 32x32 bit multiplier circuit there are approximately 6,000 AND gates and 1,000 XOR gates [1] ).
We remark that the free-XOR method is patented, and as such, its use is restricted [17] .
• Reductions in garbled-circuit size [21, 22, 24] :
Historically, the most expensive part of any secure protocol was the cryptographic operations. However, significant algorithmic improvements to secure protocols together with much faster implementations of cryptographic primitives (e.g., due to better hardware) have considerably changed the equation. In many cases, communication can be the bottleneck and thus reducing the size of the garbled circuit is of great importance. In [21] , a method for reducing the number of garbled entries in a table from 4 to 3 was introduced; this is referred to as 4-to-3 garbled row reduction (or 4-3 GRR). This improvement is achieved by "forcing" the first ciphertext to be 0 (by setting the appropriate garbled value on the output wire so that the ciphertext becomes 0). In [22] , polynomial interpolation was used to further reduce the number of ciphertexts to just 2; this is referred to as 4-to-2 garbled row reduction (or 4-2 GRR).
Importantly, 4-3 GRR is compatible with free-XOR since only one output garbled value needs to be taken as a function of the input values (and the other garbled value can be set according to the fixed $\Delta$). In contrast, 4-2 GRR is not compatible with free-XOR. Nevertheless, in recent work, a new method called half gates [24] reduces the number of ciphertexts in AND gates from 4 to 2, while maintaining compatibility with free-XOR (in fact, half gates only work with free-XOR).
• Number of encryptions [20]: Classically, each entry in a garbled gate contains the encryption of one of the output garbled values under two input garbled values, and thus requires two encryptions. In [20], it was proposed to use a hash function as a type of key-derivation function, and to encrypt by hashing both input garbled values together and XORing the result with the output garbled value. This is secure in the random-oracle model, or under a "correlation-robustness" assumption [13]. This reduces the number of operations from 2 to 1. (Note, however, that two AES operations are typically much faster than a single hash operation, especially when utilizing the AES-NI instructions.)
• Fixed-key AES and use of AES-NI [5] : AES-NI is a set of CPU instructions that are now part of the Intel architecture. They allow AES computations to be carried out at incredibly fast rates, especially in modes of operation that can be highly pipelined. AES-NI offers instructions for encryption/decryption and for the AES key expansion.
However, since typical AES usages encrypt multiple blocks with a single key, the key expansion instructions do not highly optimize this part of the processing, and the key schedule generation routine is relatively expensive (compared to encryption/decryption). More importantly, pipelining cannot be carried out between different keys. When computing garbled circuits, 4 different keys are used in every gate, requiring many key schedules to be computed and preventing the use of pipelining.
In light of this, [5] proposed a method of using AES that is secure in the public random permutation model (i.e., assuming that AES for every fixed key behaves like a public random permutation). The method uses a fixed key for AES, applies AES on a combination of the input garbled values, and XORs the result with the appropriate output garbled value. This reduces the number of AES computations to 4 per gate. Furthermore, since a fixed key is used, only one key schedule needs to be computed for the entire circuit, and the encryptions within a gate can be fully pipelined. This led to an extraordinary speedup in the computation of garbled circuits, as demonstrated in the JustGarble implementation [5].
We stress that there have been a very large number of works that have provided highly significant efficiency improvements to protocols that use garbled circuits. However, our focus here is on improvements to garbled circuits themselves.
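As an illustration of why free-XOR needs neither encryptions nor ciphertexts, the following minimal Python sketch spells out the key arithmetic behind the fixed offset $\Delta$. The helper names (`key_for`, `garble_free_xor`, `evaluate_free_xor`) and the 16-byte key length are assumptions made for this example, not part of any cited implementation.

```python
import secrets

KEY_LEN = 16  # bytes per garbled value (assumed size)


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def key_for(k0: bytes, delta: bytes, bit: int) -> bytes:
    """Garbled value for `bit` on a wire whose 0-key is k0: k^1 = k^0 XOR delta."""
    return xor(k0, delta) if bit else k0


def garble_free_xor(k0_a: bytes, k0_b: bytes) -> bytes:
    """Garbler: the output wire's 0-key is the XOR of the input 0-keys.

    No garbled table is produced for the gate.
    """
    return xor(k0_a, k0_b)


def evaluate_free_xor(k_a: bytes, k_b: bytes) -> bytes:
    """Evaluator: XOR the two keys it holds, whichever bits they represent."""
    return xor(k_a, k_b)
```

For input bits $a$ and $b$, the evaluator obtains $k_a^0 \oplus k_b^0 \oplus (a \oplus b)\Delta$, which is the output key for $a \oplus b$. The same global $\Delta$ on every wire is also what creates the related-key and circularity issues mentioned earlier.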
Our Results
We construct fast garbling methods solely under the assumption that AES behaves like a pseudorandom function. In particular, we do not use fixed-key AES (since this requires assuming that fixed-key AES behaves like a public random permutation), and we use two AES encryptions per entry in the garbled gates (since using just one encryption requires some sort of related-key security assumption). In addition, we do not use free-XOR (since this requires circularity). However, this does enable us to use 4-to-2 row reduction. In brief, we construct the following:
• Fast AES-NI without fixing the key: We show that, in addition to pipelining encryptions, it is also possible to pipeline the key schedule of AES-NI, in order to achieve very fast garbling times without using fixed-key AES or any other non-standard AES variant. Namely, the key schedule processing of different keys can be pipelined together, so that the amortized effect of key scheduling on Yao garbling is greatly reduced. Our experiments (described below) show that this and other optimizations of AES operations have become so fast that the benefits of using fixed-key AES are almost insignificant. Thus, in contrast to current popular belief, in most cases fixed-key AES is not necessary for achieving extremely fast garbling.
• Low-communication XOR gates: Over the past years, it has become apparent that in secure protocols, communication is far more problematic than computation. The free-XOR technique is so attractive exactly because it requires no computation but also no communication for XOR gates. We provide a new garbling method for XOR gates that requires storing only a single ciphertext per XOR gate; our technique is inspired by the work of [15] . The computational cost is 3 AES computations for garbling the gate, and 1-2 AES computations for evaluating it. (This overhead is for an optimized garbling method that we show. We first present a basic scheme requiring 4 AES computations for garbling and 2 computations for evaluation.)
• Fast 4-2 row reduction: As we have mentioned, once we no longer use the free-XOR technique, we are able to use 4-2 GRR on the non-XOR gates. However, the method of [22] that uses polynomial interpolation is rather complex to implement (requiring finite field operations and precomputation of special constants to make it fast). In addition, even working in $GF(2^n)$ Galois fields and using the PCLMULQDQ Intel instruction, the cost is still approximately half an AES computation. We present a new method for 4-2 row reduction that uses a few XOR operations only, and is trivial to implement.
We implemented these optimizations and compared them to JustGarble [5]. There is no doubt that the cost of garbling and evaluation is higher using our method, since we have to run AES key schedules, and we pay for computing XOR gates. However, we show that within protocol executions, the difference is insignificant. We demonstrate this by running Yao's protocol for semi-honest adversaries, which consists of nothing but oblivious transfer (for which we use the fast OT extensions of [2]), garbled-circuit computation and evaluation, and communication.

Experimental results. We ran Yao's protocol for semi-honest adversaries inside Amazon EC2. The details of the results can be found in Section 5. The results show that removing the public random permutation assumption does not noticeably affect the performance of the protocol. Furthermore, in many scenarios, such as small circuits, large inputs, or relatively slow communication channels, garbling under the most conservative assumption (the existence of PRFs) performs on par with the most efficient garbling methods.
Patent-free garbled circuits. Another considerable advantage of using our method for computing XOR gates with low communication is that it does not rely on the free XOR technique and thus is not patented. Since patents in cryptography are typically an obstacle to adoption, we believe that the search for efficient garbling techniques that are not patented is of great importance.
Garbling under weaker yet non-standard assumption. Our work focuses on the comparison between garbling under a variety of strong assumptions (i.e., circularity, public random permutation) and garbling under a standard pseudorandom function assumption only. However, there are also garbling schemes that have been proven secure under a related-key assumption, but without circularity [15] . In the full version of this paper [11] , we continue the directions introduced by [15] in order to provide a more complete picture regarding the trade-off between efficiency and security. We present two new heuristics for solving the algorithmic problem presented in [15] , and show that a related-key assumption based garbling scheme (using any of the suggested heuristics) improves garbling and computation time, but fails to significantly reduce the communication overhead of the protocol.
FIXED-KEY AES VS. REGULAR AES
Background. Bellare et al. introduced the use of fixed-key AES in garbling schemes and implemented the JustGarble library [5] . This significantly speeds up garbling since the AES key schedule (which is quite expensive) need not be computed at every gate. Note that when constructing the garbled circuit four key schedules are required for every gate, and when evaluating the circuit two key schedules are required for every gate. This is very expensive. In addition, JustGarble utilizes the AES-NI instruction set with encryption pipelining, significantly reducing the cost of the AES computations.
Despite its elegance, the use of fixed-key AES requires the assumption that fixed-key AES behaves like a public random permutation. This is a very strong assumption, and one that has been brought into question regarding AES specifically by the block-cipher research community; see, for example, [14] . Clearly, the acceptance of this assumption in the context of secure computation and garbling is due to the perceived very high cost of garbling in any other way. However, the comparisons carried out in [5] to prior work are to Kreuter et al. [18] who use AES-256 using AES-NI without pipelining, and to Huang et al. [12] who use a hash function only. Thus, it is unclear how much of the impressive speedup achieved by [5] is due to the savings obtained by using fixed-key AES, and how much is due to the other elements that they included (pipelining of the AES computations in each gate, optimizations to the circuit representation, and more).
In this section, we show that it is possible to achieve fast garbling without using fixed-key AES and thus without resorting to the public random permutation model. We stress that some penalty will of course be incurred since the AES key schedule is expensive. Nevertheless, we show that when properly implemented, in many cases the penalty is not significant and it suffices to use regular AES. The goal is to make the performance depend on the throughput (which is excellent when pipelining is used) and not on the latency of a single computation. This goal can be achieved rather easily for the AES encryption alone, but we also achieve the more challenging task of pipelining the key schedule as well as the encryption.
Utilizing the AES-NI Pipeline
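The standard way of garbling a gate uses double encryption. Specifically, given 4 keys $k_i^0, k_i^1, k_j^0, k_j^1$ on the input wires, each output key is encrypted under one key from each input wire, with one encryption nested inside the other; this means that the encryptions must be computed sequentially and not in parallel. This makes a huge difference when using the AES-NI chip, since the cost of 8 pipelined encryptions is only slightly more than the cost of a single non-pipelined encryption. We therefore garble an AND gate in a way that enables pipelining. This is easily achieved by applying a pseudorandom function $F$ (which will be instantiated as AES) to the gate index and appropriate signal/permutation bits. This ensures independence between all values. For example, an AND gate with index $g$ where both signal bits are 0 can be garbled as follows:

\begin{align*}
& F_{k_i^0}(g \| 00) \oplus F_{k_j^0}(g \| 00) \oplus k_\ell^0 \\
& F_{k_i^0}(g \| 01) \oplus F_{k_j^1}(g \| 01) \oplus k_\ell^0 \\
& F_{k_i^1}(g \| 10) \oplus F_{k_j^0}(g \| 10) \oplus k_\ell^0 \\
& F_{k_i^1}(g \| 11) \oplus F_{k_j^1}(g \| 11) \oplus k_\ell^1
\end{align*}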
One way of looking at this is simply double-encryption in "counter mode"; intuitively this is therefore secure (as with all of our constructions, full proofs appear in the full version of the paper [11]).
Needless to say, 4-to-3 GRR can also be carried out by setting $k_\ell^0 = F_{k_i^0}(g \| 00) \oplus F_{k_j^0}(g \| 00)$, meaning that the first ciphertext equals 0 and so need not be stored. Observe here that there are 8 encryptions. However, all inputs are known and therefore it is possible to pipeline these computations.
Note that it is essential to take both signal bits as part of the input of $F$. Otherwise, the scheme is not secure. To understand this, assume that the gate was garbled as in the example above but without using signal bits (e.g., the value of the first entry is $F_{k_i^0}(g) \oplus F_{k_j^0}(g) \oplus k_\ell^0$), and assume that the evaluator holds the keys $k_i^0, k_j^0$. The evaluator will compute $k_\ell^0$, but then it will also be able to compute $F_{k_j^1}(g)$ from the second garbled entry (since it knows $F_{k_i^0}(g)$ and $k_\ell^0$) and, likewise, $F_{k_i^1}(g)$ from the third. Now, the evaluator would be able to compute $k_\ell^1$ as well, using the fourth garbled entry. Taking both signal bits as part of $F$'s input prevents this from happening, as the evaluator cannot learn $F_{k_i^1}(g \| 11)$ and $F_{k_j^1}(g \| 11)$.
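The role of the signal bits in the input of $F$ can be checked with a minimal Python sketch. As assumptions for illustration only, HMAC-SHA256 truncated to 128 bits stands in for the AES-based pseudorandom function, both permutation bits are fixed to 0 so that entry $(a, b)$ sits at index $2a + b$, and the function names are hypothetical. The entries XOR one PRF output per input key with the output key, so all eight PRF calls are independent of each other.

```python
import hashlib
import hmac
import secrets

KEY_LEN = 16  # bytes per garbled value (assumed size)


def F(key: bytes, gate_id: int, bits: str) -> bytes:
    """PRF stand-in (assumption): HMAC-SHA256 truncated to 128 bits."""
    msg = gate_id.to_bytes(4, "big") + bits.encode()
    return hmac.new(key, msg, hashlib.sha256).digest()[:KEY_LEN]


def xor(*values: bytes) -> bytes:
    out = bytes(KEY_LEN)
    for v in values:
        out = bytes(x ^ y for x, y in zip(out, v))
    return out


def garble_and(gate_id, keys_i, keys_j, keys_out, with_signal_bits=True):
    """Garble an AND gate; all 8 PRF inputs are known up front."""
    table = []
    for a in (0, 1):
        for b in (0, 1):
            tweak = f"{a}{b}" if with_signal_bits else ""
            table.append(xor(F(keys_i[a], gate_id, tweak),
                             F(keys_j[b], gate_id, tweak),
                             keys_out[a & b]))
    return table


def evaluate(gate_id, table, k_i, a, k_j, b, with_signal_bits=True):
    """Decrypt the single entry selected by the signal bits (a, b)."""
    tweak = f"{a}{b}" if with_signal_bits else ""
    return xor(table[2 * a + b], F(k_i, gate_id, tweak), F(k_j, gate_id, tweak))


def attack_without_signal_bits(gate_id, table, k_i0, k_j0):
    """Holding only k_i^0 and k_j^0, try to peel off the key for output 1."""
    f_i0, f_j0 = F(k_i0, gate_id, ""), F(k_j0, gate_id, "")
    k_out0 = xor(table[0], f_i0, f_j0)
    f_j1 = xor(table[1], f_i0, k_out0)  # entry (0,1) = F_i0 ^ F_j1 ^ k_out0
    f_i1 = xor(table[2], f_j0, k_out0)  # entry (1,0) = F_i1 ^ F_j0 ^ k_out0
    return xor(table[3], f_i1, f_j1)    # entry (1,1) = F_i1 ^ F_j1 ^ k_out1
```

When the table is garbled without signal bits, `attack_without_signal_bits` returns the key for output 1 even though the evaluator's inputs are (0, 0). With signal bits in the PRF input, the same procedure yields an unrelated value.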
Pipelining Key Schedule and Encryption
The computations that are needed for garbling and evaluating garbled circuits are as follows:
-KS4 ENC8: This consists of the computation of 4 AES key schedules from 4 different keys. The resulting keys are then used to encrypt 8 blocks (each key is used for encrypting 2 blocks). This is used for garbling AND (and other non-XOR) gates.
-KS2 ENC2: This consists of the computation of 2 AES key schedules from 2 different keys. The resulting keys are then used to encrypt 2 blocks (each key is used for encrypting 1 block). This is used for evaluating all gates.
-KS4 ENC4: This consists of the computation of 4 AES key schedules from 4 different keys. The resulting keys are then used to encrypt 4 blocks (each key is used for encrypting 1 block). This is used for garbling XOR gates according to our new XOR-gate garbling scheme described in Section 3.2.
A naïve software implementation approach for these computations would use the appropriate sequence of calls to a "key expansion" function, and to a "block encryption" function.
To estimate the performance of that approach, we use, as a comparison baseline, the OpenSSL (1.0.2) library, running on the Haswell architecture.
Software running on this processor can use the AES hardware support, known as AES-NI (see [8, 9] for details). On this platform, a call (using the OpenSSL library) to an AES key expansion consumes 149 CPU cycles. A call to an (ECB) encryption function to encrypt 2/4/8 blocks consumes approximately 70+ cycles (explanation is provided below). However, OpenSSL's API does not support ECB encryption with multiple key schedules. For example, this implies that KS4 ENC4 would require 4 calls to the key expansion function, followed by 4 calls to an ECB encryption, each one applied to a single (16B) block. The resulting performance of KS4 ENC4, KS4 ENC8, KS2 ENC2 obtained by calling OpenSSL's functions (namely "aesni_set_encrypt_key" and "aesni_ecb_encrypt") is summarized in the middle column of Table 1 at the end of this section.
Our goal is to optimize the computations of KS4 ENC4, KS4 ENC8, KS2 ENC2, and alleviate the overhead imposed by the frequent key replacements. We achieve our optimization by: (a) interleaving the encryption of independent blocks; (b) optimizing the key expansion; (c) aggressive interleaving of the operations; (d) building an API that allows for encrypting with multiple key schedules. The details are as follows.
Interleaved encryption. AES encryption on a modern processor is accelerated by using the AES-NI instructions (see [8, 9]). Assuming that the cipher key is expanded to a key schedule of 11 round keys, RK[j], j = 0, ..., 10, AES encryption of a 16-byte block X is achieved by the following code sequence: a whitening step X = XOR(X, RK[0]), then X = AESENC(X, RK[j]) for j = 1, ..., 9, and finally X = AESENCLAST(X, RK[10]).
Footnote 3. Haswell (resp., Broadwell) is an Intel Architecture Codename of a recently announced 4th (resp., 5th) Generation Intel® Core™ Processor. For short, we refer to them simply as Haswell (resp., Broadwell).

If the latency of the AESENC/AESENCLAST instructions is L cycles, then the above flow can be completed in 1 + 10L cycles. However, if the throughput of AESENC/AESENCLAST is 1 (i.e., pipelining can be used and the processor can dispatch AESENC/AESENCLAST every cycle, if the data is available), and the computations encrypt more than one block, the software can interleave the AESENC/AESENCLAST invocations. This achieves a higher computational throughput, compared to the single block encryption. Furthermore, the AESENC/AESENCLAST instructions can be applied to any round key, even those generated by different key schedules. For example, 2 blocks X and Y can be encrypted with 2 different key schedules KS1 and KS2 by a code sequence that alternates between the two blocks: the two whitening XORs with KS1[0] and KS2[0] come first, then for each j = 1, ..., 9 an AESENC of X with KS1[j] is followed by an AESENC of Y with KS2[j], and finally each block is passed through AESENCLAST with round key 10 of its own key schedule.
These computations can be completed within 10L + 1 cycles (the 2 XORs of the whitening step can be executed in one cycle). Similarly, encrypting 4/8 blocks with an interleaved software flow could (theoretically) terminate after (2 + 10L + 3)/(4 + 10L + 7) cycles. (This idealized estimation assumes that the round keys are fetched from the processor's cache, and ignores the cost of loading/storing the input/output blocks. We point out that the code sequence indeed closely approaches the theoretical performance, under these assumptions.) These computations are dominated only by the throughput of AESENC/AESENCLAST. Note that L = 7 on Haswell, and the AESENC/AESENCLAST throughput is 1. With these values, the estimates above evaluate to 71 cycles for 1 or 2 blocks, 75 cycles for 4 blocks, and 81 cycles for 8 blocks, which is consistent with the approximately 70+ cycles measured for ECB encryption of 2/4/8 blocks.

Optimized key expansion. We were able to optimize the computation of AES key expansion so that it computes (and stores) an AES128 key schedule in 96 cycles on Haswell, which is 1.55 times faster than the code used by OpenSSL on the same platform. The details of this optimization are quite low-level, and we provide here only some high-level details. A full set of key expansion code options was contributed to the NSS open source library, and can be found in [10].
The AES-NI instruction set includes instructions that facilitate key expansion. For the encryption key schedule, the relevant instruction is AESKEYGENASSIST. However, this instruction does not provide a throughput of 1 and is significantly slower than the AESENC and AESENCLAST operations (the reason being that key schedules are typically run only once and so the cost involved in optimizing this instruction was not justified). We observe that the key schedule consists of S-box substitutions together with rotation and XOR operations. Likewise, the last round of AES consists of S-box substitutions together with shift rows (and key mixing, which can be effectively cancelled by using a round key of all-zeroes). Thus, the use of AESKEYGENASSIST can be replaced by a combination of a shuffle followed by an AESENCLAST invocation, to isolate the S-box transformation. The shuffle is carried out efficiently using the PSHUFB instruction, which also has a throughput of 1. We therefore obtain that the key schedule can be "simulated" using much faster instructions.

Additional optimizations can be obtained by judicious usage of the available instructions to generate efficient sequences. We give one example. As explained above, the S-box substitution can be isolated by a shuffle followed by AESENCLAST, and if we place (duplicated) RCON in the second operand of AESENCLAST, the addition of RCON is also done by AESENCLAST. The arrangement and XOR-ing of the "words" can be implemented by a straightforward flow of 3 shuffles and 3 XORs. However, the same functionality can be achieved by a shorter flow of 4 instructions: the 3 shuffles and 3 XORs of the straightforward flow can be replaced by a shorter and faster sequence of 1 shift, 1 shuffle and 2 XORs. With our optimizations, we were able to write a key expansion code that computes and stores an AES128 key schedule in 96 cycles on Haswell (i.e., 1.55 times faster than OpenSSL).

Multiple aggressive interleaving. A higher degree of optimization can be achieved by interleaving the computations of multiple key expansions. This helps in partially alleviating the key expansion's dependency on the latency of AESENC. For example, our code for expanding 2 key schedules consumes 124 cycles (on Haswell), which is significantly less than two independent (non-interleaved) key schedules, which take 2 × 96 cycles. We applied this technique to obtain an optimized KS4 ENC4 and KS4 ENC8 implementation. For KS2 ENC2, optimization is achieved by "mixed interleaving" of the key expansion and the encryptions.
The performance of the optimized KS4 ENC4, KS4 ENC8, and KS2 ENC2 is summarized in the right column of Table 1.
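The instruction-level ideas of this section can be modelled in software. The following is a minimal pure-Python sketch, not the optimized C or assembly code measured here: `aesenc` and `aesenclast` are hypothetical functions that mimic the round operations of the corresponding AES-NI instructions on 16-byte values, `expand_key` isolates the S-box by applying `aesenclast` to a shuffled (rotated and broadcast) last word with a duplicated RCON as the round key, and `encrypt_two_interleaved` follows the alternating order used for two blocks under two key schedules. Python gains no speed from interleaving, so only the data flow is illustrated, and the code is not constant-time.

```python
def xtime(a: int) -> int:
    """Multiply by x in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    a <<= 1
    return a ^ 0x11B if a & 0x100 else a


def gmul(a: int, b: int) -> int:
    r = 0
    while b:
        if b & 1:
            r ^= a
        a, b = xtime(a), b >> 1
    return r


def _sbox_entry(x: int) -> int:
    inv = next((y for y in range(1, 256) if gmul(x, y) == 1), 0)
    s = inv
    for sh in range(1, 5):  # affine map: XOR of four left rotations
        s ^= ((inv << sh) | (inv >> (8 - sh))) & 0xFF
    return s ^ 0x63


SBOX = [_sbox_entry(x) for x in range(256)]


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(p ^ q for p, q in zip(a, b))


def _sub_shift(state: bytes) -> list:
    """ShiftRows then SubBytes on a column-major 16-byte state."""
    return [SBOX[state[r + 4 * ((c + r) % 4)]] for c in range(4) for r in range(4)]


def aesenclast(state: bytes, rk: bytes) -> bytes:
    """Model of AESENCLAST: ShiftRows, SubBytes, XOR round key."""
    return xor(bytes(_sub_shift(state)), rk)


def aesenc(state: bytes, rk: bytes) -> bytes:
    """Model of AESENC: ShiftRows, SubBytes, MixColumns, XOR round key."""
    s, out = _sub_shift(state), []
    for c in range(4):
        a = s[4 * c:4 * c + 4]
        for r in range(4):
            out.append(gmul(a[r], 2) ^ gmul(a[(r + 1) % 4], 3)
                       ^ a[(r + 2) % 4] ^ a[(r + 3) % 4])
    return xor(bytes(out), rk)


def expand_key(key: bytes) -> list:
    """AES-128 key schedule using a shuffle plus aesenclast for the S-box."""
    round_keys, rc = [bytes(key)], 1
    for _ in range(10):
        prev = round_keys[-1]
        # Models the shuffle: rotate the last word and copy it to all columns,
        # so ShiftRows has no effect and only the S-box is applied.
        shuffled = (prev[13:16] + prev[12:13]) * 4
        t = aesenclast(shuffled, bytes([rc, 0, 0, 0] * 4))[:4]  # RCON added too
        rc = xtime(rc)
        out = b""
        for c in range(4):
            t = xor(prev[4 * c:4 * c + 4], t)
            out += t
        round_keys.append(out)
    return round_keys


def encrypt_block(x: bytes, rk: list) -> bytes:
    x = xor(x, rk[0])  # whitening
    for j in range(1, 10):
        x = aesenc(x, rk[j])
    return aesenclast(x, rk[10])


def encrypt_two_interleaved(x: bytes, ks1: list, y: bytes, ks2: list):
    """Alternate rounds of two blocks that use two different key schedules."""
    x, y = xor(x, ks1[0]), xor(y, ks2[0])
    for j in range(1, 10):
        x = aesenc(x, ks1[j])
        y = aesenc(y, ks2[j])
    return aesenclast(x, ks1[10]), aesenclast(y, ks2[10])
```

The two data flows in `encrypt_two_interleaved` never depend on each other, which is what lets a processor with AESENC throughput 1 overlap them. The sketch can be checked against the AES-128 example vectors in FIPS 197.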
Experimental Results
Table 1: Performance of KS4 ENC4, KS4 ENC8, and KS2 ENC2 using the OpenSSL (1.0.2) functions for AES key expansion and for ECB encryption, compared with the optimized implementations. The performance of the optimized implementations is of C code (compiled using gcc), and of handwritten assembly implementations (marked with "asm").

The results in Table 2 show the garbling and evaluation time of 1000 AES circuits, using the free-XOR technique and 4-to-3 row reduction (as used by JustGarble, in order to make a fair comparison). All methods use pipelining of the encryptions (the last two entries do not use a fixed key and therefore use the encryption pipelining method described in Section 2.1). The last entry is based on using also the key scheduling pipelining method described in Section 2.2. The table shows the results for garbling and evaluating the circuit.

The results show that pipelining the key schedule as well as the encryptions (3rd row) reduces time by more than 50% over pipelining the encryptions only (2nd row). Fixed-key AES (1st row) does provide a significant improvement and the best performance. However, the gain in using fixed-key AES is not overwhelming, since, as we will show later on, in many settings the main cost of secure computation is no longer the garbling itself. Namely, although AES takes 86% more time without a fixed key, the objective difference is just 0.344 milliseconds. Thus, when run in a protocol that includes communication, this additional time makes almost no difference. We demonstrate this in our experiments described in Section 5.
GARBLING UNDER A PSEUDORANDOM FUNCTION ASSUMPTION ONLY

Background
The free-XOR technique [16] is one of the most significant optimizations of garbling. When using this technique, the garbling and evaluation of XOR gates are essentially for free, requiring only two XOR operations for garbling and one for evaluating. In addition, no garbled table is used, thereby significantly reducing communication. However, the free-XOR technique also requires non-standard assumptions. Specifically, when using this method, there is a global offset $\Delta$, and on every wire a single random $k_i^0$ is chosen and the other key is always set to $k_i^1 = k_i^0 \oplus \Delta$.
This is secure in the random oracle model [16] or under a circular-secure correlation robustness or related-key assumption [7] (correlation robustness is formalized for hash functions whereas related-key security is for encryption or pseudorandom functions). The need for this assumption is due to the fact that when a global offset is used, multiple encryptions are made under related keys $k_a$, $k_a \oplus \Delta$, $k_b$, $k_b \oplus \Delta$, and so on. In addition, since these keys are used to encrypt the values $k_c$ and $k_c \oplus \Delta$, the ciphertext is related to the secret key, which is exactly circular security. We remark that at some additional cost, the circularity assumption can be removed using the FleXOR technique [15]. However, the correlation robustness/related-key assumption remains (see footnote 5).

We next show that it is possible to efficiently garble a circuit using a pseudorandom function only. We first show a basic version of our garbling scheme, where the garbled table for an XOR gate contains a single ciphertext and requires 4 pseudorandom function operations for garbling (instead of 8 for an AND gate), and 2 for evaluation. We then show an optimized version that reduces the number of PRF invocations to 3 calls for garbling, and 1-2 calls for evaluation. The overhead of these schemes is definitely beyond that of the free-XOR technique. However, as we will show, the techniques are a considerable improvement over the naive method of computing XOR like an AND gate, they enable the usage of 4-2 garbled row reduction (4-2 GRR), and within protocols (where communication and other factors become the bottleneck) they perform well.
Garbled XOR With a Single Ciphertext
In order to prove security solely under the assumption that the primitive used is a pseudorandom function, all the garbled values on all wires should be independently chosen. Thus, for all pairs of wires $i$ and $j$, the keys $k_i^0, k_i^1, k_j^0, k_j^1$ should be independent and either uniformly distributed or pseudorandom. It will be useful to equivalently write the keys as $k_i^0$, $k_i^1 = k_i^0 \oplus \Delta_i$ and $k_j^0$, $k_j^1 = k_j^0 \oplus \Delta_j$, where the offsets $\Delta_i$ and $\Delta_j$ are independent.

We use the point-and-permute method, described briefly in the introduction. In order to avoid confusion, we will call the bit used to determine the order of the ciphertexts in the garbling phase the permutation bit (since it determines the random order), and we call the bit that is viewed by the evaluator when it evaluates the circuit the signal bit (since it signals which ciphertext is to be decrypted). We denote the permutation bit on wire $i$ by $\pi_i$, and we denote the signal bit on wire $i$ by $\lambda_i$. Observe that if the evaluator has bit $v_i$ on wire $i$ (for which it does not know the value), then it holds that $\lambda_i = \pi_i \oplus v_i$. Thus, if $\pi_i = 0$, then the evaluator will see $\lambda_i = v_i$, and if $\pi_i = 1$ then the evaluator will see $\lambda_i = \bar{v}_i$ (the complement of $v_i$). Since $\pi_i$ is random, this reveals nothing about $v_i$ whatsoever.
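The relation between permutation bits and signal bits can be illustrated with a few lines of Python. This is only a sketch: the function names and the row-indexing convention (row = 2·λi + λj) are assumptions made for the example and are not taken from the paper.

```python
import secrets

def signal_bit(pi: int, v: int) -> int:
    """Signal bit seen by the evaluator: lambda = pi XOR v."""
    return pi ^ v

def table_row(lam_i: int, lam_j: int) -> int:
    """Row of a 4-row garbled table selected by two signal bits.

    The ordering 2*lam_i + lam_j is an illustrative convention only.
    """
    return 2 * lam_i + lam_j

if __name__ == "__main__":
    pi = secrets.randbits(1)  # chosen by the garbler, hidden from the evaluator
    for v in (0, 1):
        # For a uniformly random pi, the printed signal bit is uniform
        # regardless of the actual wire value v.
        print(f"pi={pi} v={v} -> signal bit {signal_bit(pi, v)}")
```

For a fixed permutation bit the map from wire value to signal bit is a bijection, so the evaluator can always select a row, while a uniformly random permutation bit makes the signal bit independent of the wire value.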
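We now describe the basic XOR gate garbling method that uses just a single ciphertext. The method requires 4 calls to a pseudorandom function for garbling, but as we have seen, this is inexpensive using AES-NI. (We remark that AND gates are garbled in the standard way, independently of this method.) Denote the input wires to the gate by $i, j$ and denote the output wire from the gate by $\ell$. We therefore have the input keys $k_i^0$, $k_i^1 = k_i^0 \oplus \Delta_i$ and $k_j^0$, $k_j^1 = k_j^0 \oplus \Delta_j$. According to the above, we denote by $\pi_i, \pi_j$ the permutation bits on wires $i$ and $j$ respectively. As we will see, the keys on the output wire will be determined as a result of the garbling method.

Footnote 5: We note that garbling with hash functions is much slower than with AES, especially when an AES-NI supporting architecture is utilized. Thus, related-key security for AES is required, which is a less than ideal assumption.

The method for garbling a XOR gate with index $g$ is as follows. First, the keys on wire $i$ are translated to the new keys $\tilde{k}_i^0 = F_{k_i^0}(g)$ and $\tilde{k}_i^1 = F_{k_i^1}(g)$, and the offset of the output wire is defined as $\Delta_\ell = \tilde{k}_i^0 \oplus \tilde{k}_i^1$. Next, we translate the input keys on wire $j$ so that they too have the offset $\Delta_\ell$ (this will enable the output key to be computed by XORing the translated input keys, as in the free-XOR technique). Thus, we set

$$\tilde{k}_j^{\pi_j} = F_{k_j^{\pi_j}}(g), \qquad \tilde{k}_j^{\bar{\pi}_j} = \tilde{k}_j^{\pi_j} \oplus \Delta_\ell,$$

where $\pi_j$ is the random permutation bit that is associated with the bit 0 on wire $j$. If the evaluator holds the key $k_j^{\pi_j}$ (as we show, this can be implicitly determined from the signal bit $\lambda_j$), then it can compute $\tilde{k}_j^{\pi_j}$. The only problem is that when it holds $k_j^{\bar{\pi}_j}$ it cannot compute $\tilde{k}_j^{\bar{\pi}_j}$, since it does not know $\Delta_\ell$ (and furthermore $\Delta_\ell$ cannot be revealed). Thus, the ciphertext for the gate is set to

$$T = F_{k_j^{\bar{\pi}_j}}(g) \oplus \tilde{k}_j^{\bar{\pi}_j}.$$

The keys on the output wire are $k_\ell^0 = \tilde{k}_i^0 \oplus \tilde{k}_j^0$ and $k_\ell^1 = k_\ell^0 \oplus \Delta_\ell$.

In order to evaluate a XOR gate $g$ with ciphertext $T$, given a key $k_i$ on wire $i$ and a key $k_j$ on wire $j$, the evaluator simply needs to compute $\tilde{k}_i = F_{k_i}(g)$ and either $\tilde{k}_j = F_{k_j}(g)$ if it has signal bit 0 on wire $j$, or $\tilde{k}_j = F_{k_j}(g) \oplus T$ if it has signal bit 1. Then, the key on the output wire is obtained by computing $k_\ell = \tilde{k}_i \oplus \tilde{k}_j$.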
The computational cost of garbling the gate is 4 pseudorandom function computations, and the computational cost of evaluating the gate is 2 pseudorandom function computations. Most significantly, the gate table includes only a single ciphertext.
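To make the single-ciphertext XOR gate concrete, the following Python sketch garbles and evaluates one gate. It is an illustration under stated assumptions, not the authors' implementation: the PRF is instantiated with HMAC-SHA256 truncated to 128 bits (the paper uses AES), the PRF input is just the gate index, and the names `prf`, `garble_xor` and `eval_xor` are hypothetical.

```python
import hashlib
import hmac
import secrets

N = 16  # key length in bytes (128 bits)

def prf(key: bytes, gate: int) -> bytes:
    # Stand-in PRF (assumption): HMAC-SHA256 truncated to 128 bits.
    return hmac.new(key, gate.to_bytes(8, "big"), hashlib.sha256).digest()[:N]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def garble_xor(g, ki, kj, pi_j):
    """ki, kj: pairs (k^0, k^1) of independent keys; pi_j: permutation bit of wire j.

    Returns the single ciphertext T and the pair of output-wire keys.
    """
    ti0, ti1 = prf(ki[0], g), prf(ki[1], g)      # PRF calls 1 and 2: translate wire i
    delta = xor(ti0, ti1)                        # gate-specific offset
    tj = [b"", b""]
    tj[pi_j] = prf(kj[pi_j], g)                  # PRF call 3: key with signal bit 0
    tj[1 - pi_j] = xor(tj[pi_j], delta)          # forced to share the offset
    T = xor(prf(kj[1 - pi_j], g), tj[1 - pi_j])  # PRF call 4: mask for the table entry
    out0 = xor(ti0, tj[0])
    return T, (out0, xor(out0, delta))

def eval_xor(g, T, k_i, k_j, lam_j):
    """Evaluate with one key per wire and the signal bit of wire j (2 PRF calls)."""
    ti = prf(k_i, g)
    tj = prf(k_j, g)
    if lam_j == 1:
        tj = xor(tj, T)
    return xor(ti, tj)

if __name__ == "__main__":
    ki = (secrets.token_bytes(N), secrets.token_bytes(N))
    kj = (secrets.token_bytes(N), secrets.token_bytes(N))
    pi_j = secrets.randbits(1)
    T, out = garble_xor(1, ki, kj, pi_j)
    for vi in (0, 1):
        for vj in (0, 1):
            assert eval_xor(1, T, ki[vi], kj[vj], pi_j ^ vj) == out[vi ^ vj]
    print("all four input combinations evaluate to the correct output key")
```

The sketch omits the signal bit of the output wire and the truncation details of the full scheme; it only checks that the translated keys on both wires share the same offset, which is what lets the output key be obtained by a plain XOR.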
Reducing the number of PRF calls to 3. Observe that the pseudorandom function is used to ensure independence of the $\Delta$ values between different gates. If we were to just take $\Delta_\ell = k_i^0 \oplus k_i^1$, then the output offsets of two different gates with the same input wire $i$ would be the same, and once again correlation robustness or a related-key assumption would be needed. Thus, it is necessary to compute $\tilde{k}_i^0 = F_{k_i^0}(g)$ and $\tilde{k}_i^1 = F_{k_i^1}(g)$. In contrast, $\tilde{k}_j^{\pi_j}$ (the translated key with signal bit 0 on wire $j$) can be taken to simply be $k_j^{\pi_j}$, and the pseudorandom function computation is not needed. This is because $\Delta_\ell$ is fixed independently of wire $j$. Using this method, we can reduce the computational cost of garbling the XOR gate from 4 pseudorandom function computations to 3 (and the computational cost of evaluating the gate is decreased from 2 to either 1 or 2 PRF computations).

Garbling NOT Gates. When using free XOR, it is possible to efficiently garble NOT gates by simply defining them to be XOR with a fixed wire that is always given value 1. Since the XOR gates are free, this is highly efficient. However, since we are not using free XOR, a different method needs to be found. Fortunately, NOT gates can still be computed for free, and with no additional assumption. In order to see this, let $g$ be a NOT gate with input wire $i$ and output wire $j$, and let $k_i^0, k_i^1$ be the keys on wire $i$. The keys on wire $j$ can then be defined by simply swapping roles, $k_j^0 = k_i^1$ and $k_j^1 = k_i^0$, so that the key the evaluator holds on wire $i$ is already the correct key on wire $j$.

The full specification and security. The detailed description of this garbling scheme, along with its full proof of security, appears in the full version of this paper [11].
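A free NOT gate can be sketched as a pure relabeling of keys. The Python fragment below is a hedged illustration: the function name `garble_not` is hypothetical, and the handling of the permutation bit (flipping it so that the evaluator's signal bit stays unchanged) is an assumption of this sketch rather than a detail stated in the text.

```python
import secrets

N = 16  # key length in bytes

def garble_not(k_in, pi_in):
    """Garble a NOT gate without any PRF call or ciphertext.

    k_in:  pair (k^0, k^1) on the input wire.
    pi_in: permutation bit of the input wire.
    Returns the output-wire key pair and the output-wire permutation bit.
    """
    k0, k1 = k_in
    # Swap the roles of the keys; flip the permutation bit (assumption)
    # so that the signal bit attached to each key is preserved.
    return (k1, k0), 1 - pi_in

if __name__ == "__main__":
    k_in = (secrets.token_bytes(N), secrets.token_bytes(N))
    pi_in = secrets.randbits(1)
    k_out, pi_out = garble_not(k_in, pi_in)
    for v in (0, 1):
        # The evaluator keeps the very same key; it now represents NOT v.
        assert k_out[1 - v] == k_in[v]
        assert pi_out ^ (1 - v) == pi_in ^ v
    print("NOT gate evaluated with no work by the evaluator")
```

Under this convention the evaluator does nothing at a NOT gate: the key and signal bit it already holds are valid for the output wire.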
Intuition for security. The detailed proof of security appears in the full version [11]. We describe here only the intuition behind the proof. Writing the input of the pseudorandom function in full as the gate index concatenated with the signal bit, and writing $[1..n]$ for the first $n$ bits of its output, observe that the ciphertext in a XOR gate with input wires $i, j$ and output wire $\ell$ equals

$$T = F_{k_j^{\bar{\pi}_j}}(g \| 1)[1..n] \oplus \tilde{k}_j^{\bar{\pi}_j}$$

and that

$$\tilde{k}_j^{\bar{\pi}_j} = F_{k_j^{\pi_j}}(g \| 0)[1..n] \oplus \Delta_\ell.$$

In addition, by the way we defined $\Delta_\ell$, we have

$$\Delta_\ell = \tilde{k}_i^0 \oplus \tilde{k}_i^1 = F_{k_i^{\pi_i}}(g \| 0)[1..n] \oplus F_{k_i^{\bar{\pi}_i}}(g \| 1)[1..n].$$

Thus,

$$T = F_{k_i^{\pi_i}}(g \| 0)[1..n] \oplus F_{k_i^{\bar{\pi}_i}}(g \| 1)[1..n] \oplus F_{k_j^{\pi_j}}(g \| 0)[1..n] \oplus F_{k_j^{\bar{\pi}_j}}(g \| 1)[1..n] \qquad (1)$$

where $\pi_i, \pi_j$ are the permutation bits that are associated with the bit 0 on wires $i, j$ respectively. Stated differently, the ciphertext in a XOR gate is the result of XORing the four outputs of the pseudorandom function, as shown above. Each one of these four computations uses a different key, from which only two keys are known to the evaluator. Since we use the gate index as an input to the function, we are guaranteed that when a wire enters multiple gates, the pseudorandom values we compute will be different in each of the gates. Thus, the ciphertext looks like a random string to the evaluator. In addition, the output-wire key values are determined by the result of the pseudorandom function computation as well. Thus, they are new keys that do not appear elsewhere in the circuit. We stress that the four values in the equation above are not the four new translated keys. If that were the case, then XORing them would yield 0, because the same offset is used in both wires after the translation. Instead, the first three values are the translated keys, but the last value is just a pseudorandom string that is used to mask them in a "one-time pad"-like encryption.

Footnote 6: It may be tempting to propose that one of k …
A similar argument applies for AND gates. The evaluator can compute only two of the eight PRF computations using the two keys it holds, and the values that are used in computing the garbled table are unique and do not appear elsewhere in the circuit (again, this is ensured by using the gate index and the permutation bits as input to each pseudorandom function computation). Therefore, the gate ciphertexts that are not associated with the keys known to the evaluator look random to the evaluator.
XOR Gates with Only Three PRF Computations
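Our garbling method requires four calls to the pseudorandom function for garbling XOR gates, where each call uses a different key. In this section we show that it is possible to remove one of these calls by leaving one of the input keys unchanged. Recall that our ciphertext for a XOR gate $g$ with input wires $i, j$ and output wire $\ell$ is:

$$T = F_{k_i^{\pi_i}}(g \| 0)[1..n] \oplus F_{k_i^{\bar{\pi}_i}}(g \| 1)[1..n] \oplus F_{k_j^{\pi_j}}(g \| 0)[1..n] \oplus F_{k_j^{\bar{\pi}_j}}(g \| 1)[1..n] \qquad (1)$$

Assume the evaluator has the keys $k_i^{v_i}, k_j^{v_j}$ and the signal bits $\lambda_i, \lambda_j$ when computing the gate. Then, using the ciphertext and the two PRF values it can compute by itself, it can obtain

$$T \oplus F_{k_i^{v_i}}(g \| \lambda_i)[1..n] \oplus F_{k_j^{v_j}}(g \| \lambda_j)[1..n] = F_{k_i^{\bar{v}_i}}(g \| \bar{\lambda}_i)[1..n] \oplus F_{k_j^{\bar{v}_j}}(g \| \bar{\lambda}_j)[1..n],$$

which is the XOR of two pseudorandom values. If we leave, for example, the value of $k_j^{\bar{v}_j}$ unchanged, i.e., use it in Eq. (1) instead of $F_{k_j^{\bar{v}_j}}(g \| \bar{\lambda}_j)[1..n]$, then the evaluator will be able to compute

$$F_{k_i^{\bar{v}_i}}(g \| \bar{\lambda}_i)[1..n] \oplus k_j^{\bar{v}_j}.$$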
Observe that the evaluator still cannot learn anything since one of the two values is a new pseudorandom value that does not appear anywhere else in the circuit. Therefore, the ciphertext is pseudorandom as required. In addition, since the two keys on wire i are still translated to new keys, the output wire keys, generated in the same way as before, are guaranteed to obtain new fresh values. (See Footnote 6 as to why we cannot use the same method to remove one of the pseudorandom function calls on wire i as well.)
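The three-call variant can be sketched by leaving the signal-bit-0 key on wire j untranslated and counting PRF invocations. As before, this is an illustrative Python sketch under assumptions: HMAC-SHA256 truncated to 128 bits stands in for the AES-based PRF, the PRF input is only the gate index, and the names `garble_xor3`, `eval_xor3` and `CALLS` are hypothetical.

```python
import hashlib
import hmac
import secrets

N = 16
CALLS = {"n": 0}  # counts PRF invocations, for illustration only

def prf(key: bytes, gate: int) -> bytes:
    # Stand-in PRF (assumption): HMAC-SHA256 truncated to 128 bits.
    CALLS["n"] += 1
    return hmac.new(key, gate.to_bytes(8, "big"), hashlib.sha256).digest()[:N]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def garble_xor3(g, ki, kj, pi_j):
    """Garble a XOR gate with three PRF calls instead of four."""
    ti0, ti1 = prf(ki[0], g), prf(ki[1], g)      # wire i is still translated
    delta = xor(ti0, ti1)
    tj = [b"", b""]
    tj[pi_j] = kj[pi_j]                          # signal-bit-0 key left unchanged
    tj[1 - pi_j] = xor(tj[pi_j], delta)
    T = xor(prf(kj[1 - pi_j], g), tj[1 - pi_j])  # third and last PRF call
    out0 = xor(ti0, tj[0])
    return T, (out0, xor(out0, delta))

def eval_xor3(g, T, k_i, k_j, lam_j):
    """One PRF call when lam_j == 0, two when lam_j == 1."""
    ti = prf(k_i, g)
    tj = k_j if lam_j == 0 else xor(prf(k_j, g), T)
    return xor(ti, tj)

if __name__ == "__main__":
    ki = (secrets.token_bytes(N), secrets.token_bytes(N))
    kj = (secrets.token_bytes(N), secrets.token_bytes(N))
    T, out = garble_xor3(9, ki, kj, 1)
    print("PRF calls for garbling:", CALLS["n"])
```

The call counter makes the cost claim checkable: garbling uses three calls, and evaluation uses one or two depending on the signal bit on wire j.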
SIMPLE AND FAST 4-2 GRR
Overview
Abstractly, gate garbling typically works by generating four pseudorandom masks $K_0, K_1, K_2, K_3$, corresponding to the four possible input combinations, in some permuted order. In the notation we have used so far, each mask is derived by applying the pseudorandom function under the two input-wire keys of the corresponding row ($K_0$ from $k_i^{\pi_i}$ and $k_j^{\pi_j}$, and so on); note that $K_1$ equals the value used to mask the output key in $T_1$. The evaluator of the circuit is able to compute one of these four masks, and can also use the signal bits to identify the index of that mask. Namely, it computes a pair $(i, K_i)$ (but is unable to identify the real input combination corresponding to the value that it computed).

In our base scheme described in the previous section, we garbled non-XOR gates with three ciphertexts for each garbled gate. One of the ciphertexts was "removed" by setting one of the keys on the output wire to actually be $K_0$ rather than using $K_0$ to mask the key (this is called garbled row reduction, or GRR for short). In this section we improve on this by applying a 4-2 row reduction technique on these gates in order to remove an additional ciphertext. There are two known such techniques: the 4-to-2 reduction method of [22] and the new "Half-Gates" approach of [24]. The "Half-Gates" technique was designed to be compatible with the free-XOR technique and actually requires free-XOR; as such, it is based on the circularity assumption and so is not suitable for this paper. In contrast, the 4-2 GRR technique of [22] does not require free-XOR; it has been proved secure under a standard assumption only and can be incorporated into our scheme. However, in this technique, the generation of the garbled table by the circuit garbler, as well as the computation of the output wire key given two ciphertexts of the gate table and the $K$ value, are carried out by interpolating a degree-2 polynomial.

We describe here a different 4-to-2 garbling method where the garbling and evaluation of the gate use only simple XOR operations. This is preferable for two major reasons:
• Efficiency: Polynomial interpolation uses three finite field multiplications and two additions (after the Lagrange coefficients are precomputed). The overhead of computing the multiplications is rather high, even when implemented in $GF(2^{128})$. For example, our implementation, which used the PCLMULQDQ Intel instruction, needed about half as many cycles as AES encryption.
• Simpler coding: Efficient implementation of polynomial interpolation, especially over $GF(2^{128})$, and using machine instructions rather than calling a software library, requires some expertise and is significantly harder to code than a few XOR operations.
Gate evaluation. We first describe the process of evaluating a gate. We will then describe the garbling procedure which enables this gate evaluation procedure. Although this is somewhat reversed (as one would expect a description of how garbling is computed first), we present it this way as we find it clearer.
The gate evaluator receives as input a gate table with two entries $[T_1, T_2]$, an index $i \in \{0, 1, 2, 3\}$, and a value $K_i$ computed from the two garbled values of the input wires (note that $T_1$, $T_2$, $K_i$ are all 128-bit strings). It computes the garbled output wire key $k_{out}$ in the following way:

$$k_{out} = \begin{cases} K_0 & \text{if } i = 0 \\ K_1 \oplus T_1 & \text{if } i = 1 \\ K_2 \oplus T_2 & \text{if } i = 2 \\ K_3 \oplus T_1 \oplus T_2 & \text{if } i = 3 \end{cases}$$

Garbling. We now show how to garble AND gates so that the evaluation described above provides correct evaluation. Due to the random permutation applied to the rows (via the permutation bits), the single output bit "1" of these gates might correspond to any of the masks $K_0, K_1, K_2, K_3$. Denote the index of that mask by $s \in \{0, 1, 2, 3\}$, and denote by $k^0_{out}, k^1_{out}$ the output wire keys. We need to choose the output keys and the table entries so that the evaluation procedure above returns $k^1_{out}$ when $i = s$ and $k^0_{out}$ otherwise.
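In the 4-to-3 row reduction method, the garbled-gate entry $T_0$ is always 0, and therefore (1) there is no need to store and communicate that entry, and (2) the output wire key associated with row 0 is always $K_0$, whereas the second output wire key can be chosen freely. In our new garbling method we use the freedom in choosing the second output wire key to always set it to $K_1 \oplus K_2 \oplus K_3$. As a result, and as will be explained below, the garbled table will have the property that entry $T_3$ of the table satisfies $T_3 = T_1 \oplus T_2$. Therefore $T_3$ can be computed at run-time by the evaluator and need not be stored or sent. In summary, garbling is carried out as follows:
• If $s \neq 0$, set $k^0_{out} = K_0$ and $k^1_{out} = K_1 \oplus K_2 \oplus K_3$.
• If $s = 0$, set $k^1_{out} = K_0$ and $k^0_{out} = K_1 \oplus K_2 \oplus K_3$.
This fully defines the garbled table, as follows: for $i \in \{1, 2\}$,

$$T_i = \begin{cases} K_i \oplus k^1_{out} & \text{if } i = s \\ K_i \oplus k^0_{out} & \text{otherwise.} \end{cases}$$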
See Table 3 for the full definition of the garbled table $[T_1, T_2]$ and the definition of the output wires, depending on the permutation (recall that $s$ is the index of the mask $K_s$ that corresponds to the output key $k^1_{out}$). It is easy to verify correctness by tracing the computation in each case according to the table.
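The XOR-only 4-2 row reduction can be checked mechanically for all positions of the 1-row. The Python sketch below takes the four masks as given random strings (in the scheme they come from the PRF) and uses hypothetical function names `garble_and_42` and `eval_and_42`; it is meant as an illustration of the rule described above, not as the authors' code.

```python
import secrets

N = 16  # mask length in bytes

def xor(*parts: bytes) -> bytes:
    out = bytes(len(parts[0]))
    for p in parts:
        out = bytes(a ^ b for a, b in zip(out, p))
    return out

def garble_and_42(K, s):
    """K: the four masks K0..K3; s: index of the row whose output bit is 1.

    Returns the two-entry table (T1, T2) and the output keys [k0_out, k1_out].
    """
    rest = xor(K[1], K[2], K[3])
    k_out = [K[0], rest] if s != 0 else [rest, K[0]]
    T1 = xor(K[1], k_out[1 if s == 1 else 0])
    T2 = xor(K[2], k_out[1 if s == 2 else 0])
    return (T1, T2), k_out

def eval_and_42(T, i, Ki):
    """Recover the output key from row index i and mask Ki using XORs only."""
    T1, T2 = T
    if i == 0:
        return Ki
    if i == 1:
        return xor(Ki, T1)
    if i == 2:
        return xor(Ki, T2)
    return xor(Ki, T1, T2)  # T3 = T1 xor T2 is recomputed, not transmitted

if __name__ == "__main__":
    K = [secrets.token_bytes(N) for _ in range(4)]
    for s in range(4):
        T, k_out = garble_and_42(K, s)
        for i in range(4):
            assert eval_and_42(T, i, K[i]) == k_out[1 if i == s else 0]
    print("4-2 GRR is correct for every position of the 1-row")
```

Running the loop over all sixteen (s, i) pairs plays the role of tracing Table 3 case by case.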
An alternative way to verify that the new scheme is correct is to observe that the output wire key computed for $K_3$ is always $K_3 \oplus T_1 \oplus T_2$, and that in each of the four cases this equals the output key that row 3 should yield.

Encoding the permutation bits. The permutation bits can be encoded in a similar way to that suggested in [22]. Two changes are applied to the basic garbling scheme:
• The garbled values are only n−1 bits long, whereas the values Ki are still n bits long (concretely here, we use n = 128). Therefore, the function used for generating the Ki inputs has n−1-bit inputs and an n-bit output. We denote the least significant bit of Ki by mi. Only n − 1 bits of Ki are used for computing the garbled key of the output wire, using the procedure described above. Consequently, the values T1, T2 of the garbled table are also only n − 1 bits long.
• We add 4 bits to the table. The ith of these bits is the XOR of mi with the permutation bit of the corresponding output value.
The total length of a gate table is now $2(n - 1) + 4 = 2n + 2$ bits (concretely 258 bits). The evaluation of a gate is performed by computing $K_i$; using its most significant $n - 1$ bits for computing the corresponding garbled output value; and using its least significant bit $m_i$ for computing the corresponding signal bit. As for security, note that the $m_i$ bits are pseudorandom, and are used only for the encryption of the permutation/signal values.

Intuition for Security. Recall that the 4-to-3 garbled row reduction scheme enables an arbitrary choice of the output wire key that is not the one associated with entry $T_0$. The new 4-to-2 garbled row reduction scheme that we present is a special case, where we define that output wire key to be equal to $K_1 \oplus K_2 \oplus K_3$. Note that the evaluator can compute one of the $K_i$ values using the two keys it holds, and can obtain two of the other three using $T_1, T_2$. However, in order to learn the other output wire key it needs the one $K_i$ value that it cannot compute. Thus, from the point of view of the evaluator, the other output wire key is a random string, as required.
The complete specification of the row-reduction scheme and its proof of security appear in the full version of this paper [11].
EXPERIMENTAL RESULTS AND DISCUSSION
In the previous sections, we presented four tools that can optimize the performance of garbled circuits without relying on any additional cryptographic assumption beyond the existence of pseudorandom functions: (1) pipelined garbling; (2) pipelined key-scheduling; (3) XOR gates with one ciphertext and three encryptions; and (4) improved 4-2 GRR for AND gates. In this section, we present the results of an experimental evaluation of these methods (together and separately) and compare their performance to that of other garbling methods.

Table 4 shows the time it takes to run the full Yao semi-honest protocol [19, 23] on three different circuits of interest: AES, SHA-256 and Min-Cut 250,000. The circuits have 6,800, 90,825 and 999,960 AND gates, respectively, and 25,124, 42,029 and 2,524,920 XOR gates, respectively. The numbers of input bits for which OTs are performed are 128, 256 and 250,000, respectively [1]. We remark that our implementation of the semi-honest protocol of Yao utilizes the highly optimized OT extension protocol of [2].

We examined eight different schemes, described using the following notation: [pipe-garble] for the pipelined garbling method; [pipe-garble+KS] for both pipelined garbling and pipelined key-scheduling; [fixed-key] where all PRF evaluations were performed using the fixed-key technique described in [5]; [XOR-3] where XOR gates were garbled using a simple 4-3 GRR method; [XOR-1] where XOR gates were garbled using our method of garbling with one ciphertext; [free-XOR] where the free-XOR technique was used; [AND-3] where AND gates were garbled using simple 4-3 GRR; [AND-2] where our 4-2 GRR method was used to garble AND gates; and finally, [AND-HalfGates] where the "half-gates" technique of [24] was used to garble AND gates. Note that the half-gates method is only used in conjunction with free-XOR since this is a requirement.
The first scheme in Table 4 is the most "naïve", where a simple 4-3 GRR was used for both AND and XOR gates and the garbling was pipelined but not the key-scheduling. In contrast, the last scheme is the most efficient as it uses fast fixed-key encryption and the half-gates approach to achieve two ciphertexts per AND gate and none for XOR gates. However, this scheme is based on the strongest assumption, namely that AES behaves like a random permutation with a fixed key. The third scheme in the table uses all our optimizations together, and thus it is the most efficient scheme that is based on a standard PRF assumption. The sixth scheme in the table shows the best that can be achieved while assuming circularity and related-key security, but without resorting to the public random permutation assumption.
Table 5: Summary of garbled-circuit size in number of ciphertexts, according to scheme. The schemes are as in Table 4.

The experiments were performed on Amazon's c4.8xlarge compute-optimized machines (with Intel Xeon E5-2666 v3 Haswell processors) running Windows. The measurements include the time it takes to garble the circuit, send it to the evaluator and compute the output. Since communication is also involved, this measures improvements both in the encryption technique and in the size of the circuit. Each scheme was tested on the three circuits in two different settings: the Virginia-Virginia (VA-VA) setting where the two parties running the protocol are located at the same data center, and the Virginia-Ireland (VA-IRE) setting where the physical distance between the parties is large. (We omitted the results of running the large min-cut circuit in the VA-IRE setting as they were not consistent and had a high variability.) Each number in the table is an average of 20 executions of the indicated specific scenario. The table rows marked in boldface highlight the best schemes under each set of assumptions. Looking at the results, we derive the following observations:
• Best efficiency: As predicted, the fixed key + halfgates implementation (8) is the fastest and most efficient in all scenarios. (This seems trivial, but when using fixed-key AES, the Eval procedure at AND gates requires one more encryption than in a simple 4-3 GRR. Thus, this confirms the hypothesis that the communication saved is far more significant than an additional encryption, that is anyway pipelined.)
• Small circuits: In small circuits (e.g., AES) the running time is almost identical in all schemes and in both communication settings. In particular, using our optimizations (3) yields the same performance result as that of the most efficient scheme (8), in both the VA-VA and VA-IRE settings. This is due to the fact that in small circuits, running the OT protocol is the bottleneck of the protocol (even if, as in our experiments, optimized OT extension [2] is used). This means that for small circuits there is no reason to rely on a nonstandard cryptographic assumption.
• Medium circuits: In the larger SHA-256 circuit, where the majority of the gates are AND gates, there was a difference between the results in the two communication settings. In the VA-VA setting the best scheme based on PRF alone (3) has performance that is closer to that of the naïve scheme (1) than to that of the schemes based on the circularity or the public random permutation assumptions (schemes 6 and 8). In contrast, in the VA-IRE setting the PRF based scheme performs close to schemes 6 and 8. This is explained by observing that when the parties are closely located, communication is less dominant and garbling becomes a bigger factor. Thus, garbling XOR gates for free improves the performance of the protocol. In contrast, when the parties are far from each other, communication becomes the bottleneck, thus the PRF based scheme (3) yields a significant improvement compared to the naïve case (1) and its performance is not much worse than that of the best fixed-key based scheme (and since there are fewer XOR gates, the overhead of an additional ciphertext per gate is reasonable).
• Large circuits: In the large Min-Cut circuit, the run time of our best PRF based scheme (3) is closer to the best result (8) than to the naïve result (1). This is explained by the fact that the circuit is very large and so bandwidth is very significant. This is especially true since the majority of gates are XOR gates, and so the reduction from 3 ciphertexts to 1 ciphertext per XOR gate has a big influence. (Observe that the number of ciphertexts sent in (8) is 2,000,000, the number of ciphertexts sent in (3) is 4,500,000, while the number of ciphertexts sent in (1) is 10,500,000.) Observe that schemes (6) and (8) have the same bandwidth; the difference in cost is therefore due to the additional cost of the AES key schedules and encryptions. Note, however, that despite the fact that there are 1,000,000 AND gates, the difference between the running-times is 15%, which is not negligible but also not overwhelming.
• Removing the public random permutation assumption: Comparing scheme (8), which is the most efficient, to scheme (6), which is the most efficient scheme that does not depend on the public random permutation assumption, shows that in all scenarios removing the fixed-key technique causes only a minor increase in running time.
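The bandwidth figures quoted in the observations above follow directly from the gate counts and the number of ciphertexts per gate. The short Python calculation below reproduces them; the dictionary layout and the scheme labels are illustrative choices, while the gate counts and per-gate ciphertext numbers are those stated in the text.

```python
# (AND gates, XOR gates) for the three circuits, as reported in the text.
CIRCUITS = {
    "AES": (6_800, 25_124),
    "SHA-256": (90_825, 42_029),
    "Min-Cut 250,000": (999_960, 2_524_920),
}

# (ciphertexts per AND gate, ciphertexts per XOR gate) for three of the schemes.
SCHEMES = {
    "(1) 4-3 GRR for AND and XOR": (3, 3),
    "(3) AND-2 with XOR-1": (2, 1),
    "(8) half-gates with free-XOR": (2, 0),
}

def ciphertexts(and_gates: int, xor_gates: int, per_and: int, per_xor: int) -> int:
    """Total number of ciphertexts in the garbled circuit."""
    return and_gates * per_and + xor_gates * per_xor

if __name__ == "__main__":
    for circuit, (n_and, n_xor) in CIRCUITS.items():
        for scheme, (per_and, per_xor) in SCHEMES.items():
            total = ciphertexts(n_and, n_xor, per_and, per_xor)
            print(f"{circuit:16s} {scheme:30s} {total:>12,d}")
```

For Min-Cut this gives 10,574,640, 4,524,840 and 1,999,920 ciphertexts for schemes (1), (3) and (8), matching the rounded figures of 10,500,000, 4,500,000 and 2,000,000 quoted above.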
We conclude that strengthening security by removing the public random permutation assumption does not noticeably affect the performance of the protocol. Thus, in many cases, two-party secure computation protocols do not need to use the fixed-key method. Further security strengthening by not depending on a circularity assumption (i.e., "paying" for XOR gates) does come with a cost. Yet, in scenarios where garbling time is not the bottleneck (e.g., small circuits, large inputs, communication constraints), one should consider using a more conservative approach as suggested in this work. In any case, we believe that our ideas should encourage future research on achieving faster and more efficient secure two-party computation based on standard cryptographic assumptions.
