In recent years, secure computation has been the subject of intensive research, moving from theory to practice. To make secure computation usable by non-experts, Fairplay (USENIX Security 2004) initiated a line of research on compilers that automatically generate circuits from high-level descriptions of the functionality to be computed securely. Most recently, TinyGarble (IEEE S&P 2015) demonstrated that it is natural to use existing hardware synthesis tools for this task.
INTRODUCTION
Secure computation allows multiple parties to evaluate a function on their private inputs without revealing any information except for the result of the computation. The first protocols were Yao's garbled circuits protocol [Yao86] and the protocol of Goldreich-Micali-Wigderson (GMW) [GMW87]. Both protocols securely evaluate a Boolean circuit that represents the desired functionality. Since then, a large body of literature has investigated the design and implementation of practical circuit-based secure computation in different adversarial settings. While experts can manually design efficient and correct circuits for small building blocks and simple applications, this task becomes highly complex, time-consuming, and thus error-prone for large applications such as floating-point arithmetic and signal processing. Faulty circuits could potentially break the security of the underlying applications, e.g., by leaking additional information about a party's private inputs. Hence, an automated way of generating correct large-scale circuits that can be used by regular developers is highly desirable.
A large number of compilers for secure computation, such as [MNPS04, BNP08, HKS+10, HEKM11, Mal11, MLB12, KSS12, HFKV12, SZ13, KSMB13, ZSB13], implemented circuit building blocks manually. Although these compilers have been tested to some extent, showing the correctness of the compilers and their generated circuits is still an open problem.
Recently, TinyGarble [SHS+15] took a completely different approach: it customized established, powerful hardware logic synthesis tools to automatically generate Boolean circuits for functions to be evaluated by Yao's garbled circuits protocol. The advantage of this approach is that these tools are used by industry for designing digital circuits and hence are tested thoroughly, an effort that is justified by the high production costs of Application-Specific Integrated Circuits (ASICs). However, these tools are designed primarily to synthesize circuits for hardware target platforms such as ASICs or configurable platforms such as Field Programmable Gate Arrays (FPGAs) or Programmable Array Logic (PAL). Using hardware logic synthesis tools for special purposes, such as generating circuits for secure computation, requires customizations and workarounds. Exploiting these tools promises accelerated and automated circuit generation, significant speedup, and ease in designing and generating circuits for much more complicated functions, while maintaining the size (and depth) efficiency of hand-optimized smaller circuit building blocks. In particular, TinyGarble exploited sequential logic to synthesize highly compact circuits. However, TinyGarble considered only a few functionalities: addition, Hamming weight, comparison, multiplication, matrix multiplication, AES, SHA-3, and a MIPS CPU.
In this work we continue along the lines of using logic synthesis tools for secure computation and automatically synthesize an extensive set of basic and complex operations, including IEEE 754 compliant floating-point arithmetic. In contrast to TinyGarble, which generated size-optimized circuits for Yao's garbled circuits protocol, we focus on synthesizing depth-optimized circuits for the GMW protocol [GMW87]. Although the round complexity of the GMW protocol depends on the circuit depth, it has some advantages compared with Yao's constant-round protocol: 1) it allows all symmetric cryptographic operations to be precomputed in a setup phase and thus offers a very efficient online phase, 2) its setup phase is independent of the function being computed, 3) it balances the workload equally between all parties, 4) it allows for better parallel evaluation of the same circuit (SIMD operations) [SZ13, DSZ15], 5) it can be extended to multiple parties, and 6) the TinyOT protocol [NNOB12], which provides security against stronger active adversaries, has an online phase that is very similar to that of GMW, and its round complexity also depends on the circuit depth.
We combine industrial-grade logic synthesis tools with the recent open-source ABY framework [DSZ15], which implements state-of-the-art optimizations of the two-party protocols by GMW and Yao. On the one hand, our approach makes it possible to use existing and tested libraries for complex functions, such as IEEE 754 compliant floating-point operations, that are already available in these tools without the need to re-implement them manually. On the other hand, it makes it possible to use high-level input languages such as Verilog, where we map high-level operations to our optimized implementations of basic functions.
Outline and Our Contributions
After summarizing related work in §1.2 and preliminaries in §2, we present the following contributions:
Architecture and Logic Synthesis (§3). We provide a fully automated end-to-end toolchain that allows the developer to describe the function to be computed securely in a high-level Hardware Description Language (HDL), such as Verilog, followed by the generation of the required customized circuit and its secure evaluation using either GMW [GMW87] or Yao's protocol [Yao86]. Our toolchain uses hardware synthesis tools, both open-source and commercial, to generate depth-optimized circuits for GMW and size-optimized circuits for Yao's protocol. For this, we customize state-of-the-art hardware synthesis tools with synthesis constraints and customized libraries to generate circuits optimized for either protocol according to the developer's choice.
Optimized Circuit Building Blocks ( §4). We develop a library of depth-optimized and size-minimized circuits, including arithmetic operations (e.g., addition, subtraction, multiplication, division), comparison, counter, and multiplexer, which can be used to construct more complex functionalities such as various distances, e.g., Manhattan, Euclidean, or Hamming distance. Some of the implemented building blocks show improvements in depth compared with hand-optimized circuits of [SZ13] by up to 14%, while others show at least equivalent results. Assembling sub-blocks from our customized library can be used to construct more complicated functionalities, which would otherwise be impossible to build and optimize by hand. We exploit the capabilities of our synthesis tools to bind high-level operators (e.g., the '+' operator) and functions to optimized circuits in our library to allow the developer to describe circuits in Verilog using high-level operators. We also utilize built-in Intellectual Property (IP) libraries in commercial hardware synthesis tools to generate Boolean circuits for more complex functionalities such as floating-point arithmetic which have been verified and tested extensively.
Benchmarks and Evaluation (§5). We use the ABY framework [DSZ15] to securely evaluate the Boolean circuits generated by our hardware synthesis toolchain. Moreover, we extend the list of available operations in ABY by multiple floating-point operations. In contrast to previous works that built dedicated and complex protocols for secure floating-point operations, we use highly tested industrial-grade floating-point libraries. We compare the performance of our constructions with related work. For floating-point operations, our runtimes are faster than those of [ABZS13] by a factor of 0.5 to 21.4 and faster than those of [KW14] by a factor of 0.1 to 3 267. We emphasize that we achieve these improvements even in a stronger setting, where all but one party can be corrupted and hence our protocols also work in a two-party setting, whereas the protocols of [ABZS13, KW14] require a majority of the participants to be honest and hence need n ≥ 3 parties. We also present timings for integer division that outperform related work of [ABZS13] (3-party) by a factor of 0.6 to 3.7 and related work of [KSS13] (2-party) by a factor of 32.4 to 274. Additionally, we present benchmarks for matrix multiplication, but here we are slower than previous approaches of [BNTW12, ZSB13, DSZ15].
Application: Private Proximity Testing (§6). A real-world application of floating-point calculations on private inputs is privacy-preserving proximity testing on Earth [ŠG14]. We implement the formulas described in [ŠG14] with our floating-point building blocks and achieve faster runtime as well as higher precision compared to their protocols. This demonstrates that our automatically generated building blocks can outperform hand-built solutions.
Related Work
We classify related work into different categories next.
TinyGarble. Most related to our work is the recently proposed TinyGarble framework [SHS+15], which was the first work to consider using hardware synthesis tools to automatically generate circuits for secure computation. The authors used sequential circuits, which describe a circuit as a loop over a smaller sub-circuit (e.g., an ℓ-bit ripple-carry adder can be represented as iterating ℓ times over a single-bit adder). Thereby, they are capable of generating highly compact circuit descriptions. Although this approach represents the circuits in a highly memory-efficient way, the total number of gates that are evaluated securely, and hence the communication and total number of crypto operations, remains unchanged. As the main goal of TinyGarble was to assess the memory efficiency, the paper gives benchmarks only for evaluating a single circuit, the ripple-carry adder, with Yao's garbled circuits protocol.
As described before in §1, the GMW protocol has several advantages over Yao's garbled circuits protocol (precomputation, load balancing, multiple parties, etc.), but requires circuits with low depth. Unfortunately, sequential circuits cannot directly be applied to the GMW protocol, since the sequential circuit structure can significantly increase the depth of the circuit and thus the communication rounds required by GMW. Our work is the first to consider automated hardware synthesis of low-depth combinational circuits optimized for use in the GMW protocol, as well as size-optimized circuits for Yao's protocol. Our work also allows developers to write high-level Verilog code which can be automatically mapped to our optimized circuits by binding our circuit descriptions to arithmetic operators. Instead of using a domain specific input language, we use existing Hardware Description Languages (HDLs) such as Verilog or VHDL that are already known by many developers. Thereby, we can use existing code and allow a large community of developers to specify functionalities without the necessity of learning a new language.
Secure Computation Compilers from ANSI C. The following secure computation tools use a subset of the ANSI C programming language as input. CBMC-GC [HFKV12] initiated this line of development and used a SAT solver to generate size-optimized Boolean circuits from a subset of ANSI C. PCF [KSMB13] compiles into a compact intermediate representation that also supports loops, similar to the sequential circuits of TinyGarble described above. Both CBMC-GC and PCF target Yao's garbled circuits protocol and hence only optimize for size. PICCO [ZSB13] is a source-to-source compiler that allows parallel evaluation and uses secure computation protocols based on linear secret sharing with at least three parties.
Although ANSI C is widely known as well, it has the drawback that some operations are either not supported (e.g., pointer arithmetic) or incur significant costs when compiled into a circuit (e.g., array access depending on private values). Thereby, existing C code sometimes needs to be rewritten or results in inefficient protocols. Although we do not eliminate these restrictions in our work, these issues do not occur when taking existing functionalities described in HDLs that do not support pointers and often avoid accesses to arrays with private indices, as these result in costly multiplexers. 
PRELIMINARIES
In this section we provide preliminaries and background related to the GMW protocol ( §2.1), hardware synthesis ( §2.2), and the IEEE 754 floating-point standard ( §2.3).
The GMW protocol
In the GMW protocol [GMW87], two or more parties compute a function that is encoded as a Boolean circuit. The parties' private inputs and all intermediate gate values are perfectly hidden by an XOR-based secret sharing scheme. In GMW, XOR gates are evaluated locally, without interaction, using only one-time pad operations and thus essentially for free. AND gates, however, require interaction in the form of Oblivious Transfers (OTs) or Beaver's multiplication triples [Bea91], which can be precomputed in a setup phase that is independent of the parties' private inputs and the function being computed. This precomputation can be achieved efficiently by using OT extension [IKNP03, ALSZ13] as shown in [CHK+12, SZ13]. After evaluating all circuit gates in the online phase, the output can be reconstructed by computing the XOR of the resulting output shares.
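The gate evaluation described above can be illustrated with a minimal Python sketch of two-party XOR sharing and AND evaluation with a Beaver multiplication triple. This is an illustration under simplifying assumptions, not the ABY implementation: both parties are simulated in one process, the hypothetical `make_triple` function plays the role of a trusted dealer in place of the OT-based setup phase, and all function names are chosen for this example only.

```python
import secrets

def share(bit):
    """Split a bit into two XOR shares (one per party)."""
    r = secrets.randbits(1)
    return r, bit ^ r

def reconstruct(sh):
    """Recombine two XOR shares into the plain bit."""
    return sh[0] ^ sh[1]

def xor_gate(x_sh, y_sh):
    """XOR is local: each party XORs its own shares, no interaction."""
    return x_sh[0] ^ y_sh[0], x_sh[1] ^ y_sh[1]

def make_triple():
    """Assumed trusted-dealer stand-in for the setup phase:
    returns shares of random bits a, b and of c = a AND b."""
    a, b = secrets.randbits(1), secrets.randbits(1)
    return share(a), share(b), share(a & b)

def and_gate(x_sh, y_sh, triple):
    """AND gate using one precomputed multiplication triple."""
    a_sh, b_sh, c_sh = triple
    # One communication round: the masked values d = x ^ a and e = y ^ b
    # are opened by exchanging the masked shares.
    d = (x_sh[0] ^ a_sh[0]) ^ (x_sh[1] ^ a_sh[1])
    e = (y_sh[0] ^ b_sh[0]) ^ (y_sh[1] ^ b_sh[1])
    # x & y = c ^ (d & b) ^ (e & a) ^ (d & e); the public term d & e
    # is added by party 0 only.
    z0 = c_sh[0] ^ (d & b_sh[0]) ^ (e & a_sh[0]) ^ (d & e)
    z1 = c_sh[1] ^ (d & b_sh[1]) ^ (e & a_sh[1])
    return z0, z1
```

In this sketch each AND gate consumes one triple and one round of interaction, whereas XOR gates need neither, which is why the number and depth of AND gates determine the cost of the protocol.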
In order to achieve high performance, both the total number of AND gates in the circuit (the circuit size S) and the maximum number of AND gates on any path from an input wire to an output wire (the circuit depth D) should be low. In this work we use the variant of the GMW protocol with two parties and security against passive/semi-honest adversaries.
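As a small illustration of these two metrics, the following Python sketch computes S and D for a gate list. The netlist representation (tuples of output wire, gate type, and input wires, given in topological order) and the function name are assumptions made for this example and do not correspond to the netlist format used by our toolchain.

```python
def and_size_and_depth(gates, inputs):
    """Return (S, D) for a netlist.

    gates:  list of (out_wire, op, in_wires) in topological order,
            with op in {"AND", "XOR", "INV"}.
    inputs: names of the primary input wires.
    """
    depth = {w: 0 for w in inputs}   # AND depth of every wire
    size = 0
    for out, op, ins in gates:
        d = max(depth[w] for w in ins)
        if op == "AND":              # only AND gates cost a round
            size += 1
            d += 1
        depth[out] = d               # XOR and INV gates are free
    return size, max(depth.values())

# Two circuits for a AND b AND c AND d with the same size but different depth.
chain = [("t1", "AND", ("a", "b")),
         ("t2", "AND", ("t1", "c")),
         ("t3", "AND", ("t2", "d"))]
tree = [("t1", "AND", ("a", "b")),
        ("t2", "AND", ("c", "d")),
        ("t3", "AND", ("t1", "t2"))]
```

Both example circuits contain three AND gates, but the balanced tree needs only two communication rounds in GMW instead of three. Depth optimization performs this kind of restructuring at scale.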
Hardware Synthesis
Hand-optimizing Boolean circuits for secure computation is a tedious, error-prone, and time-consuming task. Using hardware synthesis tools for synthesizing and optimizing these circuits, and even more complex circuits that cannot be easily hand-optimized, is a promising and natural approach. As shown in TinyGarble [SHS+15], hardware synthesis tools reduce the time and effort invested by further automating the process of generating Boolean netlists that are optimized in terms of circuit size and/or depth.
Overview. Hardware or logic synthesis is the process of translating an abstract form of circuit description into its functionally equivalent gate-level logic implementation using a suite of different optimizations and mapping algorithms that have been a subject of research for many years. A logic synthesis tool is software that takes as input a function description (functional, behavioral, or structural description, state machine, or truth table) and transforms and maps this description into an output suitable for the target hardware platform and manufacturing technology.
Tools. Common target hardware platforms for synthesized logic include Field Programmable Gate Arrays (FPGAs), Programmable Array Logics (PALs), and Application Specific Integrated Circuits (ASICs). ASIC synthesis tools, as opposed to FPGA synthesis tools, are used in this work due to the increased flexibility and options allowed in their synthesis tools, and because FPGA synthesis tools map circuits into Look-up Tables (LUTs) and flip-flop (FF) gates in accordance with FPGA architectures, and not Boolean gates, which makes them unsuitable for this work. We used two main ASIC synthesis tools interchangeably: Synopsys Design Compiler (DC) [Syn10] which is one of the most popular commercial logic synthesis tools, and the open-source academic Yosys-ABC toolchain [Wol, Ber] . In the following, we focus on briefly describing the synthesis flow of Synopsys DC.
Synthesis Flow. A Hardware Description Language (HDL) description of the desired circuit is provided to Synopsys DC. Operations in this description are mapped to the most appropriate circuit components selected by Synopsys DC from two types of libraries: the generic technology (GTECH) library of basic logic gates and flip-flops, called cells, and synthetic libraries consisting of optimized circuit descriptions for more complex operations. DesignWare [Syn15] is a built-in synthetic library provided by Synopsys, consisting of tested IP constructions of frequently used standard and complex cells, such as arithmetic or signal processing operations. This first mapping step is independent of the actual circuit manufacturing technology and results in a generic structural representation of the circuit. This representation is then mapped to low-level gates selected from a target technology library to obtain a technology-specific representation: a list of Boolean and technology-specific gates (e.g., multiplexers), called a netlist.
Synopsys DC performs all of the above mapping and synthesis processes under synthesis and optimization constraints, which are directives and options provided by the developer to optimize the delay, area and other performance metrics of a synthesized circuit.
Input to these hardware synthesis tools can be a pure combinational circuit, which maps only to Boolean gates, or a sequential circuit that requires a clock signal and FF gates which are memory elements to store the current state of the circuit. The output of a sequential circuit is a function of both the circuit inputs and the current state. In this work, we constrain circuit description to combinational circuits.
High-Level Synthesis. Logic synthesis tools most commonly accept the input function description in an HDL format (Verilog or VHDL), whereas more recent logic synthesis tools support high-level synthesis (HLS). This allows them to accept higher-level circuit descriptions in C/C++ or similar high-level programming alternatives. The HLS tools then transform the functional high-level input code into an equivalent hardware circuit description, which in turn can be synthesized by classic logic synthesis. Although this higher abstraction is more developer-friendly and usable, the performance of the resulting circuits is often inferior to that of HDL descriptions, unless heavy design constraints are provided to guide the mapping and optimization process.
The IEEE 754 Floating-Point Standard
Floating-point (FP) numbers represent approximations of real numbers with a trade-off between precision and range. The IEEE 754 floating-point standard [FP008] defines arithmetic formats for finite numbers, including signed zeros and subnormal numbers, infinities, and special "Not a Number" (NaN) values, as well as rounding rules to be satisfied when rounding numbers during floating-point operations, e.g., rounding to nearest even. Additionally, the standard defines exception handling for conditions such as division by zero, overflow, underflow, infinity, invalid, and inexact.
The IEEE 754 standard 32-bit single-precision floating-point format consists of 1 bit for the sign, 8 bits for the exponent, and 23 bits for the significand, distributed from MSB to LSB as follows: sign [31], exponent [30:23], and significand [22:0]. The 64-bit double-precision format consists of 1 bit for the sign, 11 bits for the exponent, and 52 bits for the significand.
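The bit layout can be made concrete with a short Python sketch that splits a value into its sign, exponent, and significand fields using the standard `struct` module. The function names are chosen for this example. The double-precision bit positions (sign [63], exponent [62:52], significand [51:0]) are not stated above but follow from the field widths.

```python
import struct

def fp32_fields(x):
    """Return (sign, biased exponent, significand) of x as IEEE 754 single precision."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                    # bit [31]
    exponent = (bits >> 23) & 0xFF       # bits [30:23]
    significand = bits & 0x7FFFFF        # bits [22:0]
    return sign, exponent, significand

def fp64_fields(x):
    """Return (sign, biased exponent, significand) of x as IEEE 754 double precision."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63                    # bit [63]
    exponent = (bits >> 52) & 0x7FF      # bits [62:52]
    significand = bits & ((1 << 52) - 1) # bits [51:0]
    return sign, exponent, significand
```

For example, `fp32_fields(1.0)` yields a sign of 0, a biased exponent of 127, and a significand of 0, since 1.0 = 1.0 × 2^0 and the single-precision exponent bias is 127.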
OUR TOOLCHAIN
We describe our toolchain here by presenting our architecture followed by a detailed description of each component.
Architecture
An overview of our architecture is shown in Fig. 1. We provide the hardware synthesis tools with optimization and synthesis constraints along with a set of customized technology and synthesis libraries (cf. §3.2) to map the input circuit description in Verilog (or any other HDL) into a functionally equivalent Boolean circuit netlist in Verilog. The output netlist is constrained to consist of AND, XOR, INV, and MUX gates. The Verilog netlist is then parsed and scheduled, and provided as input to the ABY framework [DSZ15], which we extended to process this netlist and generate the Boolean circuit described in it. The evaluation of the GMW protocol in ABY minimizes the number of communication rounds, i.e., all AND gates on the same layer are evaluated in parallel.
In the following we describe in further detail the main components of our toolchain architecture: logic synthesis ( §3.2), scheduling ( §3.3), and extending the ABY framework ( §3.4).
Hardware and Logic Synthesis
The GMW protocol and Yao's protocol require that the function to be computed is represented as a Boolean circuit. As described in detail in §1.2, previous work, such as the Fairplay framework [MNPS04, BNP08], used domain-specific high-level languages that allow a developer to describe the function to be computed, which in turn gets compiled into a Boolean circuit. Other compilers allow compilation of circuit descriptions written in C into size-optimized Boolean circuits, e.g., [HFKV12], whereas further tools allow a developer to build up the circuit by instantiating its building blocks from custom libraries, e.g., [HEKM11, Mal11]. All these works rely on custom-made compilers and/or languages that compile a high-level description of the functionality and map it to a Boolean circuit. This may be considered "reinventing the wheel", since Boolean mapping and optimization is the core of hardware synthesis tools and has been researched for a long time. It has been argued, however, that such "hardware compilers" primarily target hardware platforms and therefore involve technology constraints and metrics that are not directly related to the purpose of generating Boolean circuits for secure computation. Writing circuits in an HDL, such as Verilog or VHDL, is not entirely high-level and involves hardware description paradigms that may differ from high-level programming paradigms. Furthermore, such descriptions rely on the use of sequential logic rather than pure combinational logic.
Exploiting Logic Synthesis. However, the TinyGarble framework [SHS+15] exploited these very same points and employed hardware synthesis tools to generate compact sequential Boolean circuits for secure evaluation by Yao's garbled circuits protocol [Yao86]. Our work extends this further by using hardware synthesis tools to generate combinational circuits of more complex functionalities for evaluation by both Yao's protocol and the GMW protocol [GMW87], while excluding all design and technology optimization metrics. The synthesis and generation of the Boolean netlist by the synthesis tools (cf. §2.2) can be optimized according to the synthesis constraints and optimization options provided. Hardware synthesis tools conventionally target circuit synthesis on hardware platforms, but can be adapted and exploited for secure computation purposes to generate Boolean netlists that are AND-minimized (depth-optimized primarily for GMW or size-optimized for Yao's garbled circuits).
Customizing Synthesis
In the following, we focus on how we customized the synthesis flow of Synopsys DC to generate our Boolean netlists.
Synthesis Flow. The synthesis and optimization constraints that can be provided to Synopsys DC allow us to manipulate it to serve our purposes in this work, and generate depth-optimized circuit netlists for evaluation with GMW. Moreover, we developed a synthetic library of optimized basic cells and depth/size-optimized circuit building blocks that can be assembled by developers to build more complex circuits, and a customized technology library to constrain circuit mapping to XOR and AND gates only. The different libraries and our engineered customizations to achieve this are described next.
Synthetic Libraries. The first step of the synthesis flow is to convert arithmetic and conditional operations (if-else, switch-case) to their functionally equivalent logical representations. By default, they are mapped to cells (either simple gates or more complex circuits such as adders and comparators) extracted from the GTECH library and the built-in Synopsys DC DesignWare library [Syn15] (cf. §2.2). A single cell can have different implementations from which the synthesis tool selects, depending on the provided constraints. For example, the sum of two ℓ-bit numbers can be replaced with 1 out of 10 different adder implementations available in both libraries, depending on the optimization constraints provided (optimizing for area or delay).
Our Optimized Circuit Building Blocks Library. Besides the standard built-in libraries, we developed our own DesignWare circuits in a customized synthetic library. It consists of depth-optimized circuit descriptions (arithmetic, comparators, 2-to-1 multiplexer, etc.) customized for GMW, as well as size-optimized counterparts for Yao's garbled circuits. Synopsys DC can then be instructed to prefer automated mapping to our customized circuit descriptions (cf. §4) rather than built-in circuits (cf. §3.2.3 for developer usage).
Technology Library. The intermediate generic representation of the circuit obtained in the step before is then mapped into low-level gates extracted from a technology library. A technology library is a library that specifies the gates and cells that can be manufactured by the semiconductor vendor onto the target platform. The library consists of the functional description (such as the Boolean function they represent) of each cell, as well as their performance and technology attributes such as timing parameters (intrinsic rise and fall times, capacitance values, etc.) and area parameters.
Technology libraries targeting ASICs contain a range of cells ranging from simple 2-input gates to more complex gates such as multiplexers and flip-flops. A single cell can also have different implementations which have varying technology attributes. Ultimately, the goal of the synthesis tool is to map the generic circuit description into a generated netlist of cells from this target technology such that user-provided constraints and optimization goals are satisfied.
Our Customized Technology Library. In order to meet the requirements on the Boolean circuit netlists in this work, we constrain Boolean mapping to non-free AND and free XOR gates. However, Synopsys DC requires that synthesis runs with at least OR, AND, and inverter (INV) gates defined in the technology library. We developed a customized technology library that has no manufacturing or technology rules defined, similar to the approach in TinyGarble, and we manipulated the cost functions of the gates by setting the area and delay parameters of XOR gates to 0 and setting them to very high non-zero values for OR gates to ensure their exclusion from the mapping. These very high area and delay costs force Synopsys DC to re-map all instances of OR gates to AND and INV gates according to their equivalent Boolean representation (A∨B=¬(¬A∧¬B)), and to optimize the Boolean mapping in order to meet the specified area/delay constraints. We set the area and delay costs of an inverter (INV) gate to zero, as it can be replaced with an XOR gate with one input tied to constant one. For AND gates, the area and delay costs are set to reasonably high values, but not so high that they are excluded from synthesis. We set the area cost of a MUX gate equal to that of a single AND gate (since the 2-to-1 multiplexer construction in [KS08] is composed of a single AND gate and 2 XOR gates), and we set its delay cost to 0.25 times more than that of an AND gate to ensure preferred but also non-redundant mapping to MUX gates whenever feasible. We concluded that these settings give the most desirable mapping results after experimenting with Synopsys DC mapping behavior in different scenarios.
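The gate identities behind these cost settings can be checked exhaustively with a few lines of Python. The sketch below models wires as the integers 0 and 1. The function names are chosen for this example, and the multiplexer formula a ⊕ (s ∧ (a ⊕ b)) is one well-known construction with a single AND gate and two XOR gates, which we assume matches the construction referenced above.

```python
from itertools import product

def inv(a):
    """Inverter as an XOR with constant one (free in GMW and Yao)."""
    return a ^ 1

def or_gate(a, b):
    """OR re-mapped to AND and INV: A OR B = NOT(NOT A AND NOT B)."""
    return inv(inv(a) & inv(b))

def mux(s, a, b):
    """2-to-1 multiplexer with one AND and two XOR gates:
    returns a if s == 0 and b if s == 1."""
    return a ^ (s & (a ^ b))

# Exhaustive check over all input combinations.
for a, b in product((0, 1), repeat=2):
    assert or_gate(a, b) == (a | b)
for s, a, b in product((0, 1), repeat=3):
    assert mux(s, a, b) == (b if s else a)
```

Each of these replacements costs at most one AND gate, which is consistent with assigning zero cost to XOR and INV gates and the cost of one AND gate to a MUX.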
Synthesis Constraints. We provide constraints that make delay optimization of the circuit a primary objective followed by area optimization as a secondary objective when generating depth-optimized circuits for GMW. We set the preference attribute to XOR gates, and disable circuit flattening to avoid remapping of XOR gates to other gates. Synthesis tools are not primarily designed to minimize Boolean logic by maximizing XOR gates and reducing the multiplicative complexity of circuits within multi-level logic minimization. This is because XOR gates are only considered as "free" gates in secure computation applications, whereas in the domain of traditional hardware CMOS design, NAND gates are the universal logic gates from which all other gates can be constructed. Hence, the tools need to be heavily manipulated to achieve our objectives. These constraints and technology library settings also have to be customized differently when we want to generate circuits optimized for other secure computation protocols, such as Yao's garbled circuits.
Construction of More Complex Circuits. The customized circuit descriptions we developed can be used to build higher-level and more complex applications. We assembled complex constructions such as Private Set Intersection (PSI) primitives (bitwise-AND, pairwise comparison, and Sort-Compare-Shuffle networks as described in [HEK12]) using our customized building blocks, and they achieve the same AND gate count and depth as their hand-optimized counterparts in [HEK12]. In general, a wide range of more complex functionalities and primitives can be constructed by assembling these circuit building blocks along with built-in DesignWare IP implementations. These more complex circuits can in turn be added to our library and re-used in building even more complex circuits, in a modular and hierarchical way.
HDLs also allow a developer to describe circuits recursively which can be synthesized, which is often the most efficient paradigm for describing depth-optimized circuit constructions such as the depth-optimized "greater than" operation [GSV07] , the Waksman permutation network [Wak68] , or the Boyar-Peralta counter [BP06] .
High-level Function and Operator Mapping
An alternative to describing the circuits for HLS in high-level C/C++ is to allow developers to input their circuit descriptions in high-level Verilog by calling operators and functions, which we map to "instantiate" circuit modules such as depth-optimized adders or comparators from our customized synthetic library. This allows high-level circuit descriptions without incurring the drawbacks of using HLS tools, such as inferior hardware implementation (cf. §2.2).
Mapping Operators. We prepared a library description that links our customized circuits into Synopsys DC. It provides a description of each circuit module, its different implementations, and the operator bound to each module. These operators can be newly created or already built-in ('+', '-', '*', etc.), but bound to our customized circuits. For instance, when synthesizing the statement Z = X + Y, Synopsys DC automatically maps the '+' to our customized Ladner-Fischer adder rather than to a built-in adder implementation.
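To illustrate why binding '+' to a parallel-prefix adder matters for GMW, the following Python sketch simulates a prefix adder at the bit level and reports its AND depth. As a simplifying assumption, it uses the Kogge-Stone prefix network as a stand-in, not the Ladner-Fischer construction of our library. Both belong to the family of parallel-prefix adders with logarithmic depth, but they differ in gate count and wiring. The function name and return format are chosen for this example.

```python
def kogge_stone_add(a, b, n):
    """Add two n-bit integers modulo 2**n with a Kogge-Stone prefix network.

    Returns (sum, and_depth), where and_depth counts layers of AND gates.
    """
    x = [(a >> i) & 1 for i in range(n)]
    y = [(b >> i) & 1 for i in range(n)]
    p0 = [xi ^ yi for xi, yi in zip(x, y)]   # propagate bits (XOR, free)
    g = [xi & yi for xi, yi in zip(x, y)]    # generate bits (AND layer 1)
    p = p0[:]
    levels = 0
    d = 1
    while d < n:
        g_new, p_new = g[:], p[:]
        for i in range(d, n):
            # Combine the group ending at i with the group ending at i - d.
            # XOR can replace OR here because g and p are never both 1.
            g_new[i] = g[i] ^ (p[i] & g[i - d])
            p_new[i] = p[i] & p[i - d]
        g, p = g_new, p_new
        d *= 2
        levels += 1                          # one AND layer per prefix level
    carry_in = [0] + g[:-1]                  # carry into bit i is G[0..i-1]
    s = [p0[i] ^ carry_in[i] for i in range(n)]
    return sum(bit << i for i, bit in enumerate(s)), 1 + levels
```

For 32-bit operands this sketch uses 6 layers of AND gates, whereas a ripple-carry adder needs one AND layer per bit position, so the prefix structure reduces the number of GMW communication rounds considerably at the price of more AND gates.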
Mapping Functions. We mapped functions to instantiate circuit modules by creating a global Verilog package file which declares these functions and which circuit modules they instantiate when being called. This package file is then included in the high-level Verilog description code which calls on these functions.
Explicit Instantiation. Other, more complex circuits can only be instantiated explicitly, either from our customized building blocks library or from the DesignWare IP library, which offers a wide range of IP implementations with verified and guaranteed correctness, such as the floating-point operations we present and benchmark in §5.3. A list of available DesignWare IP implementations can be found in [Syn15].
High-level Circuit Description Example. In Fig. 2 , we show how the depth-optimized constructions of the Manhattan, Euclidean and Hamming distances [SZ13] are described using high-level Verilog. The Manhattan distance between two points is the distance in a 2-dimensional space between these two points based only on horizontal and vertical paths. The Euclidean distance between two points computes the length of the line segment connecting them. Hamming distance between two strings computes the number of positions at which the strings are different.
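For testing synthesized distance circuits, a plain reference model of the three functions is helpful. The following Python sketch is such a model under stated assumptions: points are 2-dimensional integer pairs, strings are given as non-negative integers whose bits are compared, and the Euclidean function returns the squared distance to stay within integer arithmetic (taking the square root gives the segment length). The function names are chosen for this example and are unrelated to the module names in our library.

```python
def manhattan(p, q):
    """Sum of the absolute coordinate differences of two 2-D points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])

def euclidean_squared(p, q):
    """Squared length of the line segment between two 2-D points."""
    dx, dy = p[0] - q[0], p[1] - q[1]
    return dx * dx + dy * dy

def hamming(a, b):
    """Number of bit positions in which the integers a and b differ."""
    return bin(a ^ b).count("1")
```

The Hamming distance model mirrors the circuit structure: a layer of free XOR gates followed by a counter over the resulting bits.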
In the Euclidean distance description, in lines 19 and 20 the '-' operator is mapped automatically to our Ladner-Fischer subtractor. The function sqr called in lines 23 and 24 is automatically mapped to instantiate our Ladner-Fischer squarer. We declared and bound this function in the package file 'func_global.v', which is included in line 6. The case statements in lines 26-34 (as well as if...else statements) are mapped to our depth-optimized multiplexer. In line 38, a carry-save network is explicitly instantiated from our library described in §4.2, since some circuit blocks are not mapped to functions and operators and have to be explicitly instantiated due to their structure and design. In the Manhattan distance description, the absolute differences are computed by calling the 'abs_diff' function in line 12, which is also mapped to instantiate the corresponding circuit. The same high-level abstraction can be seen in the Hamming distance description. Once these distance circuits are constructed, they can be appended to our blocks library to be easily re-used in more complex functionalities.
Developer Usage
By default, Synopsys DC maps operations to DesignWare circuit descriptions. For operations that have multiple circuit descriptions optimized for different parameters, e.g., area or delay, Synopsys DC selects the circuit description that best satisfies the constraints provided by the developer in the synthesis script. Alternatively, the developer can explicitly select a specific circuit description to map an operation to. For example, the built-in DesignWare adder circuit is available in different implementations: ripple-carry, carry-look-ahead and other area- and delay-optimized implementations. Synopsys DC selects the most suitable implementation to map '+' to, depending on the developer-provided constraints. Furthermore, the developer can also specify in the synthesis script that a certain implementation is preferred, or the implementation can be explicitly called in the Verilog code.
In order to map to our customized circuits through our synthetic libraries instead of DesignWare, developers have to decide which metric to optimize for: depth or size. Accordingly, developers add the libraries' paths and a single command to the synthesis script, which directs Synopsys DC to optimize for either depth (for GMW) or size (for Yao) and to prefer the corresponding set of circuit descriptions. If developers want to instantiate a specific circuit description from our customized libraries, they can call it by the name of the circuit module and define its inputs/outputs and parameters.
Optimization constraints are generally specified by the developer once for the entire top-level circuit description in the synthesis script, while some sub-circuits require specific optimization constraints. We already specified the optimization constraints for our customized circuit building blocks.
Challenges of Logic Synthesis for Secure Computation
Conventionally, synthesis tools are best at synthesizing sequential hardware circuits with a clock input and flip-flops. This also means that the actual circuit netlists synthesized are much more compact than combinational Boolean circuits. However, for the purpose of this work, the required netlists must be combinational so that they can be evaluated with a secure computation protocol in the ABY framework. This implies synthesis of circuits which reach up to 10 million gates and beyond, which is time- and resource-consuming for hardware synthesis tools. In the hardware synthesis world, this can be managed by generating sub-blocks in a hierarchical fashion, and appending them into one top-level circuit.
However, in this work, one coherent Boolean netlist is required for a single functionality, hence all sub-blocks of a hierarchy must be un-grouped during synthesis, which is resource consuming. We use workarounds to ease the memory and resource requirements. However, this may come at the expense of inter-block optimization across block boundaries, but this can also be customized for individual synthesis scenarios by enabling the boundary optimization option when desired.
Scheduling
The output netlist generated from the hardware synthesis tools has to be parsed in an intermediate step before being provided to the ABY framework. A parser and scheduler topologically sorts and schedules the netlist gates [KA99] , since the Verilog netlist output from some synthesis tools is not topologically sorted, i.e., a wire can be listed as input to one gate before assigning output to it from another. The scheduler generates a Boolean netlist in a format which is similar to Fairplay's SHDL [MNPS04] . All gates and wires are renamed to integer wire IDs for easier processing by the ABY framework, and complex statements are rewritten as one or several available gates. These steps ensure that the final netlist contains only AND, XOR, INV and MUX gates.
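The scheduling step can be illustrated with a minimal sketch of a topological sort (Kahn's algorithm) followed by renaming wires to integer IDs. The tuple-based gate representation and the function name `schedule` are assumptions made for this illustration; they are not the actual parser or the netlist format used with ABY.

```python
from collections import deque


def schedule(gates, primary_inputs):
    """Topologically sort gates and rename wires to integer IDs.

    gates: list of (output_wire, gate_type, input_wires), in any order.
    primary_inputs: ordered list of circuit input wire names.
    Returns a list of (output_id, gate_type, input_ids) in evaluation order.
    """
    inputs = set(primary_inputs)
    producer = {out: idx for idx, (out, _, _) in enumerate(gates)}
    pending = [0] * len(gates)            # unresolved gate-driven inputs
    consumers = [[] for _ in gates]       # gates that wait for this gate
    for idx, (_, _, ins) in enumerate(gates):
        for wire in ins:
            if wire in inputs:
                continue
            if wire not in producer:
                raise ValueError(f"wire {wire!r} is never driven")
            pending[idx] += 1
            consumers[producer[wire]].append(idx)

    ready = deque(i for i, n in enumerate(pending) if n == 0)
    order = []
    while ready:
        idx = ready.popleft()
        order.append(idx)
        for nxt in consumers[idx]:
            pending[nxt] -= 1
            if pending[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(gates):
        raise ValueError("combinational loop detected")

    # Rename: primary inputs first, then gate outputs in scheduled order.
    ids = {wire: i for i, wire in enumerate(primary_inputs)}
    scheduled = []
    for idx in order:
        out, kind, ins = gates[idx]
        ids[out] = len(ids)
        scheduled.append((ids[out], kind, tuple(ids[w] for w in ins)))
    return scheduled
```

In this sketch a gate whose input is listed before the gate that drives it is simply deferred until its driver has been emitted, which is the situation described above for unsorted Verilog netlists.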
Extending the ABY Framework
The open-source ABY framework [DSZ15] is an extensive tool that enables a developer to manually implement secure two-party computation protocols by offering several low-level as well as intermediate circuit building blocks that can be freely combined. We extended the ABY framework with an interface where externally constructed blocks made of low-level gates can be input in a simple text format, similar to SHDL [MNPS04] and the circuit format from [ST] , that we can parse as well, with some modifications.
This interface is used to input the parsed and scheduled netlists from our hardware synthesis. ABY creates a Boolean circuit with low depth from that input netlist, i.e., it schedules AND gates on the earliest possible layer and automatically processes all AND gates in one layer in parallel. A developer has two options: 1) our hardware synthesized netlist can be used as a full protocol instance from private inputs to output or 2) the netlist's functionality can be used as a building block and combined with other synthesized or hand-built sub-circuits within ABY in order to create the whole secure computation protocol. The output of ABY is a fully functional secure computation protocol that is split into a setup phase and an online phase and that can be evaluated on two parties' private inputs.
BUILDING BLOCKS LIBRARY
We implemented the following blocks in Verilog as pure combinational circuits and synthesized their Boolean netlists using both Synopsys DC and Yosys-ABC interchangeably to show that the framework is independent of the used synthesis tool. All implemented circuits have configurable parameters such that they can handle the desired bit-width $\ell$ of the inputs and/or number of inputs n. We summarize and compare our synthesis results with their hand-optimized counterparts in [HKS+10, HEK12, SZ13]. The two main comparison metrics are size S, which is the circuit size in terms of non-free AND gates, and depth D, which is the number of AND gates along the critical path of the circuit. XOR gates are considered to be free, as the GMW protocol and Yao's protocol with free XORs [KS08] allow XOR gates to be evaluated securely and locally without any communication. Next, in §4.1, we show the results for functionalities that have improved depth or size compared with their hand-optimized counterparts, and then in §4.2 we describe further functionalities and blocks that we have implemented in our library and that show results equivalent to their hand-optimized counterparts. Finally, in §4.3, we describe the floating-point operations and integer division that we benchmark in §5.
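As a minimal sketch of how the two metrics can be computed, the following function walks a topologically sorted netlist and counts the non-free gates (S) and the number of non-free gates on the longest path (D). The tuple format is an assumption for illustration, as is the choice to count a MUX as one AND gate (a 1-bit multiplexer can be built from one AND and two XOR gates).

```python
NON_FREE = {"AND", "MUX"}  # assumption: a MUX costs one AND gate


def size_and_depth(netlist, num_inputs):
    """Return (S, D) for a topologically sorted netlist.

    netlist: list of (output_id, gate_type, input_ids); wire IDs
    0 .. num_inputs-1 are the circuit inputs.
    """
    depth = {wire: 0 for wire in range(num_inputs)}
    size = 0
    for out, kind, ins in netlist:
        level = max(depth[w] for w in ins)
        if kind in NON_FREE:
            size += 1      # contributes to S
            level += 1     # lengthens the path by one non-free gate
        depth[out] = level  # XOR and INV are free and keep the level
    return size, max(depth.values(), default=0)
```

The per-wire level computed here also corresponds to the earliest layer on which a gate could be scheduled, so grouping gates by level yields the sets of AND gates that can be processed in parallel.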
Improved Functionalities
In this section, we present the implemented functionalities that achieved better results in terms of size or depth compared with [HKS+10, SZ13]. Results are given in Tab. 1.
Ladner-Fischer LF Adder/Subtractor. The LF adder/subtractor has a logarithmic depth [LF80, SZ13]. Our results show improvement for both depth (up to 10%) and size (up to 14%) in the subtraction circuit, while maintaining the same size and depth for addition of power-of-two numbers. Both circuits can also handle numbers that are not powers of two and achieve better size (up to 20%) as the hardware synthesis tool automatically removes gates whose outputs are neither used later as inputs to other gates nor assigned directly to the output of the circuit.
Karatsuba Multiplier KMUL. We implemented a recursive Karatsuba multiplier [KO62] using a ripple-carry multiplier for inputs with bit-width $\ell < 20$, while inputs with $\ell \geq 20$ are processed recursively. We compare our results with numbers given in [HKS+10], which generated size-optimized Boolean circuits for garbled circuits, but did not consider circuit depth. Here we achieve up to 3% improvement in size.
Manhattan Distance DST_M. Manhattan distance is implemented as a depth-optimized circuit using Ladner-Fischer addition ADD_LF and subtraction SUB_LF, or as a size-optimized circuit using ripple-carry addition ADD_RC and subtraction SUB_RC [CHK+12, SZ13]. Our results demonstrate improvements in terms of size (up to 16%) and depth (up to 13.6%).
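To illustrate why a parallel-prefix adder reaches logarithmic depth, the following sketch computes all carries with a prefix network over (generate, propagate) pairs. It uses a Sklansky-style layout chosen for brevity, which is an assumption of this example; it is not the synthesized LF circuit from our library and makes no claim about its exact gate counts.

```python
def prefix_add(a, b, n):
    """Add two n-bit integers with a parallel-prefix carry network.

    Returns the (n+1)-bit sum. Each pass of the while loop is one prefix
    level; there are ceil(log2(n)) levels, and all combine operations
    within a level are independent of each other.
    """
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]   # bit propagate
    G = [((a >> i) & (b >> i)) & 1 for i in range(n)]   # group generate
    P = list(p)                                         # group propagate
    k = 0
    while (1 << k) < n:
        for i in range(n):
            if (i >> k) & 1:
                j = ((i >> k) << k) - 1   # last index of the lower block
                # G and P&G' are never both 1, so XOR can replace OR
                # (XOR gates are free in GMW and free-XOR Yao).
                G[i] = G[i] ^ (P[i] & G[j])
                P[i] = P[i] & P[j]
        k += 1
    total, carry = 0, 0
    for i in range(n):
        total |= (p[i] ^ carry) << i      # sum bit = propagate XOR carry-in
        carry = G[i]                      # carry out of bits 0..i
    return total | (carry << n)
```

The loop also works for bit-widths that are not powers of two, since indices beyond n are simply never visited.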
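The recursion structure can be sketched on plain integers as follows. The threshold of 20 bits is taken from the description above; the function names and the shift-and-add base case (standing in for the ripple-carry multiplier) are assumptions of this example rather than the Verilog implementation.

```python
def schoolbook(x, y, bits):
    """Shift-and-add multiplication; y must fit into `bits` bits."""
    acc = 0
    for i in range(bits):
        if (y >> i) & 1:
            acc += x << i
    return acc


def karatsuba(x, y, bits, threshold=20):
    """Multiply two `bits`-bit integers with three half-size products."""
    if bits < threshold:
        return schoolbook(x, y, bits)
    half = bits // 2
    mask = (1 << half) - 1
    x_lo, x_hi = x & mask, x >> half
    y_lo, y_hi = y & mask, y >> half
    lo = karatsuba(x_lo, y_lo, half, threshold)
    hi = karatsuba(x_hi, y_hi, bits - half, threshold)
    # The sums can be one bit wider than the larger half.
    mid = karatsuba(x_lo + x_hi, y_lo + y_hi, bits - half + 1, threshold)
    mid -= lo + hi
    return (hi << (2 * half)) + (mid << half) + lo
```

For example, `karatsuba(x, y, 64)` splits into three products of roughly 32 bits each, which in turn split once more before reaching the base case.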
Further Functionalities
Next, we list further functionalities that we implemented. Their circuit sizes and depths are equivalent to the hand-optimized circuits in [HEK12, SZ13]: ripple-carry adder and subtractor [BPP00, KSS09], $n \times \ell$-bit carry-save and ripple-carry network adders [Sav97, SZ13], multipliers and squarers [Sav97, KSS09, SZ13], depth-optimized multiplexer [KS08], comparators (equal and greater than) [SZ13], full-adder [SZ13] and Boyar-Peralta counters [BP06, SZ13], and the Sort-Compare-Shuffle circuit for private set intersection (PSI) [HEK12] and its building blocks (bitonic sorter, duplicate-finding circuit, and Waksman permutation network [Wak68]).
Matrix Multiplication. We implemented a size-optimized matrix multiplication circuit that computes one entry in the resulting matrix by computing dot products. This circuit is evaluated such that it computes the entries of the resulting matrix in parallel. Thereby, we can exploit the capability of the ABY framework to evaluate circuits in parallel, which reduces the memory footprint of the implementation. The circuit uses the Karatsuba multiplier and a ripple-carry network adder. It is configurable, i.e., we can set the bit-width and the number of elements per row or column n. The depths and sizes of these circuits are given in Tab. 3 and their performance is evaluated in §5.2.
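The decomposition into independent dot products can be sketched on plaintext integers as follows. The function names are hypothetical; in the actual setting each call to the dot-product function corresponds to one instance of the same circuit, and all instances can be evaluated in parallel.

```python
def dot_product(row, column):
    """One circuit instance: pairwise products followed by an adder network."""
    return sum(a * b for a, b in zip(row, column))


def matrix_multiply(A, B):
    """Compute A x B entry by entry; every entry is an independent dot product."""
    columns = list(zip(*B))  # transpose B so that columns can be iterated
    return [[dot_product(row, col) for col in columns] for row in A]
```

Because the entries do not depend on each other, only a single dot-product circuit has to be held in memory, regardless of the matrix dimensions.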
Floating-Point Operations and Integer Division
We generated floating-point operations using the DesignWare library [Syn15] , which is a set of building block IPs used to implement, among other operations, floating-point computational circuits for high-end ASICs. The library offers a suite of arithmetic and trigonometric operations, format conversions (integer to floating-point and vice versa) and comparison functions. The provided functionalities are parametrized allowing the developer to select the precision based on either IEEE single or double precision or set a custom-precision format. We can also enable the ieee_compliance parameter when we need to guarantee IEEE compatible floating-point numbers ("Not a Number" NaN and denormalized numbers). Some functionalities provide an arch parameter which can be set for either depth-optimized or size-optimized circuits.
Some of the floating-point functions provide a 3-bit optional input round, to determine how the significand should be rounded, e.g. 000 rounds to the nearest even significand which is the IEEE default. They also have an 8-bit optional output flag status, in which bits indicate different exceptions of the performed operation allowing error detection. We can choose to truncate or use these status bits as desired.
We generated circuits for floating-point addition, subtraction, squaring, multiplication, division, square root, sine, cosine, comparison, exponentiation to base e, exponentiation to base 2, natural logarithm (ln), and logarithm to base 2 for single precision, double precision and a custom 42-bit precision format for comparison with [ABZS13] . The 42-bit format consists of 32 bits for significand, one bit for sign and 9 bits for exponent distributed from MSB to LSB as follows: sign [41], exponent [40:32] and significand [31:0]. We extended the ABY framework with these floating-point operations and benchmarked them. We give runtimes, depths and sizes for various floating-point operations in §5.3.
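The bit layout of the custom 42-bit format can be made concrete with a small sketch that packs and unpacks the three fields. Only the field positions are taken from the description above; the helper names are hypothetical, and no assumption is made here about exponent bias or an implicit leading significand bit.

```python
SIGNIFICAND_BITS = 32   # bits [31:0]
EXPONENT_BITS = 9       # bits [40:32]


def pack42(sign, exponent, significand):
    """Assemble a 42-bit word: sign [41], exponent [40:32], significand [31:0]."""
    if sign not in (0, 1):
        raise ValueError("sign must be 0 or 1")
    if not 0 <= exponent < (1 << EXPONENT_BITS):
        raise ValueError("exponent does not fit into 9 bits")
    if not 0 <= significand < (1 << SIGNIFICAND_BITS):
        raise ValueError("significand does not fit into 32 bits")
    return (sign << 41) | (exponent << SIGNIFICAND_BITS) | significand


def unpack42(word):
    """Split a 42-bit word into (sign, exponent, significand)."""
    sign = (word >> 41) & 1
    exponent = (word >> SIGNIFICAND_BITS) & ((1 << EXPONENT_BITS) - 1)
    significand = word & ((1 << SIGNIFICAND_BITS) - 1)
    return sign, exponent, significand
```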
We also generated circuits for integer division for different bit-widths $\ell \in \{8, 16, 32, 64\}$ using the built-in DesignWare library [Syn15]. Another possibility for generating division circuits is to use the division operator '/', which will be implicitly mapped to the built-in division module in that library. As we optimize for depth, our circuits have size $O(\ell^2 \log \ell) \approx 24\,576$ gates for $\ell = 64$ but low depth 512. In contrast, optimizing for size would yield better size $O(\ell^2) \approx 3\ell^2 = 12\,288$ gates (for ADD/SUB, CMP, and MUX), but worse depth $O(\ell^2) = 4\,096$. We give circuit sizes and depths for integer division in Tab. 2 and benchmarks in §5.1.
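The size estimate for the size-optimized variant can be related to a textbook restoring long division, in which each of the iterations (one per quotient bit) performs one subtraction, one comparison and one multiplexer selection. The sketch below assumes this structure for illustration; it is not the DesignWare division module, and the estimate function only reproduces the arithmetic of the figures quoted above.

```python
def restoring_divide(dividend, divisor, bits):
    """Long division of a `bits`-bit dividend; returns (quotient, remainder)."""
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    quotient, remainder = 0, 0
    for i in reversed(range(bits)):
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        trial = remainder - divisor                  # SUB
        fits = trial >= 0                            # CMP
        remainder = trial if fits else remainder     # MUX
        quotient |= int(fits) << i
    return quotient, remainder


def size_estimates(bits):
    """Gate-count estimates from the text; `bits` must be a power of two."""
    log2_bits = bits.bit_length() - 1
    return {
        "depth_optimized": bits * bits * log2_bits,  # l^2 * log2(l)
        "size_optimized": 3 * bits * bits,           # 3 * l^2
    }
```

For 64-bit inputs the two estimates evaluate to 24 576 and 12 288 gates, respectively.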
BENCHMARKS AND EVALUATION
We extended the ABY framework [DSZ15] to read in the parsed and scheduled netlist generated by our hardware synthesis tool and evaluate it with ABY's optimized implementations of the GMW protocol and Yao's garbled circuits (cf. §3.4). In contrast to TinyGarble [SHS+15], which mainly focused on a memory-efficient representation of the circuits and gave only a single example for the time to securely evaluate the circuit, we measure the total execution times for several operations and applications: integer division (§5.1), matrix multiplication (§5.2) and an extensive set of floating-point operations (§5.3). For Yao's protocol we use today's most efficient garbling schemes implemented in the ABY framework [DSZ15]: free XOR [KS08], fixed-key AES garbling with the AES-NI instruction set [BHKR13] and half-gates [ZRE15]. For better comparability of the runtimes we use depth-optimized circuits for both GMW and Yao.
Compilation and synthesis times for the largest circuits (FP_EXP2, FP_DIV) using Synopsys DC are under 1 hour on a standard PC, but this is only a one-time expense, after which the generated netlist can be re-used without incurring compilation costs again.
We provide runtimes for the setup phase, which can be pre-computed independently of the private inputs of the participants, and for the online phase, which takes place after the setup phase is done and the inputs to the circuit are supplied by both parties. All runtimes are median values of 10 protocol runs. We measured runtimes on two desktop computers with an Intel Core i7 CPU (3.5 GHz) and 16 GB RAM connected via Gigabit-LAN. In all our experiments we set the symmetric security parameter to 128 bits.
Benchmarks for Integer Division
A complex operation that is not trivially implementable by hand is integer division, as described in §4.3. In Tab. 2 we list the runtime, split in pre-computation phase and online phase and list the circuit parameters for multiple input sizes. We compare our runtime with the runtime prediction of 32-bit integer long division of [KSS13] which we speed up by a factor of 32 and even more for Single Instruction Multiple Data (SIMD) evaluation. We also compare with the runtime of 3-party 64-bit integer division of [ABZS13] , which outperforms our single evaluation with GMW by a factor of 1.8. However, for parallel SIMD evaluation we improve upon their runtime by up to factor 3.7. When comparing to the 3-party 32-bit integer division of [BNTW12] , we achieve a speedup of 6.5 for single execution, while we require more than 5 times the runtime for 10 000 parallel executions.
Benchmarks for Matrix Multiplication
Matrix multiplication of integer values is an important use case in many applications. Here we exploit ABY's ability to evaluate circuits in parallel in a SIMD fashion and instantiate dot product computation blocks, each of which calculates a single entry in the result matrix. In Tab. 3 we give the runtimes for dot product computations of 16 values of 16 bit each or 32 values of 32 bit each, as described in §4.2. We compare with the 3-party secret-sharing based implementations of [BNTW12, ZSB13] as well as the 2-party arithmetic-sharing implementation of the ABY framework [DSZ15]. For this comparison we use the values reported in the respective papers and interpolate them to our parameters.
The secret-sharing or arithmetic-sharing based solutions outperform our Boolean circuits by several orders of magnitude due to their much faster methods for multiplication.
Benchmarks for Floating-Point Operations
There is a multitude of use cases for floating-point operations in academia and industry, ranging from signal processing to data mining, but due to the complexity of the format it has only recently been considered as application for secure computation [FK11] . Until today there are only few actual implementations of floating-point arithmetic in secure computation, all of which use custom-built protocols [ABZS13, KW14] . Instead, we use multiple standard floating-point building blocks offered by Synopsys DC and synthesize them automatically (cf. §4.3). Tab. 4 depicts the runtime in ms per single floating-point operation, when run once or multiple times in parallel using a SIMD approach. We compare our results for Yao and GMW with hand-optimized floating-point protocols of [ABZS13] , who used a 3-party secret sharing approach with security against semi-honest adversaries and desktop computers connected on a Gigabit-LAN for their measurements. The largest runtime improvements can be achieved when evaluating our generated circuits in parallel. We improve the runtime by up to a factor of 21 for parallel evaluation and show similar or somewhat improved runtimes for the lower parallelism levels reported. We can improve upon many results of [KW14] which is in the 3-party setting, except for highly parallel multiplication. We show that our automatically generated circuits are able to outperform hand-crafted circuits in many cases, especially for high degrees of parallelism. We give an application for floating-point arithmetic in §6.
Benchmark Evaluation
In general, when comparing the implementations of Yao and GMW in the ABY framework, we show that Yao outperforms GMW in most cases but scales much worse, up to a point where the largest circuits cannot be evaluated in parallel, due to the high memory consumption of Yao's protocol. GMW remains beneficial for highly parallel protocol evaluation, as the more critical online time scales almost linearly with the level of parallelism. The setup times of Yao and GMW are similar for all parameters. Our improved performance stems from both the optimized circuits generated by the state-of-the-art hardware synthesis tools, which we manipulate to optimize the circuits for either depth or size, and the efficient implementation of GMW and Yao's garbled circuits with the most recent optimizations in ABY. Since both protocols are based on Boolean circuits, we improve the performance of operations that require many bit operations. Operations that involve many integer multiplications are better suited for solutions based on arithmetic or secret sharing.
APPLICATION: PRIVACY-PRESERVING PROXIMITY TESTING ON EARTH
As an application for secure computation on floating-point operations, we consider privacy-preserving proximity testing on Earth [ŠG14]. Here, the goal is to compute whether two coordinates $C_A$ and $C_B$, input by party A and B respectively, are within a given distance $\epsilon$: $D(C_A, C_B) < \epsilon$. This is a useful but rather privacy-critical use case that has many applications, such as finding nearby friends, points of interest or targeted advertising, and is widely used with the recent spread of end-user GPS receivers and geolocation via IP addresses. The authors of [ŠG14] present and compare three different distance metrics: UTM, ECEF, and HS, described below. In their paper, the authors design secure protocols based on additively homomorphic encryption (HE) or Yao's garbled circuits (GC) that require quantizing all values to integers, which means a loss of precision. Instead, our framework allows computing the distance formulas directly on floating-point numbers with multiple precision options available and thus can offer a higher precision.
Universal Transverse Mercator (UTM). This distance metric maps Earth over a set of planes and provides accurate results if A and B are located relatively close to each other, within the same UTM zone.
In this metric coordinates are expressed as 2-dimensional points: $C_A = (x_A, y_A)$ and $C_B = (x_B, y_B)$. $D_{UTM}(C_A, C_B) < \epsilon \Leftrightarrow (\underline{x_A} - x_B)^2 + (\underline{y_A} - y_B)^2 < \epsilon^2$, where underlined variables are inputs of party A and the other terms are inputs of party B. For computing this formula we need 2 FP_SQR, 3 FP_ADD, and 1 FP_CMP operations.
Earth-Centered, Earth-Fixed (ECEF). This distance metric uses the Earth-Centered, Earth-Fixed (ECEF, also known as Earth Centered Rotational, or ECR) coordinate system which provides very accurate results when the parties are far apart.
The coordinates are expressed as 3-dimensional points where (0, 0, 0) is the center of the Earth: $C_A = (x_A, y_A, z_A)$ and $C_B = (x_B, y_B, z_B)$.
$D_{ECEF}(C_A, C_B) < \epsilon \Leftrightarrow (\underline{x_A} - x_B)^2 + (\underline{y_A} - y_B)^2 + (\underline{z_A} - z_B)^2 < 4R^2 a$, with $a = \frac{\left(\tan\frac{\epsilon}{2R}\right)^2}{1 + \left(\tan\frac{\epsilon}{2R}\right)^2}$. Underlined variables are inputs of party A and the other terms are inputs of party B. Computing this formula takes 3 FP_SQR, 5 FP_ADD, and 1 FP_CMP operations.
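The ECEF test compares the squared straight-line (chord) distance between the two points with a threshold derived from the permitted distance along the surface. The following plaintext sketch evaluates that threshold and the comparison; the Earth radius constant and the function names are assumptions chosen for illustration, and the code performs no secure computation.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # assumed mean Earth radius in meters


def ecef_threshold(max_distance, radius=EARTH_RADIUS_M):
    """Return 4 * R^2 * a with a = tan^2(d / 2R) / (1 + tan^2(d / 2R))."""
    t = math.tan(max_distance / (2 * radius)) ** 2
    return 4 * radius * radius * t / (1 + t)


def ecef_within(c_a, c_b, max_distance, radius=EARTH_RADIUS_M):
    """Proximity test on 3-dimensional ECEF coordinates (plaintext)."""
    squared_chord = sum((p - q) ** 2 for p, q in zip(c_a, c_b))
    return squared_chord < ecef_threshold(max_distance, radius)
```

Since tan^2(x) / (1 + tan^2(x)) = sin^2(x), the threshold equals the squared chord length (2R sin(d / 2R))^2 that belongs to an arc of length d on a sphere of radius R, and it can be precomputed once from the distance bound and R.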
Haversine (HS). This distance metric is based on the haversine (HS) formula which is a trigonometric formula used to compute distances on a sphere and is very accurate regardless of the position of A and B.
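The coordinates are expressed as spherical coordinates with latitude (lat) and longitude (lon): $C_A = (lat_A, lon_A)$ and $C_B = (lat_B, lon_B)$.
$D_{HS}(C_A, C_B) < \epsilon \Leftrightarrow \underline{\alpha^2} \cdot \beta^2 - \underline{2\alpha\gamma} \cdot \beta\delta + \underline{\gamma^2} \cdot \delta^2 + \underline{\zeta\theta^2} \cdot \eta\lambda^2 - \underline{2\zeta\theta\mu} \cdot \eta\lambda\nu + \underline{\zeta\mu^2} \cdot \eta\nu^2 < a$, with $a$ as defined above and $\alpha = \cos(lat_A/2)$, $\gamma = \sin(lat_A/2)$, $\zeta = \cos(lat_A)$, $\theta = \sin(lon_A/2)$, $\mu = \cos(lon_A/2)$, $\beta = \sin(lat_B/2)$, $\delta = \cos(lat_B/2)$, $\eta = \cos(lat_B)$, $\lambda = \cos(lon_B/2)$, $\nu = \sin(lon_B/2)$. Underlined terms are inputs of party A while all other terms are inputs of party B. Computing this formula requires 6 FP_MULT, 5 FP_ADD, and 1 FP_CMP operations.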
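The expanded expression separates the terms that each party can precompute locally, so that only six products and five additions remain for the joint computation. As a plaintext sanity check, the sketch below evaluates the expansion and compares it with the closed haversine form sin^2((lat_A - lat_B)/2) + cos(lat_A) cos(lat_B) sin^2((lon_A - lon_B)/2). Folding the factors -2 into party A's terms is an assumption of this example, as are the function names; angles are in radians.

```python
import math


def hs_expanded(lat_a, lon_a, lat_b, lon_b):
    """Left-hand side of the HS test as a sum of six cross-party products."""
    # Values party A can compute locally.
    alpha, gamma = math.cos(lat_a / 2), math.sin(lat_a / 2)
    zeta = math.cos(lat_a)
    theta, mu = math.sin(lon_a / 2), math.cos(lon_a / 2)
    # Values party B can compute locally.
    beta, delta = math.sin(lat_b / 2), math.cos(lat_b / 2)
    eta = math.cos(lat_b)
    lam, nu = math.cos(lon_b / 2), math.sin(lon_b / 2)

    a_terms = (alpha**2, -2 * alpha * gamma, gamma**2,
               zeta * theta**2, -2 * zeta * theta * mu, zeta * mu**2)
    b_terms = (beta**2, beta * delta, delta**2,
               eta * lam**2, eta * lam * nu, eta * nu**2)
    # 6 multiplications and 5 additions remain for the joint computation.
    return sum(x * y for x, y in zip(a_terms, b_terms))


def haversine_term(lat_a, lon_a, lat_b, lon_b):
    """Closed form of the same quantity, used here only for comparison."""
    return (math.sin((lat_a - lat_b) / 2) ** 2
            + math.cos(lat_a) * math.cos(lat_b)
            * math.sin((lon_a - lon_b) / 2) ** 2)
```

Both functions agree up to floating-point rounding, and the result is then compared against the threshold a.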
Performance. We implemented the three proximity testing algorithms from [ŠG14] using our floating-point building blocks. In Tab. 5 we compare the runtime of the original implementation of [ŠG14] that uses homomorphic encryption (HE) and Yao's Garbled Circuits (GC) with our implementation based on GMW and Yao for single and parallel evaluation. We are able to achieve better runtimes for single executions of the protocol (by factor 6.2 for HS and more than factor 14 for UTM and ECEF), and more than two orders of magnitude speedup for highly parallel execution. Thereby, we show that our approach substantially improves upon the runtime of hand-crafted protocols while at the same time benefiting from the heavily tested and verified circuit building blocks of industrial-grade hardware synthesis libraries.
