Abstract. A number of homomorphic encryption application areas, such as privacy-preserving machine learning analysis in the cloud, could be better enabled if there existed a general solution for combining sufficiently expressive logical and numerical circuit primitives to form higherlevel algorithms relevant to the application domain. Logical primitives are more efficient in a binary plaintext message space, whereas numeric primitives favour a word-based message space before encryption. In a step closer to an overall strategy of combining logical and numeric operation types, this paper examines accelerating binary operations on real numbers suitable for somewhat homomorphic encryption. A parallel solution based on SIMD can be used to efficiently perform addition, subtraction and comparison operations in a single step. The result maximises computational efficiency, memory space usage and minimises multiplicative circuit depth. Performance of these primitives and their application in min-max and sorting operations are demonstrated. In sorting real numbers, a speed up of 25-30 times is observed.
Introduction
Homomorphic Encryption (HE) is seen as an important technology for enabling computation and analytical processing on encrypted data outsourced to third parties. This would greatly facilitate mainstream utilisation of public cloud resources for sensitive data analysis with a significantly reduced chance of information leakage should any system breach occur. Fully Homomorphic Encryption (FHE) schemes, which support addition and multiplication operations on the same ciphertext, have only recently come about starting from the work on ideal lattices by Craig Gentry in 2009 [1] . Although the original scheme was highly inefficient, newer generation FHE schemes have since rapidly evolved that are several orders more efficient though not yet practical enough to be considered for mainstream application use.
This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
Somewhat Homomorphic Encryption (SWHE) schemes remove the requirement of bootstrapping or recryption which typically takes up a very significant amount of the total FHE execution time. This however places a limit on the number of combined operations that can be performed on the 'noisy' ciphertexts before successful decryption is no longer possible. Considerable research has been dedicated so far to developing various optimisations and batching techniques to improve the efficiency of SWHE schemes and to maximise the depth of the circuits that can be evaluated without bootstrapping. A number of newer generation schemes were developed based on the more efficient and simple Learning with Error (LWE) problem introduced by Regev in 2009 [2] (and its later ring variant) rather than on lattices, although the hardness problems were related [3] . A FHE scheme based on key-switching and modulus-switching optimisation techniques (see Section 3.2) became known as the BGV scheme named after its authors Brakerski, Gentry and Vaikuntanathan. Other SWHE scheme variants include Ring-LWE (RLWE) scale-invariant versions and NTRU-based schemes [4] .
Despite the aforementioned optimisations, there are still significant (space and time) demands on computational resources and network bandwidth to overcome. The fact is that SWHE schemes are obliged to have either a massive ciphertext expansion or an 'ugly' algebraic structure in order to maintain their security [5] . Ciphertext expansion ratios are typically in the order of GBs in size for only a relatively small amount of plaintext input, resulting in high computational and communication costs [6] . Ciphertext packing techniques such as homomorphic Single Instruction Multiple Data (SIMD) proposed by Smart and Vercauteren [7] , maximise memory space usage and achieve parallelism of repeated operations by enabling the combination of multiple smaller ciphertexts into a single larger ciphertext.
Currently available SWHE implementations include additional optimisations to improve efficiency of processing large ciphertexts. A prominent open-source RLWE implementation of the BGV scheme called HElib 1 by Halevi and Shoup [8] incorporates many such optimisations (see Section 3.3) , and is used to efficiently implement algorithms described in this paper.
Much research effort has also been dedicated around circuit design to support various HE application areas. The main aim is to optimise use of the SWHE tools as much as possible by designing algorithms in a way that reduces both the consecutive and total number of multiplications. The former minimises circuit depth to avoid bootstrapping while the latter reduces overall computational cost. SIMD techniques can be used to reduce the occurrence of both multiplication instances, in addition to maximising memory space usage. SWHE tools are often applied in highly tailored ways against the particular application scenarios, in order to deal with narrow parameter and computational efficiency constraints. Tightly coupling HE methods in this way limits their generalisability. The extent to which a HE solution can support general computing functions across a wide scope of application areas depends on the availability of sufficiently expressive composable cryptographic primitives designed to work efficiently within a HE context.
Developing even the most basic primitives such as simple numeric and logical operations require non-trivial choices, including whether to perform evaluations over binary or arithmetic circuits (or polynomial rings), as well as the choice of data encoding. These factors have a significant bearing on low-level primitive circuit performance, which have a further flow-on effect on the performance of higher-level algorithms (eg., min-max and sorting methods), more directly relevant to practical use cases [9] .
Binary circuits, which rely on encoding data at the bit-level before encrypting (ie., use a plaintext message space of Z 2 ) naturally support boolean comparison operations, whereas they are very slow in performing arithmetic operations [10] . Word-based encoding (ie., encrypting as elements of a plaintext modulus of p or a Z p message space, for a large enough p) allows for simpler and more efficient ciphertext multiplication and addition that are easily parallelised; this is however at the cost of more complex comparison and thresholding operations which are fundamental to many of the HE application areas [10] .
In the area of data encoding, ciphertexts encoded as real numbers represent the most practical solution for most application areas. Techniques to represent fixed point numbers (encoded as polynomials) include scaled integer and fractional representations [11] . Encoding floating point numbers require independent storage of the encrypted significand from the exponent, necessitating separate homomorphic operations on each during calculations [12, 13] . Options also exist for encoding negative numbers, using either a sign-magnitude or two's complement approach [9] . The approaches used to represent real numbers (fixed or floating point) in either type of plaintext message space (Z 2 or Z p ) can have a significant impact on the efficiency of numeric and/or logical operations [9] .
Our Contribution
The research presented in this paper aims to minimise circuit depth, and to maximise memory space utilisation and HE computational performance on real numbers under a binary message space. The approach combines primitive logical functions together with binary addition and subtraction operations using SIMD techniques. The result is to permit natural comparison and thresholding operations using binary circuits while at the same time facilitating an efficient amortised rate for addition and subtraction operations performed in parallel. Such a technique would find use in particular implementation instances across all HE applications areas where combinations of logical and numeric operations are required. The technique would also complement other strategies for performing arithmetic operations, including multiplication, in larger (p > 2) message spaces. Finally, based on this approach, performance of the fundamental basic primitive operations and their application to various more complex SWHE algorithm implementations on real numbers are demonstrated using HElib.
Related work on encoding, addition, equality and comparison HE operations will be discussed during development of our binary circuit primitives in relevant parts of Sections 3 and 4. The outcome of our circuit development is to enable more efficient complex homomorphic operations, which are derived from these basic circuits, using parallel batching techniques. Two candidate algorithms for demonstrating potential application of these primitives including sorting and finding the minimum or maximum (min-max) from a group of encrypted numbers.
The task of sorting encrypted numbers using FHE was first implemented in 2013 by Aguilar-Melchor et al. [14] using a Bubble sort, and by Chatterjee et al. [15] using a two-stage Bubble and Insertion sort. The hcrypt 2 library, based on the work of Smart and Verteceuran [16] was used in both cases, although the former also experimented with an early BGV-based scheme. Chatterjee examined minimising the recryption operation and as was able to achieve sorting of 40 (32-bit) integers in 1399 seconds. The hcrypt library however does not take advantage of modern noise reduction or SIMD batching techniques [17] , which are factors both relevant to additional improvements in sorting performance.
The seminal work by Ç etin et al. [18] in 2015 implemented two low-depth sorting algorithms both relying on a matrix to perform all comparisons at the outset, and achieving a multiplicative circuit depth of O(log N + log n), where N is the total number elements to be sorted, and n is the number of bits encoding each element. Ç etin also introduced the concept of using SIMD to accelerate sorting performance, although it is used in this case to sort multiple lists rather than accelerate sorting of a single list. The amortised running time per sort is reduced since it takes the same amount of time to sort one list compared to multiple groups of similar-sized lists up to the total number of available ciphertext slots, s. Sorting one list however inefficiently uses one ciphertext per bit, therefore the implementation is computationally expensive. Using an NTRUbased SWHE scheme, sorting 16 (32-bit) integers is inferred to take about 45 mins. Dai & Sunar [19] boosted the SWHE scheme's sorting performance using a GPU-accelerated library 3 to achieve a speed-up of 12-41 times (reportedly, 1 min 38 sec to sort 16 integers of size 32 bits).
Kim et al. [20] , appears to be the first work to report using SIMD batching to accelerate sorting performance within a single list. The authors employ HElib to implement a bitonic sorting algorithm, which is a data independent (ie. constant number of comparisons irrespective of input) sorting algorithm, consisting of recursive sort and merge operations. Only half of the available data slots are used to insert elements within its ciphertext packing structure. The sorting network is made up of two bitonic (ie. increasing then decreasing) sub-sequences, each initially with two elements for the base case. The bitonic sort algorithm has a O(N log 2 N ) comparison complexity and a multiplicative depth of O(log n · log 2 N ). Additional details can be found in [20, 21] . Whilst benchmarking Ç etin's work, [20] argues that the multiplicative circuit width in addition to its depth are important in determining the number of recryptions required for sorting large lists, which is the most dominant bottleneck in all FHE operations. The sorting algorithm based on the comparison matrix (having O(N 2 ) comparators) was shown to have a large width, and ultimately requiring more recryptions compared to the bitonic sort, despite the latter having a larger multiplicative circuit depth. An improvement in the number of recryptions was confirmed using real implementations of the sorting algorithms, although performance statistics on smaller sorting sets was not shown.
Literature specifically looking at finding min-max on encrypted number sets using FHE is limited. Kocabaş et al. [22] uses the HElib library for min-max heart rate computation involving a logical comparison circuit examined later in Section 5.3. A log 2 N stage binary tree is used to repeatedly applying min-max on N ciphertexts packed with a vector of n-bit integers, resulting in an overall circuit depth of (log 2 n + 2) · log 2 N . It would appear further processing is necessary (additional circuit depth of log 2 n) to extract the final minimum or maximum value within the output ciphertext in a step not apparently described; perhaps being recovered after decryption. The min-max algorithm described by Togan et al. [23] uses a recursive strategy to combine, in the base case, a one bit integer 'greater than' comparison relation with a selection operator both expressed in polynomial form. The overall circuit depth is similar to the approach by Kocabaş, although no ciphertext packing is involved. A HElib-based implementation finds the maximum value from 16, 8-bit integers in 21 minutes. Set at a 140 bit security level, the implementation is very memory intensive at 3.8 GB.
It is noteworthy that all literature reviewed only examines sorting and finding min-max from sets of encrypted integers rather than real numbers, despite both number types requiring an identical amount of processing time to sort when correctly encoded, for a given bit size.
Preliminaries

Homomorphic Encryption
Any HE scheme consists of four probabilistic polynomial time algorithms Keygen, Enc, Dec, Eval. In this case, all algorithms implicitly take in a number of parameters as input including dimension n, modulus q and error distributions, according to RLWE constructions:
-Key generation (pk, sk ← Keygen(1 λ )): takes in security parameter λ as input and outputs a public encryption key pk and a secret decryption key sk.
-Encryption (c ← Enc(pk, µ)): takes pk and a plaintext µ in a message space M defined by some ring R M . The output is a ciphertext c in space C (and ring R C ).
-Decryption (µ ← Dec(sk, c)): decrypts c back to µ using the secret key sk.
-Evaluation (c f ← Eval(pk, f, c)): characterises a HE scheme, taking as input the public key pk, a function f : R l M → R M and a tuple of l ciphertexts c 1 , ..., c l , to output a ciphertext c f .
In this work, f will be represented by an arithmetic circuit over GF(2) that can be evaluated by a SWHE scheme to a limited depth so that Dec can recover µ correctly without requiring bootstrapping.
The BGV SWHE Scheme
The RLWE variant of the BGV scheme can be represented as follows [3, 8] : With λ as the security parameter (ie., all known valid attacks taking Ω(2 λ ) bit operations) and an integer m = Ω(λ) that defines the m-th cyclotomic polynomial Φ m (X), we define the polynomial ring A = Z[X]/Φ m (X). m, which roughly corresponds to the dimension of the underlying lattice, determines the size of computation and hence represents efficiency of the encryption scheme. The plaintext message space is set to
(although in this work we will always use A 2 ). The ciphertext space is set to A q := A/qA for an odd integer modulus q with a degree up to φ(m) − 1 4 . Ciphertexts are 2-element vectors defined over A qi (or ring R qi ) at any particular level i, ie., c = (c 0 , c 1 ) ∈ R 2 qi . In the modulus-switching procedure, by switching to a ciphertext with a smaller modulus (from q i to q i−1 ), an increase in the number of multiplications could occur from log L to L levels before bootstrapping is required and without involving a secret key [3, 24] . At each ciphertext level (L, ...1, 0), the modulus decreases with every application of modulus-switching to reduce the noise during homomorphic evaluation. No further noise reduction is possible in level 0, requiring bootstrapping to enable further computation.
As there are a few variations to the RLWE assumptions involved in encryption, we use that proposed by Lyubashevsky, Peikert, Regev (LPR) as a practical approach [25] . In set-up of the scheme, a (discrete Gaussian) noise distribution χ is defined over the set of all ring elements A q with a small norm bounded by some B. Sample s ← χ and e ← χ. Generate a 1 ← R q L uniformly at random and calculate a 0 = [−(a 1 s + pe)] q L . Set the secret key sk = s and the public key pk = (a 0 , a 1 ).
To encrypt a plaintext polynomial m ∈ A p , sample u, g, h ← χ and compute the ciphertext which is initially in level L:
At any level, i the equality over
, which is the decryption formula. The term [c 0 + sc 1 ] qi is the 'noise' in the ciphertext that grows with each homomorphic evaluation. The ciphertext is valid as long as this noise does not wrap around modulo q i , which gives bound on the norm of the noise to be always below B. Under the RLWE assumption, the encryption scheme is semantically secure (ie., against honest-but-curious adversary behaviour) if the public key (a 0 = −(a 1 s + pe), a 1 ) ∈ R 2 q , and by extension the ciphertext (c 0 , c 1 ), are computationally indistinguishable from uniform in R 2 q . Homomorphic addition involves simply adding two ciphertext vectors under the same key s over A qi . The noise of the added vector is at most 2B. Homomorphic multiplication involves a tensor product over A qi resulting in squaring of both ciphertext and (new self-tensored) secret key dimensions. While the current modulus remains unchanged, the noise of the product ciphertext is bounded by B 2 . A key-switching (or relinearisation) technique is required to reduce the dimension of tensored secret key and ciphertexts back to original dimensions. At each multiplication level, the tensored secret key would be encrypted under a different secret key as a hint to facilitate conversion to the new valid encryption. A single key however is only required for all levels if it can be assumed safe to encrypt the secret key under its own public key (circular security assumption) [3] .
SIMD makes use of the fact that smaller plaintext spaces could collectively be considered a vector of independent plaintext slots (for performing componentwise addition or multiplication) when encrypted into much larger ciphertext spaces by virtue of the Chinese Remainder Theorem (CRT). Since the plaintext space is over
. Each factor corresponds to a 'plaintext slot' of degree d, which is the smallest integer such that p d = 1 mod m, and thus the number of slots, s = φ(m)/d. The plaintext polynomial a ∈ A p therefore can be viewed as a vector of s different small polynomials, (a mod
The slot values can represent elements of the extension field F p d (or embedded elements of the subfield F p n where n divides d), rather than individual bits [26] .
Implementing linear rotations and shifts to move data across plaintext slots can be achieved with an automorphism operation without any increase in ciphertext noise [8] . The automorphism is represented by the transform, a → a
where
Often a combination of automorphism operations with selective slots masked out is necessary to achieve arbitrary permutations on packed ciphertexts. Key-switching is also required to transform a ciphertext involved in an automorphism into another valid ciphertext that is decryptable with respect to the original secret key. In this case, secret key and ciphertext dimensions remain unchanged. Note, there is a computational cost involved in combining automorphisms followed by key switching to achieve permutations on plaintext slots. For further details, refer to [26] .
HElib
HElib includes many optimisations over the basic BGV scheme, including support for recryption and SIMD. Due to a specialised key-switching procedure, each q i is a product of smaller set of primes p j , (ie., q i = i j=0 p j ). As a result, ciphertext elements of the ring A q i are represented with respect to both polynomial vectors modulo each small prime and the primitive m'th roots of unity evaluations generated using the Fast Fourier Transform (FFT). This so-called DoubleCRT format allows point-wise addition and multiplication of elements in A q i in linear time [27, 8] . The encryption method involved in HElib varies slightly compared to the aforementioned LPR method: It uses a native plaintext space of R p r for a prime power p r . The secret key, s is the 2-vector (1, s) ∈ R 2 , where s is chosen randomly from {0, ±1}
φ(m) with a recommended Hammingweight (ie., number of non-zero coefficients) of 64. If Q ct is the product of all ciphertext primes, initially generate c 1 ← A Qct uniformly at random and e ← χ where χ ∈ A Qct , and set
r e in R where m ∈ R p r is the message plaintext and p r e is the error term, which has a small norm relative to q i [8] .
Encoding Real Numbers
As mentioned, ease of implementation and efficiency of operations on binary circuits relies on the specific encoding method used to represent real numbers. A fixed point representation is the simplest one for demonstrating circuit operations, which can be applied to the polynomial plaintext ring, R p and whose coefficients represent bits in a binary sequence. Two methods proposed by Dowling et al. [11] use scaled-to-integer and fractional encoding representations of fixed point real numbers, which are demonstrated by Costache et al. [28] to be isomorphic under the same power of two cyclotomic ring. Strictly speaking, fractional encoding cannot operate under a binary message space whereas the scaled-tointeger approach can perform computations over both binary or arithmetic circuits. It is nevertheless worthwhile considering both fixed point representations as researchers have previously sought to avoid costly bit operations by turning to the plaintext space, R p [28] .
The scaled-to-integer approach scales a binary representation of a real number by some power of 10 into an integer according to some fixed digit precision. Despite literature criticism about the requirement for complex book-keeping in order to ensure ciphertexts are correctly scaled in homomorphic operations [11, 28] , we use the approach suggested by Jäschke et al. [9] of multiplying a real number by a power of two according to k bits of precision. In this case, the output of each homomorphic multiplication between two real numbers should be truncated by deleting the last k bits (equivalent to dividing by 2 k ), to bring the product back to the required precision. Under this approach, there is no requirement to keep track of scaled products involved in computations before final conversion to the decrypted output, which could otherwise potentially leak data or computation information to an adversary.
The fractional encoding scheme is implemented over the cyclotomic ring R = Z[X]/(X d + 1) where d is a power of two. The integer portion to the left of a decimal point can be represented in binary as usual while the fractional part of a real number to the right of a decimal point are represented by the highest polynomial coefficients. The latter are designated −X d−i b −i , where i is the bit position to the right of the decimal point. Intuitively, due to the modulus of R, X d = −1 and so the highest coefficients can also be represented as X −i b −i , which is as desired. Under this scheme, a number of coefficients are set aside for the fractional part of the encoded polynomial, p ∈ R. Negative numbers are represented by swapping coefficient signs. Under p, addition and multiplication work as expected and p(2) decodes to the original real number [11, 28] .
The problem with a fractional encoding scheme is that it does not appear to be compatible with SIMD operations, since the plaintext slots depend on irreducible modulus factors, F i (X), which is not the case with the modulus of R. The fractional encoding scheme can handle wrap-around modulo X d + 1 with operations involving p, although with too many multiplications, coefficients reserved for the fractional and integral parts quickly grow towards each other to yield unexpected results when the two parts coalesce. In a power of two scaled-to-integer encoding, wrapping around slot values modulo F i (X) will yield unexpected results, placing a lower bound on the ring degree of polynomials involved in operations. In practice, this usually means ensuring that the degree of two polynomial inputs involved in multiplication are each less than half of the degree, d of the plaintext slots in order to prevent wrap-around. Due to truncation of the last k bits to correct for precision in multiplication, this has the effect of lowering the degree of the polynomial product, hence multiplication can continue indefinitely in this way, albeit within ciphertext noise limits.
We choose a slight refinement to Jäschke's approach when considering the scaled-to-integer representation of real numbers which, despite having identical encoding to the original method, is conceptuality simpler when applied computationally and for descriptive purposes. After scaling a real number by 2 k and encoding it in binary, dedicate k of the lowest polynomial coefficients to encode the fractional part of a real number with binary digits b −1 b −2 ...b −k , followed by the integral portion encoded with binary digits b I + ...b 1 b 0 , and represented as:
Although binary digits are shown, any base (b > 2) can be chosen to represent a real number using this scheme (whether balanced or not), similar to fractional encoding.
For negative numbers, a two's complement rather than a sign-magnitude encoding is chosen since the former is more efficient in comparison and addition operations [9] , which have relevance to the circuit primitives discussed in the following section. A two's complement encoding of a negative real number is computed by inverting the scaled binary encoding of the positive real number and adding one. In this case, the 'highest' polynomial coefficient is reserved for encoding the negative value −2 d−k−1 , and sign extension is used to extend the bit length of the encoding to d. Using an example, the real value -1.25 within a plaintext slot where d = 8 and 4 bits of precision are required, can be two's complement encoded as a ring polynomial under extension field F 2 8 whose coefficients represent the bit sequence 11101100. The least four significant bits represent the fractional part while the Most Significant Bit (MSB) indicates the encoded value is negative (with two adjacent sign-extension bits). The final result can be decoded as: p(2) = −2 3 + 2 2 + 2 1 + 2 −1 + 2 −2 = −1.25. Under this scheme, addition and multiplication also work as expected.
Circuit Primitives
Ultimately, primitive binary circuits for addition and equality are alone necessary to combine a range of logical operations together with addition and subtraction on real numbers under a binary plaintext space. Finding efficient versions of these primitives has a flow-on effect to more complex homomorphic operations that are derived from these two basic circuits. At a bit level, XOR (⊕) and AND (·) logic gates are implemented slot-wise over a binary plaintext space via corresponding homomorphic addition (+ h ) and multiplication (× h ) operations between ciphertexts respectively. All other logic functions can be derived from these two logic gates. This includes an OR operation which is implemented as the binary inverse (by adding 1 in binary) of an AND (ie., NAND) operation. When combining a number of AND operations, ciphertexts are commonly arranged into a binary tree formation so that terms are multiplied from base to root rather than sequentially [29, 30, 6] . The effect is to reduce the multiplication depth of the circuit from n − 1 to log n, and can be readily implemented using SIMD operations.
The following sections describe various primitive circuits that can perform binary addition and equality operations, with the aim of finding optimal parallel designs to minimise the three resource parameters: memory usage, multiplicative depth and total number of multiplications. Circuits will be described generally with focus being given to the final most competitive versions. All circuits assume a ciphertext structure that takes up one slot per plaintext bit to work correctly including SIMD operations.
Addition Circuit
The most basic adder implementation is the n-bit Ripple Carry Adder (RCA) which employs n full adders, with the carry bit output from one adder acting as input into the next adder along a chain from least to most significant bits. This 'rippling' dependency of carry bits means that the circuit is unsuitable for parallel operations. The logic for a sum of two bits, a i , b i and carry-in bit, c i of the full adder is expressed as, [29] . The multiplicative depth of this adder hence is limited to n − 1. There are bit-wise and packed ciphertext RCA protocol implementations available [31] . Despite the latter potentially making use of SIMD operations, there is no real benefit conferred with respect to resource parameters used and circuit depth (refer to Table 1) .
A Carry Lookahead Adder (CLA) aims to avoid the 'rippling' effect by allowing the incoming carries to be computed in parallel. The original carry-out logic can be reformulated as, c i+1 = g i ⊕ p i · c i where g i = a i · b i is called the generate term, and p i = a i ⊕ b i , is called the propagate term. According to the carry-out logic, if g i is one (a i and b i are both one), c i+1 is one; and if p i is one (at least one of a i or b i is one), then c i is propagated to c i+1 . Since neither g i nor p i depend on the carry, they can be determined beforehand and in parallel. A pattern emerges whereby for each carry bit:
The final sum is computed as s i = p i ⊕ c i . A CLA has a lower multiplicative depth but a higher computational cost compared to the RCA. By sub-dividing n-bits into smaller equal m-bit sized blocks, this cost can be reduced. The smaller m-bit adders can be arranged into a CLA between blocks while carries within a block are rippled (or vice-versa) to realise Equation 2. In this case, the circuit depth is Θ(m + n m ) and minimised when m = Θ( √ n) [31] . Summary statistics for a CLA with an m-bit block ripple adder design [31] are presented in Table 1 .
CLAs form the basis of another group of adders called Parallel Prefix Adders (PPAs), which perform parallel group carry operations. These adders are known to be amongst the most efficient of all digital binary addition circuits [32] . The Kogge-Stone Adder (KSA) is one of the fastest adders of this type where every propagate and generate bit is computed in parallel. The circuit has a minimum possible multiplication depth and scales logarithmically, although at the expense of long wiring, more gates and a larger area requirement in digital hardware implementations [32] . These disadvantages tend to be much less significant in computing systems. In fact, the KSA appears particularly amenable to homomorphic SIMD operations as the protocol entails applying identical instructions uniformly across all plaintext slots, staged between shifting slots relative to two interacting ciphertexts. An 8-bit KSA circuit is illustrated in Fig. 1 . Fig. 1 . The Kogge-Stone Adder Circuit [33] .
As shown, each circuit node combines p and g recursively from two 'bit' columns to form a new set, (g = g ⊕p ·g , p = p ·p ). Starting at level 0, column i is combined with column i + 2 k up to level k = (log 2 n) − 1 to eventually realise Equation 2. There are bit-wise and packed ciphertext protocol versions of the KSA [31] . In the latter case, all bits from a set of slots encoded within a base ciphertext can be combined in one step with another ciphertext whose set of slots are shifted in each level by 2 k towards the MSBs. To compute the final sum in parallel, a single multiplication depth is required initially to compute all g's, and subsequently between all k level carry operations, resulting in an overall multiplication depth of log 2 n and a total cost of 2 · log 2 n multiplications. The KSA appears to be the most efficient binary addition circuit that minimises all three main resource parameters (Table 1 ). These findings are reasonably comparable with those by King [31] and Kocabaş et al. [30] who employ the KSA. This adder appears to outperform various alternative techniques used for binary addition outlined in [9, 29, 34, 6] .
Binary addition can also be used to perform subtraction using the trick of adding the first number with the two's complement negative encoding of the second number. This calculation can be incorporated into the adder algorithm by inverting the second encoded real number in addition to setting the first carryin bit (c in in Equation 2), to 1. The circuit requires only one further addition operation without any increase in its multiplicative depth. Table 1 . Comparison of addition and equality binary circuit parameter costs. nbit number input; m -bits per block. Algorithms (Alg.) shown: RCA -Ripple Carry Adder; CLA -Carry Lookahead Adder; KSA -Kogge-Stone Adder; Eq -Equality; and Comp -Comparison Logic; includes bit-wise (b) and packed (p) ciphertext versions. Memory (Mem.) is the minimum number of ciphertexts required to fulfil the protocol. lg n represents log 2 n .
Alg.
Add Shift Mult. Depth Mem.
lg n + 2 2 lg n 2 lg n lg n 4 Eq(p) 2 lg n lg n lg n 2 Comp(p) 2 lg n + 3 2 lg n + 1 lg n + 2 lg n + 1 4
Equality Circuit
A common binary equality circuit implementation with optimisation for packed ciphertexts (albeit described for integers) is based on [34, 35, 18] . The equality test also holds for a two's complement scaled-to-integer encoding of a real number and is depicted as:
The output is 1 if a and b are equal, otherwise it is 0. In this case, n − 1 multiplications are required which can be reduced to a multiplicative depth of log 2 n using a binary tree structure. The application of SIMD involves accumulating products of a packed ciphertext containing initial bit-wise values of a i ⊕ b i ⊕ 1. The slots are progressively shifted by 2 k in each level towards the MSBs and multiplied in parallel with the previous level's result. The MSB will contain the final output for the equal circuit after log 2 n levels [34] . An alternative divide-and-conquer strategy for determining integer equality with a similar multiplicative depth is described in [23] , although n − 1 homomorphic multiplications are required instead of log 2 n.
Comparison Circuit
A comparison circuit compares the magnitude between two real numbers, indicating one number is larger or smaller compared to the other. Only one type of comparison circuit is required in combination with the equal circuit, to derive all other comparison relationships (ie., <, >, , ; see below). To enable a lessthan comparison, there are two binary circuits of note [22, 18] . One determines the comparison logically while the other is derived from the addition circuit.
The comparison logic can be depicted in the following way:
). An implementation of the circuit using SIMD involves progressively shifting slots containing the term a j = b j by 2 k in each level towards the Least Significant Bits (LSBs) and multiplied with the previous level's results. The final slot output is multiplied by the expression a i < b i resulting in a multiplicative depth of log 2 n + 1 and a total of log 2 n + 2 multiplications (extra AND gate in the '<' term) [22] . Calculating a final sum of products using a binary tree structure, requires an additional log 2 n shifts and homomorphic additions to reveal the comp output. Comparison logic furthermore needs to address signed numbers in two's complement encoding. In particular, the inverse of the comp output should be returned when the signs differ between the two real numbers being compared. The final output therefore should be modified to, comp(a, b) ⊕ a n−1 ⊕ b n−1 (not accounted for in Table 1 ).
Comparisons based on the binary addition circuit use subtraction to compare two real numbers followed by testing of the sign bit of the result [30] . In a lessthan comparison, the following relationship holds: comp(a, b) = (a − b) n−1 . Subtraction can be readily implemented using the KSA described earlier.
At first glance, it might appear that the logical comparison circuit, despite having a greater multiplicative depth by 1, is overall more efficient due to its lower cost of log 2 n + 2 multiplications compared to 2 · log 2 n for the KSA comparator. However, the logical comparison algorithm has greater complexity around slot set-up and selection mask usage to perform sign correction. Such additional costs are not factored into the final parameter estimations presented in Table 1 . Ultimately, the multiplication costs become comparable between both circuits (when n 64 bits), while all other cost parameters are exceeded for the logical comparator. Alternative comparison circuits reporting a much higher total number of homomorphic multiplications, albeit of similar multiplicative depth, appear in [34, 23] .
Once c = comp(a, b) and e = equal(a, b) have been determined, all remaining comparison relationships can be derived using binary logic and at little additional cost as follows: a b = c + e; a > b = c + e; and a b = c.
Further Optimisation
From the group of primitive binary circuits reviewed in Table 1 , the packed versions of the KSA (for '+', '−' and '<' operations) and Equality (equal) circuits appear the most competitive from an overall memory, computation cost and multiplicative depth standpoint. Despite overall efficiency of these chosen schemes, each ciphertext input still only encodes a single real number, leaving many (potentially hundreds of) unused slots. Fortunately with very little modification to the basic circuits, both primitives can be further accelerated for free. Based on this approach, multiple real number computations can occur in one step within a single ciphertext per input. The parallelism across multiple primitive circuits is in addition to that occurring within each primitive (as outlined in Section 5), and both rely on SIMD. Arranging both KSA and Equality circuits in parallel to process multiple real number inputs and accelerate HE performance has not been previously described in the literature reviewed. Combining both parallel primitive circuits enables multiple arithmetic and logic operations, which are built from one or both primitives, to be executed in one step. Analyses and computations that benefit from parallel execution of the candidate sub-operations (+, − =, <, >, , ), would have a significantly reduced amortised running time.
The packing structure of the ciphertext inputs are similar for both accelerated primitives, as illustrated in Fig. 2 . Packed ciphertexts A and B contain multiple operand pairs providing input to a corresponding array of candidate sub-operations to be executed in parallel. A ciphertext consisting of s slots can pack s/n data slots, each encoding an n-bit real number. The last (s mod n) slots are disregarded. Function ParallelPacking (outlined in Alg. 1 and Fig.  2 ) pre-processes packed ciphertext operands to prepare subsequent input to functions ParallelAdderSubtractor (Alg. 2) and ParallelEquality (Alg. 3). Three outputs resulting from the pre-processing stage are P , G and C T . All three ciphertexts are required for input to Alg. 2 whereas only C T is required for Alg. 3. Importantly, Alg. 2 also requires that the MSB for each data slot to be zero in order to work correctly, leaving the remaining (n − 1) bits to encode signed or unsigned real numbers. Masking of the MSBs is shown in Fig. 2 (and represented in Alg. 1, lines 4, 15) as a constant multiplication by maskArray[] applied to all packed propagate and generate data slots (resp. P and G). Subsequently, both ciphertexts are right shifted by one bit (line 16) to move the zeroed bit to the LSB of the adjacent right data slot (in G the LSB represents c in ). All sub-operations except addition require corresponding inputs to be prepared for KSA subtraction. Both ParallelAdderSubtractor and ParallelEquality can flexibly process any n-bit number groups, not just powers of 2. The latter case however makes most efficient use of available slot resources. As a further optimisation, if the maximum bit length of numbers in all data slots is l < n, then this knowledge can reduce the multiplicative depth and computational cost of both parallel functions (eg. depth of log 2 l instead of log 2 n ). This property is used to significantly reduce the parameter costs of an accumulator function discussed in Section 6.1. for i ← 0 to s/n do for each data slot 4:
c0Array.insert
xorArray.insert(n · [0]) 10:
end if 12:
end for 13:
return CT , G, P 19: end function
Implementation
The following section demonstrates acceleration of various sample privacy-preserving applications, by integrating the aforementioned parallel algorithms. The focus is on applications that are able to sort or find the min-max from a group of encrypted numbers. Depth optimised and unoptimised versions of both applications types will be illustrated. The main aim is to demonstrate broad utility, versatility and effectiveness of these parallel algorithms. Many other privacypreserving applications and computations could potentially benefit from block execution of multiple binary operations, including other sorting algorithms. As a secondary aim, the high degree of acceleration achieved by the parallel algorithms yields very competitive performances on sorting and min-max compared to more specialised privacy-preserving applications of the same type found in the literature. Performance characteristics will be limited to multiplicative depth and parallel sorting times under a SWHE scheme, leaving analysis of number of recryptions and comparisons of larger sorting sets under a FHE scheme as future work.
Algorithm 2 Parallel Kogge-Stone Adder-Subtractor
INPUT: G, P , CT and l, where G, P , CT are packed ciphertexts with s slots and s/n data slots, packed according to Algorithm 1. l is the maximum input bit length contained in all n-bit data slots (l n). OUTPUT: CT : a ciphertext with component-wise addition or subtraction performed on every data slot pair. 1: function ParallelAdderSubtractor (G, P, CT , l) G, P , CT -ciphertext memory objects 1,2,3 2:
for i ← 0 to log 2 l do 3:
tmp ← G ciphertext memory object 4 4:
tmp ← P 8:
end if 10:
end for 11:
return CT = CT ⊕ G 12: end function
Sorting
The following demonstrators will look at accelerating sorting methods including the odd-even transposition sort and the Direct Sorting algorithm of Ç etin's work [18] . Both are ideal candidates for the parallel primitive algorithms described in this paper.
Odd-Even Transposition Sort
This simple bubble sort variant alternately compares odd and even pairs in an array of numbers in parallel. The steps for sorting 4 numbers is illustrated in Fig. 3(a) . The algorithm is guaranteed to terminate after N parallel comparison steps. Normally the N − 1 comparisons in each iteration are performed in parallel so the overall complexity is O(N ). The algorithm is readily implemented using a block of PPCs with all sub-operations set to 'less than' for each comparison iteration. Numbers are compared to their neighbours by shifting one of two identical ciphertexts packed with (n-bit) real numbers, by n bits relative to the other ciphertext. Subsequently, the parallel comparison operators are applied using SIMD. Overall this requires N applications of the parallel KSA subtractor algorithm (circuit depth of log 2 n ) to sort an array of N numbers. Each iteration however would require preprocessing of inputs according to Algorithm 1, in addition to selection and swapping of alternate odd and even pairs within the packed ciphertext, which adds a total constant circuit depth of approximately 5 per iteration. Altogether a naive approach to sorting N x n-bit real numbers in one packed ciphertext has a minimum circuit depth of [N · ( log 2 n + 5)]. Such an unoptimised sorting method would require recryption to sort a single packed ciphertext block due to the high circuit depth. Furthermore, sorting large number sets will further add to the
Algorithm 3 Parallel Equality
INPUT: CT and l, where CT is a packed ciphertext with s slots and s/n data slots, packed according to Algorithm 1. l is the maximum input bit length contained in all n-bit data slots (l n). OUTPUT: CT : a ciphertext with component-wise equality performed on every data slot pair. Results are located in the MSB of each data slot of CT . 1: function ParallelEquality(CT , l)
CT -ciphertext memory object 1 2:
if i = log 2 l − 1 and (l mod 2 i ) = 0 then 4:
shif tamt = l mod 2 i 5:
shif tamt = 2
tmp ← CT ciphertext memory object 2 9:
CT = CT · (tmp shif tamt) 10:
return CT 12: end function circuit depth, which involves computing parallel (SIMD) odd-even transposition sort spanning across multi-threaded packed ciphertext blocks. This particular sorting method will not be further considered in this paper.
Matrix Comparison Sort Ç etin's work implemented two shallow circuit sorting algorithms, Direct and Greedy Sort, which both initially make use of a comparison matrix and have an overall circuit depth of O(log 2 N + log 2 n). Only the former sorting algorithm will be examined for PPC integration. The comparison matrix, m (γ) ij for a given encrypted input vector, E(X) = X (α) 0 , ..., X (α) N −1 is described as:
where i, j < N , i < j, and m ii = 0 (the diagonal elements). The lower triangular part of the matrix, j < i is computed as m
The Direct Sort algorithm relies on calculating the sum of all elements within each matrix column, which returns the index of each X j in the sorted array. For further details see [18] .
The parallel version replicates the comparison matrix by splaying out its rows across blocks of packed ciphertexts, which are readily processing by the PPCs. Two rows of ciphertexts per block contain the input combinations required to compute only the upper triangular part of the matrix, while the corresponding inputs for the lower triangular part are all 0. The comparison matrix's diagonal elements are removed, thereby reducing the number of comparisons required to N · (N − 1) instead of N 2 . These steps are represented in Fig. 3(c) . As shown, sorting an array of four numbers requires setting up all relevant input combinations across twelve n-bit data slots. In this case, all number combinations are contained in two input ciphertexts within one block. Parallel 'less than' sub-operations are performed on all input data slots by applying the PPCs to compute the upper triangular portion of the matrix. Remaining values are calculated by adding 1 (XOR) to the partial matrix and transforming the relevant data slots to complete the lower triangular portion of the matrix. Transposing half the comparison matrix in this way involves N − 1 stages of slot shifts and insertions of XORed values at periodic intervals from the partial to completed matrix, represented by data slots within packed ciphertexts A and B respectively (as shown in Fig. 3(c) ). The shift amount of ciphertext A and the periodic slot insertions to ciphertext B required at each stage are represented in Table 2 . The steps in each stage are performed in parallel using SIMD and the algorithm can work to transform data slots across multiple ciphertext blocks. The algorithm has an overall circuit depth of one constant multiplication per stage due to the mask operation on ciphertext B.
Computing the index of every encrypted element in the sorted array requires adding the comparison matrix columns as shown in Fig. 3(c) -(A + B + C). The 
corresponding columns are aligned with N −2 shifts of the data slots by N ·n each shift. Elements in aligned columns require binary addition to compute the sorted indices. This accumulator operation (counting of elements) can be performed in parallel with an application of the PPCs, with all sub-operations set to 'add'. Since the highest index count resulting from the accumulator is expected to be N − 1, the circuit depth of one application of the KSA is log 2 ( log 2 (N − 1) ) instead of log 2 n (note l < n in Alg. 2). The accumulator function is applied N −2 times, therefore the overall circuit depth becomes (N − 2) · log 2 ( log 2 (N − 1) ). This circuit depth is relatively high. We consider instead using Ç etin's approach to calculate the Hamming distance of the matrix columns by replacing the KSA with a Wallace Tree Adder (WTA) [36] to perform the accumulation. The latter adder itself is not computed in parallel using SIMD, but since it is applied n times -once for each matrix column, this part can be multi-threaded. The WTA has an overall minimal circuit depth of log 2 (N/2). Finally, the computed sorted index values need to be converted into an actual sorted array, which requires comparing each sorted index value, σ N −1 and is described by Ç etin as follows:
The equality expression in equation 4 can be computed in parallel with a second application of the PPCs, where all sub-operations are set to 'equal'. Similar to the accumulator, the maximum number of bits to be compared for equality are log 2 (N − 1) , therefore the overall minimum multiplicative depth for the equality circuit becomes log 2 ( log 2 (N − 1) ).
Overall, there are two separate applications of the PPCs to compute a sorted array. Multiple PPCs processed in blocks are required to span the entire comparison matrix. Computations occur in groups of N data slots within blocks, called segments. For instance, to sort 8 x 32-bit real numbers using ciphertexts with 630 slots and 630/32 = 19 available data slots, will result in a maximum of 19/8 = 2 segments per ciphertext block. Thus, sorting 8 numbers requires a comparison matrix spanning 7 segments across 4 blocks. Sorting 16 x 32-bit numbers requires 15 blocks using one segment per block. Similarly, multiple blocks are required to process the equality circuit in addition to the comparison matrix. As blocks are processed independently, there is significant potential for multi-threaded block processing to further accelerate sorting times. Data slot preparations across single or multiple blocks add more operations, such as slot shifting and masking, to achieve equivalent bit manipulations compared to unpacked ciphertexts (ie. encoding individual bits or numbers). These factors increase overall circuit depth estimations in homomorphic procedures involving packed ciphertexts.
Min-Max
Depth unoptimised Min-Max To simplify analysis of a depth unoptimised min-max circuit, we initially look at processing encrypted real numbers encoded within one ciphertext block. In this case, two ciphertexts are packed with identical data slot inputs. As shown in Fig. 3(b) , slots of the lower ciphertext are progressively shifted by 2 k in each stage towards the MSB. 'Less than' comparison sub-operations are computed in parallel from the previous stage's output using a single application of the PPCs per stage. In each stage, the PPCs' output is used as a selection mask to choose data slots from either upper or lower ciphertext inputs depending on whether the final desired output is a minimum or maximum value, respectively. Each stage therefore has a minimum circuit depth of log 2 n + 1. The MSB will contain the final min-max output after log 2 N stages, resulting in an overall minimum circuit depth of log 2 N · ( log 2 n + 1).
To determine the min-max value over large number sets, the 'less than' PPCs are first repeatedly applied across B multiple ciphertext blocks using a log 2 B stage binary tree similar to [30] . This process has a minimum circuit depth of log 2 B · ( log 2 n + 1). Output from this stage is subsequently reduced onto one ciphertext block which can be processed finally by the singleblock min-max circuit of similar depth; resulting in an overall minimum depth of ( log 2 B + log 2 N ) · ( log 2 n + 1). Despite the relative lower complexity of depth unoptimised min-max compared to the corresponding sorting algorithm, recryption would likely be required to process one ciphertext block of inputs. We therefore turn our attention to the depth optimised min-max algorithm and do not further consider the unoptimised version in this paper.
Matrix Comparison Min-Max
The depth optimised version of min-max is a derivation of Ç etin's Direct Sort algorithm. All steps are identical to matrix comparison sorting described in Section 6.1, except during the final stage where all matrix columns are multiplied instead of added, as shown in Fig. 3(c) -(A·B ·C). Multiplication generates a mask pointing to the index of the maximum value for the number array input. If the comparison matrix is inverted beforehand (all data slots in ciphertext B are XORed), the output of multiplication points to the minimum value. Multiplication can be performed firstly across blocks using a binary tree structure, then across segments within blocks using SIMD, resulting in a minimum circuit depth of log 2 (B · N ) in the final stage. This stage replaces the combined accumulator and equality circuit stages present in sorting, resulting in an overall reduced circuit depth for the optimised version of min-max compared to sorting. Multi-threaded processes could accelerate homomorphic multiplication of multiple ciphertext blocks at each binary tree level (with the lowest potential closest to the root), in addition to the multi-threaded computation of the comparison matrix blocks.
Evaluation
We evaluate performance of the accelerated matrix comparison sort against Ç etin's standard open source implementation 6 , in addition to evaluating the depth optimised version of min-max. Our implementations are in C++ and use the HElib library for homomorphic operations with multi-threading turned on, and compiled using the NTL library version 10.5.0 and GMP version 6.1.2. All simulations are conducted using a m2.4xlarge AWS EC2
7 execution environment, which is reported to use an Intel Xeon E5-2686 v4 processor with 8 vCPU cores at 2.30GHz and 32GB RAM, running Ubuntu 16.04.3 LTS. As shown in Table 3 , parameters are derived to ensure that the security of HElib's BGVbased SWHE scheme is conservatively higher than that of the NTRU-based LTV-SWHE scheme used in Ç etin's work. It should be noted that matching the security of differing SWHE schemes is a tricky problem [37] , including having potential issues surrounding NTRU's security assumptions [38, 39] . The security level of all conducted tests are upwards from 108-bit, estimated according to [27] . HElib defaults are used for parameters not listed in Table 3 . Table 4 displays sorting and min-max performance statistics on 4, 8 and 16 elements encoding 31-bit real numbers (1 bit less due to ciphertext packing structure). Average execution times are shown for each algorithm, where 20 runs are conducted for each array size test. This amount is adequate given the low coefficient of variance shown for executions times within each test (normally < 0.01). Depth optimised algorithms are accelerated using multi-threaded operations across blocks involving PPCs, the impact of which is explicitly demonstrated (multi-threading 'off' vs 'on' in Table 4 ). Although WTA accumulator multi-threading could potentially accelerate N repeated Hamming distance computations of the matrix columns, this was not done to allow focus on parallel effects of the PPCs alone. For comparison purposes, Ç etin's sorting code is run within the same execution environment as the PPC implementation. Table 4 demonstrates a significant speed up of sorting performance by a factor of approximately 25-30 times as a result of batched SIMD binary operations. The circuit depth associated with each array size input is higher in the PPC version due to overheads associated with processing packed ciphertexts. Ciphertext sizes are also compared for each array size. In the case of Ç etin's matrix sort, despite being 1 10 the size comparatively, one ciphertext is generated per bit of input, therefore N · n ciphertexts are processed initially during input (∼ 335MB for N =16, 32-bit numbers). In the PPC matrix sort, only 2 ciphertexts per block, B are necessary to process inputs according to Alg. 1 (∼ 180MB for N =16, 32-bit numbers). Subsequently, only 4 · B and 2 · B ciphertexts are required to compute Algs. 2 and 3 respectively. Processing fewer packed ciphertexts effectively reduces memory and communication costs associated with a homomorphic application. 
