GPUHElib and DistributedHElib: Distributed Computing Variants of HElib, a Homomorphic Encryption Library by Frame, Ethan Andrew
GPUHELIB AND DISTRIBUTEDHELIB: DISTRIBUTED COMPUTING VARIANTS
OF HELIB, A HOMOMORPHIC ENCRYPTION LIBRARY
A Thesis
presented to
the Faculty of California Polytechnic State University,
San Luis Obispo
In Partial Fulfillment
of the Requirements for the Degree
Master of Science in Computer Science
by
Ethan Frame
June 2015
c© 2015
Ethan Frame
ALL RIGHTS RESERVED
ii
COMMITTEE MEMBERSHIP
TITLE: GPUHElib and DistributedHElib:
Distributed Computing Variants of HElib,
a Homomorphic Encryption Library
AUTHOR: Ethan Frame
DATE SUBMITTED: June 2015
COMMITTEE CHAIR: Zachary N J Peterson, Ph.D.
Assistant Professor of Computer Science
COMMITTEE MEMBER: John Clements, Ph.D.
Associate Professor of Computer Science
COMMITTEE MEMBER: Robert Easton, Ph.D.
Professor of Mathematics
iii
ABSTRACT
GPUHElib and DistributedHElib: Distributed Computing Variants of HElib, a
Homomorphic Encryption Library
Ethan Frame
Homomorphic Encryption, an encryption scheme only developed in the last five years,
allows for arbitrary operations to be performed on encrypted data. Using this scheme, a
user can encrypt data, and send it to an online service. The online service can then per-
form an operation on the data and generate an encrypted result. This encrypted result is
then sent back to the user, who decrypts it. This decryption produces the same data as if
the operation performed by the online service had been performed on the unencrypted
data. This is revolutionary because it allows for users to rely on online services, even un-
trusted online services, to perform operations on their data, without the online service
gaining any knowledge from their data.
A prominent implementation of homomorphic encryption is HElib. While one is able
to perform homomorphic encryption with this library, there are problems with it. It, like
all other homomorphic encryption libraries, is slow relative to other encryption systems.
Thus there is a need to speed it up. Because homomorphic encryption will be deployed on
online services, many of them distributed systems, it is natural to modify HElib to utilize
some of the tools that are available on them in an attempt to speed up run times. Thus two
modified libraries were designed: GPUHElib, which utilizes a GPU, and DistributedHE-
lib, which utilizes a distributed computing design. These designs were then tested against
the original library to see if they provided any speed up.
iv
ACKNOWLEDGMENTS
Thanks to:
• Andrew Guenther, for uploading this template
• Dr. Easton and Dr. Peterson for proofreading drafts of this document and giving
valuable feedback.
v
TABLE OF CONTENTS
Page
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . ix
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
CHAPTER
1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1 Homomorphic Encryption . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2.1.1 Gentry’s Design . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.2 Second Generation Designs . . . . . . . . . . . . . . . . . . . . . 5
2.1.3 FHE without Bootstrapping . . . . . . . . . . . . . . . . . . . . . 7
2.2 HElib . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.1 HElib Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 Distributed Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.1 Parallel Computing on GPU . . . . . . . . . . . . . . . . . . . . . 11
2.3.2 Distributed Computing with OpenMPI . . . . . . . . . . . . . . . . 12
3 GPUHELIB DESIGN . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1 Memory Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1.1 Mapping from CPU to GPU . . . . . . . . . . . . . . . . . . . . . 15
3.1.2 GPU Vector Management . . . . . . . . . . . . . . . . . . . . . . 16
3.2 Overflow Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2.1 Addition Overflow Considerations . . . . . . . . . . . . . . . . . . 18
3.2.2 Subtraction Overflow Considerations . . . . . . . . . . . . . . . . 19
3.2.3 Multiplication Overflow Considerations . . . . . . . . . . . . . . . 21
3.3 Pipelining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.3.1 CUDA Streams . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
vi
3.3.2 Overlapping Kernel Execution . . . . . . . . . . . . . . . . . . . . 26
3.3.3 2-Way Pipelining . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.4 3-Way Pipelining . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3.5 Stream Management . . . . . . . . . . . . . . . . . . . . . . . . . 30
4 DISTRIBUTEDHELIB DESIGN . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.1 Node Cluster Setup . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.1.1 Work Assignment . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.2 Memory Mapping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2.1 Mapping from Dispatcher Node to Compute Nodes . . . . . . . . . 36
4.2.2 Compute Node Vector Management . . . . . . . . . . . . . . . . . 37
4.3 Concurrency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.3.1 Non-Blocking Send and Receive with OpenMPI . . . . . . . . . . 38
4.3.2 Syncing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
5 EVALUATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.1 Evaluation Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.1.1 Test Program . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.1.2 HElib Timing Functions . . . . . . . . . . . . . . . . . . . . . . . 44
5.2 Testing Environment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5.2.1 GPU Testing Environment . . . . . . . . . . . . . . . . . . . . . . 46
5.2.2 Distributed Computing Testing Environment . . . . . . . . . . . . 46
5.3 GPUHElib Evaluation Results . . . . . . . . . . . . . . . . . . . . . . . . 47
5.3.1 GPUHElib Circuit Level Run Times . . . . . . . . . . . . . . . . . 47
5.3.2 GPUHElib Function Level Run Times . . . . . . . . . . . . . . . . 49
5.3.3 GPUHElib Phase Level Run Times . . . . . . . . . . . . . . . . . 53
5.4 DistrubtedHElib Evaluation Results . . . . . . . . . . . . . . . . . . . . . 61
5.4.1 DistributedHElib Circuit Level Run Times . . . . . . . . . . . . . 61
5.4.2 DistributedHElib Function Level Run Times . . . . . . . . . . . . 66
5.4.3 DistributedHElib Distribute and Wait Run Times . . . . . . . . . . 71
5.5 Evaluation Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
6 FUTURE WORK . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.1 GPUHElib Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
6.1.1 Persistent Memory in GPU . . . . . . . . . . . . . . . . . . . . . . 81
vii
6.1.2 Full Operation Implementation . . . . . . . . . . . . . . . . . . . . 82
6.2 DistributedHElib Future Work . . . . . . . . . . . . . . . . . . . . . . . . 82
6.2.1 Distributed Memory on Compute Nodes . . . . . . . . . . . . . . . 82
6.2.2 Full Operation Implementation . . . . . . . . . . . . . . . . . . . . 83
7 CONCLUSIONS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
BIBLIOGRAPHY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
APPENDICES
APPENDIX A KERNELS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
A.1 Addition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
A.1.1 Addition of two DoubleCRT objects . . . . . . . . . . . . . . . . . 89
A.1.2 Addition of a DoubleCRT object and a constant . . . . . . . . . . . 90
A.2 Subtraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
A.2.1 Subtraction of two DoubleCRT objects . . . . . . . . . . . . . . . 90
A.2.2 Subtraction of a DoubleCRT object and a constant . . . . . . . . . 91
A.3 Multiplication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
A.3.1 Multiplication of two DoubleCRT objects . . . . . . . . . . . . . . 92
A.3.2 Multiplication of a DoubleCRT object and a constant . . . . . . . . 94
APPENDIX B VECTOR MANAGEMENT . . . . . . . . . . . . . . . . . . . . . . 98
B.1 Device Vector Management . . . . . . . . . . . . . . . . . . . . . . . . . . 98
B.2 Compute Node Buffer Management . . . . . . . . . . . . . . . . . . . . . 99
APPENDIX C CONCURRENCY MANAGEMENT . . . . . . . . . . . . . . . . . 100
C.1 Device Stream Management . . . . . . . . . . . . . . . . . . . . . . . . . 100
C.2 Synchronization Management . . . . . . . . . . . . . . . . . . . . . . . . 101
viii
LIST OF TABLES
Table Page
5.1 Serial HElib circuit level run times (in seconds) . . . . . . . . . . . . . . 48
5.2 GPUHElib circuit level run times (in seconds) . . . . . . . . . . . . . . . 48
5.3 Serial HElib function level run times (in seconds) . . . . . . . . . . . . . 50
5.4 GPUHElib function level run times (in seconds) . . . . . . . . . . . . . . 50
5.5 GPUHElib Add phase level run times (in seconds) . . . . . . . . . . . . 54
5.6 GPUHElib Sub phase level run times (in seconds) . . . . . . . . . . . . . 55
5.7 GPUHElib Mul phase level run times (in seconds) . . . . . . . . . . . . 56
5.8 Serial HElib circuit level run times (in seconds) . . . . . . . . . . . . . . 62
5.9 DistributedHElib circuit level run times (in seconds) on 4 nodes . . . . . 62
5.10 DistributedHElib circuit level run times (in seconds) on 8 nodes . . . . . 63
5.11 DistributedHElib circuit level run times (in seconds) on 16 nodes . . . . . 63
5.12 Serial HElib function level run times (in seconds) . . . . . . . . . . . . . 67
5.13 DistributedHElib function level run times (in seconds) on 4 nodes . . . . 67
5.14 DistributedHElib function level run times (in seconds) on 8 nodes . . . . 67
5.15 DistributedHElib function level run times (in seconds) on 16 nodes . . . . 67
5.16 DistributedHElib distribute run times (in seconds) on 4 nodes . . . . . . 72
5.17 DistributedHElib sync run times (in seconds) on 4 nodes . . . . . . . . . 72
5.18 DistributedHElib distribute run times (in seconds) on 8 nodes . . . . . . 72
5.19 DistributedHElib sync run times (in seconds) on 8 nodes . . . . . . . . . 72
5.20 DistributedHElib distribute run times (in seconds) on 16 nodes . . . . . . 73
5.21 DistributedHElib sync run times (in seconds) on 16 nodes . . . . . . . . 73
ix
LIST OF FIGURES
Figure Page
2.1 HElib Type Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3.1 Data Mapping from CPU to GPU . . . . . . . . . . . . . . . . . . . . . 16
3.2 Moduli Mapping from CPU to GPU . . . . . . . . . . . . . . . . . . . . 16
3.3 Serial GPU Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.4 Concurrent GPU Execution with 3 Streams . . . . . . . . . . . . . . . . 25
3.5 GPU 2-Way Pipelining . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.6 GPU 3-Way Pipelining . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.1 Rolling Round Robin Example with More Nodes than Data Pieces . . . . 34
4.2 Rolling Round Robin Example with Less Nodes than Data Pieces . . . . 35
4.3 Data Mapping from Dispatcher to Compute Nodes . . . . . . . . . . . . 36
4.4 Moduli Mapping from Dispatcher to Compute Nodes . . . . . . . . . . . 37
5.1 Run Time Comparison at Circuit Level . . . . . . . . . . . . . . . . . . 48
5.2 Add Run Times Comparison at Function Level . . . . . . . . . . . . . . 50
5.3 Sub Run Times Comparison at Function Level . . . . . . . . . . . . . . 51
5.4 Mul Run Times Comparison at Function Level . . . . . . . . . . . . . . 52
5.5 Add Phase Level Run Times Comparison - Operation . . . . . . . . . . . 54
5.6 Add Phase Level Run Times Comparison - Memory . . . . . . . . . . . 55
5.7 Sub Phase Level Run Times Comparison - Operation . . . . . . . . . . . 56
5.8 Sub Phase Level Run Times Comparison - Memory . . . . . . . . . . . . 57
5.9 Mul Phase Level Run Times Comparison - Operation . . . . . . . . . . . 58
5.10 Mul Phase Level Run Times Comparison - Memory . . . . . . . . . . . . 59
5.11 Run Time Comparison at Circuit Level on 4 Nodes . . . . . . . . . . . . 62
5.12 Run Time Comparison at Circuit Level on 8 Nodes . . . . . . . . . . . . 63
5.13 Run Time Comparison at Circuit Level on 16 Nodes . . . . . . . . . . . 64
x
5.14 Add Run Times Comparison at Function Level . . . . . . . . . . . . . . 68
5.15 Sub Run Times Comparison at Function Level . . . . . . . . . . . . . . 69
5.16 Mul Run Times Comparison at Function Level . . . . . . . . . . . . . . 70
5.17 Add Third Level Run Times Comparison - Distribute . . . . . . . . . . . 73
5.18 Add Third Level Run Times Comparison - Sync . . . . . . . . . . . . . . 74
5.19 Sub Third Level Run Times Comparison - Distribute . . . . . . . . . . . 75
5.20 Sub Third Level Run Times Comparison - Sync . . . . . . . . . . . . . . 76
5.21 Mul Third Level Run Times Comparison - Distribute . . . . . . . . . . . 77
5.22 Mul Third Level Run Times Comparison - Sync . . . . . . . . . . . . . . 78
xi
CHAPTER 1
INTRODUCTION
In the last five years the design and development of a new encryption scheme that could
enhance the level of security on the internet has exploded. That encryption scheme is
known as Homomorphic Encryption.
First conceived over thirty years ago, and finally designed in 2009 by Craig Gentry,
homomorphic encryption is a revolutionary encryption scheme because it allows for com-
putation to be performed on encrypted data. This means that a user can encrypt their data,
and send it to a service. That service can then perform an operation, and send the en-
crypted result back. Upon decryption by the user, the result received will be exactly the
same as if the data had not been encrypted at all, and the operation had been computed on
the unencrypted data. The added benefit of this system however, is that the user data was
never known by the service, thus the user can be assured that their information remains
secret. By putting this encryption scheme on online services, user data can be passed from
online service to online service without the user being worried of their information being
known.
1
A few implementations of homomorphic encryption have been design in recent years.
One of the most prominent is HElib. This library however, like all homomorphic encryp-
tion libraries, is not currently in use because it suffers from slow run times. Thus, there is
a desire to speed it up.
Because the target audience for these schemes is online services (many of which are
designed as distributed systems), it makes since to try and modify HElib to take advan-
tage of these systems. Thus two modified libraries were designed in the hope they would
perform better than the original library. GPUHElib, which utilizes a GPU; and Distribut-
edHElib, which uses a distributed computing design, were both designed in the hope that
they would provide run time improvements over HElib.
Unfortunately, these modified libraries fail to provide any speedup, as we will show
later. These designs exhibit the same pitfalls that other distributed systems do. Mainly
memory transfer speeds are too slow, which cause huge slowdowns compared to the origi-
nal unmodified library. However there could be hope for them, given further work.
2
CHAPTER 2
BACKGROUND
2.1 Homomorphic Encryption
Homomorphic encryption is a form of encryption that allows computations to be per-
formed on ciphertexts, thus generating an encrypted result, which when decrypted matches
the result of operations performed on the plaintext. For example, the numbers 4 and 5
could be encrypted to A and B. Let C = A + B. When C is decrypted, its value will be 9. So
common operations, like addition and multiplication, can be performed on encrypted data,
and produce the same result as if the data was not encrypted in the first place. This is a
desired feature in encryption schemes, because it allows encrypted data to be passed from
online service to online service, each service performing operations on the data, without
the online service knowing what the data is.
Currently, for an online service to perform an operation for a user, the online service
must know what the data is. Thus any online service will know what data a user is giving
them. However with homomorphic encryption, an online service (even an untrusted on-
3
line service), can perform operations on user data, without the user being worried about
their data being known or exposed.
A fully homomorphic encryption (FHE) scheme has long been sought after [14]. For a
scheme to be fully homomorphic, any arbitrary operation must be able to be executed, and
still produce the correct results. It may seem like many operations need to be supported if
any arbitrary operation can be performed, but in reality a fully homomorphic scheme need
only support both addition and multiplication, as every other operation can be derived
from those two. Partially homomorphic encryption schemes, schemes that only support
one operation, have been known to exist since the 1970s. A few schemes that are known
to be partially homomorphic are Unpadded RSA, ElGamal, Goldwasser-Micali, Benaloh,
and Paillier. All of these schemes only support one operation, either addition or multipli-
cation, but not both. It took more than 30 years before a fully homomorphic scheme was
designed by Craig Gentry.
2.1.1 Gentry’s Design
Craig Gentry proposed the first fully homomorphic encryption scheme in 2009 utilizing
lattice-based cryptography [6]. His scheme supported both addition and multiplication,
from which any arbitrary operation could be derived.
Gentry started by designing what was called a somewhat homomorphic encryption
(SWHE) scheme. A SWHE scheme supports arbitrary operations, but could only com-
4
pute a limited number of operations. This is because the scheme uses a noise factor when
representing a ciphertext. When an operation is performed the noise in the representa-
tion grows. If the noise grows too large, then the ciphertext becomes incorrect when de-
crypted. Gentry then took this SWHE scheme and added bootstrapping, meaning it could
evaluate its own decryption circuit. This bootstrapping procedure allowed for the cipther-
texts to be “refreshed”, where the noise would be decreased, thus allowing for more op-
erations to take place. Finally Gentry proved that a bootstrappable SWHE scheme can be
made into a FHE scheme, by continually performing the bootstrapping procedure when
the noise reaches a certain limit.
Gentry’s scheme was the first construction of a FHE scheme, however impractical
when implemented [7], because the ciphertext size and computation time increased sharply
as the security level increased. It could take upwards of 30 minutes to compute operations
on a single bit, for large security levels. Numerous improvements [17, 15, 16, 8] were of-
fered to try to optimized the solution, however new techniques were required to produce a
much more efficient scheme, which created a second generation of designs.
2.1.2 Second Generation Designs
Many researchers worked to create a second generation of FHE schemes [4, 1, 13, 9].
These new schemes relied on the Learning with Errors (LWE) problem and all featured
much slower growth of the noise factor compared to Gentry’s original design. There were
5
two optimizations found during this time that lead to a breakthrough design: ciphertext
packing and modulus switching.
Ciphertext packing [2] is a technique that allows for multiple plaintexts to be en-
crypted and placed into a single ciphertext. With these new schemes based on the LWE
problem, the ciphertexts must be large in order to satisfy security concerns. These large
ciphertexts cause the operations being performed to be slow. When a single plaintext en-
crypts to a single large ciphertext, then it is evident that there will be a cost to efficiency.
By packing the ciphertexts, allowing a vector of plaintexts to be encrypted into a single
ciphertext, the cost to efficiency is almost negated. These ciphertexts can then be operated
on component-wise in a SIMD (single instruction, multiple data) fashion.
Modulus switching [5] is a technique that allows the noise present in a ciphertext to
be decreased without performing the bootstrapping operation. Each ciphertext in these
schemes is relative to a modulus, p. Given a ciphertext c mod p, one can transform it into
a ciphertext c’ mod p’, which will have a lower noise factor, without knowing the secret
key. By using modulus switching after every multiplication (the operation responsible
for the largest noise increase) and by choosing the moduli carefully, the noise factor after
multiplication will be unchanged. This technique allowed the largest modulus to grow
linearly with multiplicative depth, which was a large improvement over previous systems.
These techniques were discovered separately by different researchers, before they
were combined to create a single scheme, the Brakerski-Gentry-Vaikuntanathan (BGV)
6
scheme, which allowed for a FHE scheme that did not need bootstrapping.
2.1.3 FHE without Bootstrapping
The BGV scheme [3] combined the two techniques described above to create a FHE
scheme that did not need a bootstrapping operation in order to perform arbitrary oper-
ations. This was a breakthrough because the bootstrapping operation was the one that
cost the most run time in other similar schemes. Even though the scheme does not need
bootstrapping, the authors do have a bootstrapping procedure for this scheme, but it is
used as an optimization, not as a necessary component in making this scheme fully ho-
momorphic. This scheme also has a SIMD design, which will allow for the possibility of
speedup, seen in later chapters. This is the scheme that HElib is built off of.
2.2 HElib
HElib [11, 12, 10] is an open source implementation of the BGV scheme. It was devel-
oped by Shai Halevi and Victor Shoup in 2014, and is designed to be a very low-level
library intended for research purposes.
The intention of our work is to enhance the run time when performing operations in
this library. Thus the creation, encryption, and decryption of plaintexts and ciphertexts
was not examined, as those operations will likely occur on the users end, not on the online
service’s end. The design of the addition and multiplication operations are the areas fo-
7
Figure 2.1: HElib Type Hierarchy
cused on for this work. Before addressing the changes present in the modified libraries, it
is necessary to understand the current serial implementation of HElib.
2.2.1 HElib Design
Let ∗ be the operation being performed (here ∗ could stand for any of the operations, all
of them are handled similarly) and A and B be the ciphertexts being operated on. The exe-
cution of the operation A= A ∗ B requires a few steps.
A and B are stored as Ctxt objects in HElib. Figure 2.1 shows the type hierarchy for a
Ctxt object. Ctxt objects have one important variable, parts, a vector containing multi-
ple CtxtParts. These parts constitute the ciphertext. The operations supported by Ctxt
are the addition, subtraction, and multiplication of two Ctxt objects. Each of these oper-
ations use parts during the execution of the operation, thus the operations in CtxtPart
are called.
CtxtPart is an extension of the class DoubleCRT, which is where the operations are
implemented. Listing 2.1 is an excerpt from the function that performs the operations.
8
Listing 2.1: Add, Sub and Mul operations of two DoubleCRT objects
. . .
cons t I n d e x S e t& s = map . g e t I n d e x S e t ( ) ;
long phim = c o n t e x t . zMStar . getPhiM ( ) ;
f o r ( long i = s . f i r s t ( ) ; i <= s . l a s t ( ) ; i = s . n e x t ( i ) ) {
long p i = c o n t e x t . i t h P r i m e ( i ) ;
v e c l o n g& row = map [ i ] ;
cons t v e c l o n g& o t h e r r o w = (∗ o t h e r m a p ) [ i ] ;
f o r ( long j = 0 ; j < phim ; j ++) {
row [ j ] = fun . a p p l y ( row [ j ] , o t h e r r o w [ j ] , p i ) ;
}
}
. . .
As the index set is iterated over, the ith prime is extracted along with the ith row from
the maps. Even though the map is accessed like an array, it is an unordered map, with the
array access syntax for convenience. These rows are then iterated over, applying the oper-
ation to each element. This is where the SIMD design is occurring, a double for loop to
add, subtract or multiply two vectors together. This is where the modifications and possi-
ble run time speedups can occur by using distributed system techniques.
2.3 Distributed Systems
Distributed systems is a field of computer science that deals with computer systems per-
forming concurrent computation to achieve a goal. These systems can be as tight as a
9
single computer running multiple threads or as loose as a group (cluster) of computers
(nodes) connected via a message passing interface all over the world. What these systems
have in common is that they are all connected and working to achieve a single goal.
The benefit of using a distributed computing system is the possibility for concurrent
computation. Non-distributed computing systems are limited to only serial execution of
programs. This means that if, for example, the system were tasked with adding two vec-
tors together, it would have to loop over all entries one after the other and add each indi-
vidually. For large vectors, this could be time consuming. Each individual operation is
independent, and can therefore be performed at the same time. This is the purpose of a
distributed system. By partitioning the data, and assigning a portion of the work to each
node in the system, concurrent computation can be performed. For large vectors, this may
result in a speedup in the run time, because of the concurrent execution.
The caveat of using a distributed computing system is the possibility for large over-
head times, that can slow down computation. Time can be lost because of the added op-
erations needed to facilitate the concurrent computation. The data must be partitioned
and sent to the node that is performing the operation, which can cost time, if the means
of transfer between nodes is slow, because in the original serial design, this did not occur.
So for large vectors, it may be faster to perform the operation because of the concurrent
execution, however in order to be able to perform the concurrent execution, some setup is
required, that might cause the overall run time to be slower than the original serial execu-
10
tion. This is the trade off when working with distributed systems. Much work is done to
reduce the amount of overhead in distributed systems, but they all suffer from it.
Distributed systems can be classified into two categories: parallel computing or dis-
tributed computing. A parallel computing system allows for shared memory, meaning that
the processors all have access to a common memory which can be used to exchange infor-
mation between processes. A common example of parallel computing is a graphics pro-
cessing unit (GPU), because it is a single piece of hardware that has a common memory
with many processors operating at once. In distributed computing systems, each proces-
sor has private memory and a message passing interface is used to exchange information
between processes. A common example of this is the internet. Each computer attached to
the internet has its own private memory and uses message passing to communicate with
other machines on the internet. A common message passing interface is OpenMPI, which
allows for the creation and running of a distributed computing system.
2.3.1 Parallel Computing on GPU
GPUs were designed to manipulate images for output on a display. Because of this, their
design was such that they would perform operations in parallel over every pixel need-
ing to be manipulated. Each pixel, or piece of data, was given its own compute core, that
could be executed concurrently with every other compute core, which is how parallel ex-
ecution was achieved. Expanding beyond manipulating images, one can see that this ap-
11
proach to computation can be applied to any circuit where the input is discrete and the
same operation is applied to each piece of the input.
Going back to the example given earlier, each element in the vectors can be assigned
to a separate core. Then each core can be executed simultaneously, and the result will be
generated. For large vectors, a loop’s run time would increase as the size of the vector in-
creased. However, by using a GPU, the run time would be the same for any size vector
(because the operation being performed was always the same, and because all the cores
were executed at the same time). It makes no difference if 20 or 20,000 cores were ex-
ecuted, they would both take the same amount of time. This has allowed GPUs to be
used for many more purposes than just manipulating images, and in recent years to help
speedup the run times of SIMD algorithms.
2.3.2 Distributed Computing with OpenMPI
Distributed computing systems are a cluster of machines all linked through a message
passing interface. One would use a distributed computing system when computation can
not be performed on a single system alone because either the input is too large or the
computation will take too long. By partitioning the input and assigning each node a por-
tion of the work, the task can be completed, where before it could not.
A widely used message passing interface is OpenMPI. OpenMPI allows for the cre-
ation of a distributed cluster with a single call, and provides functions to send and receive
12
data from other nodes in the cluster easily. This allows for the partitioning and scheduling
of work on a distributed system to be easily achieved.
Again addressing the example given earlier, depending on the number of nodes in the
cluster, the vector can be partitioned so that each node gets roughly the same amount of
elements. Each node can then compute the addition operation on only the portion of the
vectors it is assigned. This allows for the concurrent execution of the addition operation,
thus decreasing the run time compared to the serial computation. Distributed computing
systems have become the design used by online services in recent years because they can
service multiple customers at a single time, which is a desired feature when working on
the internet.
13
CHAPTER 3
GPUHELIB DESIGN
HElib is one of the only implementations of homomorphic encryption. It suffers (like all
implementations of homomorphic encryption) from slow run times compared to other
standard schemes. Thus there is a desire to improve HElib by speeding it up. Because
HElib is meant to be deployed on online services, it is natural to try an utilized hardware
available on them. Thus the idea for GPUHElib.
GPUHElib is a variant of HElib, which attempts to speed up the run time of HElib by
utilizing a GPU to parallelize operations. GPUs are often used to speed up computation
where a single instruction or operation is performed on multiple pieces of data. GPUs are
ideal for these types of designs because they allow for many compute cores to be run si-
multaneously, each performing the same operation. HElib utilizes a single instruction,
multiple data (SIMD) design, however is single threaded. Meaning that, while being de-
signed so a single operation occurs over multiple data pieces, the library is not efficiently
utilizing the design to best effect. The hope then of adding GPU functionality to the li-
brary is to thus take advantage of this design, by utilizing hardware that will best handle
14
the SIMD nature of the scheme.
There are three phases when executing operations on a GPU. First the memory is
copied from the host(CPU) to the device(GPU). Then a program(kernel) is created, which
performs the operation on the data in the device’s memory. Finally, the data is copied
back from the device to the host, upon which the host continues execution. The first and
third phases are discussed further in Section 3.1, and the kernel designs are addressed
in Section 3.2. Furthermore, these three phases can be parallelized to achieve the fastest
speedup, discussed in Section 3.3.
3.1 Memory Mapping
In order for the GPU to execute a kernel, it must have the data in a 1D vector. This re-
quires that the data be mapped from its current storage model into a 1D vector. There are
two pieces of data that are required to be mapped when performing an operation: the data
that the operation will be performed on, and the moduli.
3.1.1 Mapping from CPU to GPU
Currently the data is stored as shown in Figure 3.1. The map contains vectors or rows,
each of these rows are arrays of 64-bit integers. This structure is a non-contiguous 1D
vector that needs to be mapped to a contiguous 1D vector. Thus the rows must be copied
into a new vector, which the GPU will then operate on. Each successive row is concate-
15
Figure 3.1: Data Mapping from CPU to GPU
Figure 3.2: Moduli Mapping from CPU to GPU
nated to the preceding rows, thus creating a 1D vector.
Similarly the moduli are being stored as individual elements. In order to be used dur-
ing execution on the GPU, they must also be mapped to a 1D vector. Figure 3.2 shows
the current storage model for the moduli. Each successive modulus is concatenated to the
preceding moduli, thus creating a 1D vector.
3.1.2 GPU Vector Management
Naively creating and freeing vectors on the GPU when the operation is run slows down
run times because allocating and freeing memory on the GPU takes time. Creating a few
16
vectors at the beginning of execution, and maintaining them throughout the programs life-
time, allows for the most efficient memory usage, and greatest speedup.
Four vectors are created and maintained throughout the lifetime of the program. The
first is a vector of size num rows× size o f row. This is the 1D vector that the data from
the first DoubleCRT object is being copied into. The second is another vector of size num rows×
size o f row, which holds the data that is being copied from the other DoubleCRT object.
The third is a vector of size num rows. This vector holds the moduli. The last vector is
also of size num rows, and is used when computing over a single DoubleCRT object and
a constant number. The constant number, mod the moduli, is stored in this last vector. The
sizes, num rows and size o f row, rarely change, thus there will be little memory reallo-
cation occurring, and reallocation will only occur when the vector is too small, not when
it is larger than needed. The function that handles initializing and reallocation of these
vectors is found in Appendix B.1.
3.2 Overflow Considerations
Arithmetic overflow occurs when the result of an arithmetic operation is greater than the
magnitude of the storage location that is being used to store the value. For example, using
4-bit unsigned integers, the largest possible value that can be stored is 24− 1 = 15. Thus
when adding 12+ 14 = 26, an overflow error will occur, because 26 requires 5 bits to
store.
17
The type of the data being operated on in HElib is 64-bit signed integers. Meaning
they can take values from −(263) to 263− 1. However, these values will never be nega-
tive, thus the actual range of these values is 0 to 263−1. The largest modulus could be the
largest possible value, 263− 1 and the largest value being operated on could be 263− 2.
The reason the largest value being operated on is one less than the largest value, 263− 1,
is because the value has to be smaller than the modulus, thus (263− 1)− 1 = 263− 2.
In CUDA, the language used on NVIDIA GPUs, (the language this implementation is
written in) the largest variable type has a length of 64 bits. The type that can contain the
largest value is uint64_t, unsigned 64-bit integers. This type can store values from 0 to
264−1.
Six operations were designed, as they are the most commonly used operations. Ad-
dition, subtraction, and multiplication of a DoubleCRT with another DoubleCRT and ad-
dition, subtraction, and multiplication of a DoubleCRT with a constant number. Further
considerations for overflow prevention for each operation is discussed in the following
sections.
3.2.1 Addition Overflow Considerations
When considering the overflow prevention for addition, it is necessary to compute what
the largest value could possibly be. As noted above, the largest a value could be is 263−
2 and the largest modulus is 263− 1. Performing the addition operation on the possible
18
values,
(263−2)+(263−2) = 264−4 (3.1)
shows that the number 264− 4 needs to be computed, before the modulus operation takes
place. This number is outside the range for signed 64-bit integers, but not for unsigned
64-bit integers. Thus the original numbers must be cast to unsigned 64-bit integers (which
could cause problems if any of the numbers were negative, but since the values are al-
ways positive, there is no problem). After the addition operation takes place, the modulus
operation brings the result back down to the range of signed integers, because the modu-
lus is in the range of signed integers. The result is then finally cast back to a signed inte-
ger, and the operation is complete. The kernels for both addition operations, between two
DoubleCRT objects and between a DoubleCRT object and a constant are in Appendix A.1.
3.2.2 Subtraction Overflow Considerations
Similar to addition, it is necessary to compute the worst case scenario when considering
overflow prevention for subtraction. Again, the largest a value could be is 263− 2 and the
largest modulus is 263− 1. There are two scenarios to consider for subtraction: the first
number being 263−2 and the second number being 0 and the first number being 0 and the
second number being 263−2. Performing the subtraction operation for both scenarios,
(263−2)−0 = 263−2 (3.2)
19
0− (263−2) =−(263−2) (3.3)
shows that all the computations can be completed using signed integers, because all num-
bers in the above equations are within the range of signed integers. Thus there does not
need to be any step taken to prevent overflow for this design.
With this design, to ensure the resultant value is greater than 0, a check is made af-
ter the subtraction operation takes place, to determine if the result is less than 0. If so, the
modulus is then added to the value, which will result in the value being larger than 0. This
check will cause a branch to occur in the GPU, which could slow down run time. Alterna-
tively, instead of performing a check to determine if the result is less than 0, the modulus
can be added to the first value, then the second value is subtracted (shown in the below
equation), which is a different approach to the subtraction operation. Equation 3.3 is rede-
fined below, with this procedure applied.
(0+(263−1))− (263−2) = (263−1)− (263−2) = 1 (3.4)
Now when the modulus operation occurs, the correct result is found, however none of the
values throughout the calculation were ever negative. Thus there is no need to perform a
check, avoiding the branch and possible slow down of the run time. This method however
does result in an overflow issue, because the modulus is being added to the first value.
Below is this approach applied to Equation 3.2.
((263−2)+(263−1))−0 = (264−3)−0 = (264−3) (3.5)
20
This equation generates the value 264− 3, which is too large for the signed integer range.
Thus the same procedure used for addition is performed here, to ensure overflow preven-
tion. The original numbers are cast to unsigned 64-bit integers. The operation detailed
above takes place, before the modulus operation brings the result back down to the signed
integer range. The result is then cast back to a signed integer, and completes the opera-
tion. The kernels for both subtraction operations, between two DoubleCRT objects and
between a DoubleCRT object and a constant are in Appendix A.2.
3.2.3 Multiplication Overflow Considerations
Multiplication presents many more problems compared to addition and subtraction. Again
the largest value possible is 263− 2, with the largest modulus being 263− 1. Performing
the multiplication operation on these values,
(263−2)∗ (263−2) = 2126−265+4 (3.6)
shows that very large numbers must be generated when performing the multiplication op-
eration. Where addition and subtraction results could fit in a possible data type (unsigned
64-bit integer), these values will not fit in any data type available in CUDA. Therefore an
algorithm must be used, which will break the original numbers into smaller pieces. These
pieces will then be used during intermediary steps to generate other values; that, when
combined back together will result in the correct answer, without ever generating a value
that cannot fit in the GPU. The algorithm that is used is Karatsuba’s algorithm.
21
Karatsuba’s Algorithm
x= x1Bm+ x0
y= y1Bm+ y0
z2 = x1y1
z1 = x1y0+ x0y1
z0 = x0y0
xy= (x1Bm+ x0)(y1Bm+ y0)
= z2B2m+ z1Bm+ z0
(3.7)
Equation 3.7 shows Karatsuba’s algorithm in general. The values being multiplied
are x and y. They are broken into pieces x1, x0 and y1, y0 respectively. These pieces are
then used to create z2, z1, and z0, which are finally combined with the base number, Bm, to
generate the original result. For this case, B= 2 and m= 32. These values will ensure that
operations performed throughout the execution of this algorithm never become greater
than the maximum 64-bit unsigned integer value. Each of these will be examined in-depth
below to see this.
When x = 263− 2, the variables x1 and x0 have values x1 = 231− 1 and x0 = 232− 2.
The exact same values are assigned to y1 and y0 since y= x= 263−2.
22
Computing z2,
z2 = x1y1
= (231−1)(231−1)
= 262−232+1
(3.8)
shows that z2 can fit inside a signed 64-bit integers, since the largest possible value is less
than 263−1 (the largest value possible for signed 64-bit integers).
Computing z1,
z1 = x1y0+ x0y1
= 2[(231−1)(232−2)]
= 2[263−233+2]
= 264−234+4
(3.9)
shows the intermediate pieces can be computed using signed 64-bit integers, however, the
addition operation causes the result to be in the range of unsigned 64-bit integers. Thus
this calculation requires casting the pieces to unsigned 64-bit integers, carrying out the
operation to calculate z1, then performing the modulus operation to bring z1 back down to
the signed 64-bit integer space.
Computing z0,
z0 = x0y0
= (232−2)(232−2)
= 264−234−4
(3.10)
23
shows that z0 must be calculated using unsigned 64-bit integers, since the result is too
large for signed 64-bit integers. Thus this calculation also requires casting the pieces to
unsigned 64-bit integers, performing the multiplication, before calculating the modulus to
bring the result back into the signed 64-bit integer range.
Now that each of the intermediate pieces have been addressed, the final piece needs
consideration. Computing xy,
xy= z2B2m+ z1Bm+ z0
= (262−232+1)(264) + (263−234+5)(232) + (263−234−3)
(3.11)
shows there will be problems computing z2B2m and z1Bm. However these can be dealt
with deterministically. By performing a loop, multiplying z2 and z1 by 2, 64 and 32 times
respectively, performing the modulus operation after every multiplication, one can obtain
the correct value, without exceeding the unsigned 64-bit integer limit. Final modulus op-
erations are applied to each addition, finally resulting in the correct value. This value is
cast back to a signed 64-bit integer, and the operation is complete. The kernels for both
multiplication operations, between two DoubleCRT objects and between a DoubleCRT
object and a constant, are in Appendix A.3.
3.3 Pipelining
As mentioned at the beginning of this chapter, there are three phases when performing op-
erations on a GPU. First, the data is copied from the host to the device, then the kernel is
24
Figure 3.3: Serial GPU Execution
Figure 3.4: Concurrent GPU Execution with 3 Streams
executed, and finally the data is copied back from the device to the host. The copy opera-
tions are handled by the copy engine, a processor that specifically deals with copying data
back and forth from the host and device. Kernels are executed by the kernel engine. This
operation is serial, meaning that all the data is copied from the host to device, before the
kernel is executed. And the memory is not copied back, until all of the kernels have fin-
ished executing. This is illustrated in Figure 3.3. CUDA allows for these operations to be
parallelized, using streams.
25
3.3.1 CUDA Streams
CUDA streams are a sequence of operations that execute in order on a GPU. Operations
in different streams can execute concurrently, and be interleaved. Figure 3.4 shows this
process applied to the same operations shown in Figure 3.3. One can see that this allows
for speedup, because the copy to/from host to device can be executed while the kernels
are being executed. Streams are useful for parallelizing the computation, however the or-
der in which the operations are dispatched also plays a role.
3.3.2 Overlapping Kernel Execution
There are two ways of dispatching GPU operations, all at once or by batching similar op-
erations together.
All At Once Method
The first approach launches all the operations at once, shown in Listing 3.1.
Listing 3.1: Operations launched all at once
for ( i n t i = 0 ; i < nSt reams ; i ++) {
i n t o f f s e t = i ∗ s t r e a m S i z e ;
cudaMemcpyAsync(& d a [ o f f s e t ] , &a [ o f f s e t ] , s t r e a m B y t e s ,
cudaMemcpyHostToDevice , s t r e a m [ i ] ) ;
k e r n e l<<<s t r e a m S i z e / b l o c k S i z e , b l o c k S i z e ,
0 , s t r e a m [ i ]>>>(d a , o f f s e t ) ;
cudaMemcpyAsync(&a [ o f f s e t ] , &d a [ o f f s e t ] , s t r e a m B y t e s ,
cudaMemcpyDeviceToHost , s t r e a m [ i ] ) ;
26
}All three phases are launched successively on the same stream, before the next stream’s
operations are launched.
Batching Method
The second approach is to launch similar operations together, instead of all at once. This
is show in Listing 3.2.
Listing 3.2: Operations batched
for ( i n t i = 0 ; i < nSt reams ; ++ i ) {
i n t o f f s e t = i ∗ s t r e a m S i z e ;
cudaMemcpyAsync(& d a [ o f f s e t ] , &a [ o f f s e t ] , s t r e a m B y t e s ,
cudaMemcpyHostToDevice , s t r e a m [ i ] ) ;
}
f o r ( i n t i = 0 ; i < nSt reams ; ++ i ) {
i n t o f f s e t = i ∗ s t r e a m S i z e ;
k e r n e l<<<s t r e a m S i z e / b l o c k S i z e , b l o c k S i z e ,
0 , s t r e a m [ i ]>>>(d a , o f f s e t ) ;
}
f o r ( i n t i = 0 ; i < nSt reams ; ++ i ) {
i n t o f f s e t = i ∗ s t r e a m S i z e ;
cudaMemcpyAsync(&a [ o f f s e t ] , &d a [ o f f s e t ] , s t r e a m B y t e s ,
cudaMemcpyDeviceToHost , s t r e a m [ i ] ) ;
}
In this second approach, all the copy from host to device operations are launched on their
respective streams, then the kernels are executed, and finally the device to host copies are
dispatched.
27
Figure 3.5: GPU 2-Way Pipelining
Both of these methods will produce the same result, as they are executing the same
commands. However, depending on the hardware being used, one method might achieve a
better speedup over the other.
3.3.3 2-Way Pipelining
Older GPU hardware only have a single copy engine and a single kernel execution en-
gine. Figure 3.5 shows both methods of kernel execution compared to serial execution.
As one can see, both provide a speed up, however the batching method provides a greater
speedup over the all at once method. This is because the all at once method launched the
second copy from host to device after the first copy from device to host. Because there is
only one copy engine, and the engine executes operations in the order they were launched,
the copy engine must execute the copy from device to host, before the next host to device
28
Figure 3.6: GPU 3-Way Pipelining
copy can occur. Using the second method though, the copies from host to device are all
launched before the copies from device to host, so they call all execute one after the other.
This allows for the most efficient pipelining, and the greatest speedup on hardware that
only has one of each engine.
3.3.4 3-Way Pipelining
Newer GPU hardware have both a copy from host to device engine and a copy from de-
vice to host engine, as well as the kernel engine. The two methods under this new hard-
ware configuration are shown in Figure 3.6. Here one can see that again both methods
provide an overall speedup, however for this case, the all at once method achieves the
29
greatest speedup. Even though the first device to host copy was issued before the sec-
ond host to device copy in the all at once method, the second host to device copy can be
executed earlier than it because of the two copy engines. This is what is allowing the all
at once execution method to be faster than the batching method. The batching method is
suffering from a design choice in the scheduler of the GPU. The GPU tries to execute the
kernels concurrently, and in doing so, delays a signal telling the device to host engine to
starting copying until all the kernels have finished execution.
3.3.5 Stream Management
Naively creating and destroying streams as they are needed causes avoidable run time
slowdown. By managing the available active streams, one can achieve the most efficient
design. For any instance of operation, the number of active streams is num rows in the
map. This number will rarely change, thus initially creating num rows amount of streams,
and only creating more streams when there are not enough, efficiently maintains the fewest
amount of needed streams at all times. Appendix C.1 contains the functions used to man-
age the streams.
30
CHAPTER 4
DISTRIBUTEDHELIB DESIGN
HElib is a prominent implementation of homomorphic encryption. However, it suffers
(like other implementations of homomorphic encryption) from slow run times. Therefore,
it must be speed up, before it will be used anywhere. HElib is meant to be deployed on
online services, most of which are designed as distributed systems. Thus there was the
idea to utilize the distributed system design present on these online services to facilitate
the run time speed up.
DistributedHElib attempts to speed up the run time of HElib by utilizing a distributed
system design. This is done by utilizing a cluster of compute nodes to parallelize opera-
tions. A cluster of nodes is a connected group of machines that communicate with each
other to achieve a common task. By delegating separate work to each machine, work can
be split between many machines, instead of a single machine handling the entire work
load. By doing this, run times can decrease, especially if each node is working on com-
pleting an independent task, which helps complete an overall global task.
HElib utilizes a single instruction, multiple data (SIMD) design, meaning that a sin-
31
gle instruction or operation is performed over many pieces of independent data. Unfor-
tunately, HElib does not take advantage of this design, because it is only single-threaded.
Meaning that while the operation has the potential to be concurrently computed, it is be-
ing serial computed. With a distributed design then, the hope is to split the data up, send it
to compute nodes, have those nodes perform the operations, and send the results back.
Having each node work simultaneously on separate pieces of the data can allow for a
speedup in run time compared to only having a single machine work on all the data.
For our design design, a master-slave architecture was chosen. This means there will
be one node, the dispatcher node, controlling the other nodes, the compute nodes. The
dispatcher node will be running HElib and when needed, will assign work to the compute
nodes.
There are a few phases when executing operations in a distributed computing envi-
ronment. First the cluster must be setup, and nodes must be designated as compute or
dispatcher nodes. Then the dispatcher node must assign work to the compute nodes. As
part of assigning work, the dispatcher node must partition the data, and send the pieces to
the respective compute nodes. The compute nodes must then perform the operations, and
send the data back. Finally the dispatcher node collects all the results and stores them,
before returning to regular execution. The cluster setup and work assignment phases are
discussed in Section 4.1. The partitioning of the data is discussed in Section 4.2. Finally
the methods by which the data are transmitted between nodes is discussed in Section 4.3.
32
4.1 Node Cluster Setup
Upon start up of a cluster of nodes, each must be assigned a job. For this design a master-
slave architecture is used. This means that one node must be designated the master node,
which will be the dispatcher node, and the others are designated slave nodes, or compute
nodes. The dispatcher node is the node responsible for running the serial portion of HE-
lib, and distributing the data to the worker nodes when a distributed part of computation
is reached. The compute nodes just wait for instructions from the dispatcher node, and act
accordingly when given tasks.
When starting a cluster with OpenMPI, the distributed computing communication in-
terface, each node is assigned a number, starting at 0 through num nodes− 1. Node 0
becomes the dispatcher node, and the others are compute nodes. The dispatcher node then
returns to normal program execution, while the compute nodes wait for messages from
the dispatcher node.
When the dispatcher node reaches a point of execution that is meant to be distributed,
it partitions the data (discussed in Section 4.2) and assigns each compute node a piece of
the data to operate on. The manner in which the dispatcher node chooses what data the
compute node will operate on is examined next.
33
Figure 4.1: Rolling Round Robin Example with More Nodes than Data Pieces
4.1.1 Work Assignment
After the data has been partitioned, it must be assigned to a compute node to be operated
on. The scheme used to choose the next compute node for a piece of data is round-robin
scheduling. This scheme allows for an even or almost even distribution of the data across
all the compute nodes. Having an even or almost even distribution means the lowest run
times and best efficiency. There are two cases to consider when determining if a distribu-
tion scheme is efficient: when there are more compute nodes than data pieces, and when
there are fewer compute nodes than data pieces.
For the first case, with the round-robin scheduling, this means that some nodes will
not be working, while others are. For this design a rolling round-robin design is used.
Meaning for any operations the next node to be assigned work will always be the node
assigned work the longest time ago. This node has the highest probability of being free
34
Figure 4.2: Rolling Round Robin Example with Less Nodes than Data Pieces
and ready to receive more work, compared to all the others. Figure 4.1 shows an example
of rolling round-robin scheduling applied to this case.
For the second case, with the round-robin scheduling, some nodes will have multiple
pieces of data to operate on. By using the round-robin scheduling though, the amount of
work done by each compute node should be about equal, and thus evenly spread over the
compute nodes. If work was unequally proportioned to a single compute node, compared
to others, then the dispatcher node might have to wait longer for the results before contin-
uing normal execution. This way the greatest run time and efficiency is achieved. Figure
4.2 shows an example of rolling round-robin scheduling applied to this case.
35
Figure 4.3: Data Mapping from Dispatcher to Compute Nodes
4.2 Memory Mapping
For compute nodes to execute, they must have a portion of the data to work on. This re-
quires the dispatcher node to partition the data from its current storage model into pieces
before they are sent to the compute nodes to be worked on. There are two pieces of data
that need to be mapped: the data that the operation is being performed on, and the moduli.
4.2.1 Mapping from Dispatcher Node to Compute Nodes
Currently data is stored as shown in Figure 4.3. The map contains vectors or rows, each
of these rows are arrays of 64-bit integers. The rows present a great partitioning point.
Splitting the data up by these rows, and sending individual rows to each compute node to
be operated on is the logical splitting point because there will always be about the same
number of rows during execution, whereas the size of the rows might change often. Also,
splitting rows would cause more communication between the nodes, which could slow
36
Figure 4.4: Moduli Mapping from Dispatcher to Compute Nodes
down run times.
Similarly, the moduli are being stored as individual elements. Because each moduli
is assigned to each row, and the rows are being assigned to a single compute node, each
moduli must only be sent to the specific compute node that the row it corresponds to is
on. Figure 4.4 shows the mapping process. Each modulus is only sent to the compute
node that its corresponding row is sent to.
4.2.2 Compute Node Vector Management
Naively creating and freeing buffers on the compute nodes that will receive the data from
the dispatcher node slows down run times. By creating a few buffers, and maintaining
them throughout the programs lifetime, the most efficient memory usage is achieved,
along with the greatest speedup.
37
Two buffers are created and maintained throughout the lifetime of the program. Both
buffers are of size size o f row. These are the buffers that the rows will be copied into on
the compute nodes. The size, size o f row, rarely changes, thus there will be little mem-
ory reallocation occurring. Also, reallocation will only occur when the buffer is too small,
not when it is larger than needed. The buffer will only ever grow, not shrink, thus cutting
down on the occurrences of reallocation needing to be performed. The function that han-
dles initializing and reallocation of these buffers is found in Appendix B.2.
4.3 Concurrency
Concurrency for a distributed system means that each node is executing operations si-
multaneously. To achieve concurrency in this system, the dispatcher node must be able
to assign work, and not have to wait for a response before assigning more work, and the
compute nodes must be able to perform computations at the same time. For all of these
operations to happened concurrently, the communication between the nodes must be non-
blocking.
4.3.1 Non-Blocking Send and Receive with OpenMPI
Blocking send and receive functions requires the data to be completely sent or received
before continuing execution. This means if a node were to call receive, it would wait
until it received data, before continuing execution. This can both be beneficial and detri-
38
mental, depending on the needs of the design. Non-blocking send and receive functions
however schedule requests, and then continue execution. These requests will later be
filled, but execution can continue. A request can be generated and even if the data has
not finished sending or been completely received, execution can continue. This request
can then be tested, and once the data has been sent or received, the request will be filled.
Both methods have their uses, discussed next.
Send and Receive on Compute Nodes
For the compute nodes, blocking send and receive functions are used, because it is nec-
essary for the buffer holding the result of the operation to be sent back to the dispatcher
node, before the next operation occurs. If the next operation occurred before the dis-
patcher node received the data, the buffer could be cleared or overwritten, and data sent
back would be incorrect.
Send and Receive on Dispatcher Node
For the dispatcher node, non-blocking send and receive functions are used. This allows
the dispatcher node to continue assigning work, even if the compute nodes have not fully
received their assignment. Also, the dispatcher node will use non-blocking receive func-
tions in order to receive data back as quickly as possible. If the dispatcher node used
blocking receive functions, then it could be waiting on a node that is taking a long time
to perform an operation, causing other compute nodes to wait that might have already fin-
39
ished computation, and may have pending computation that they could be moving onto.
Only using non-blocking receive functions could cause problems for the dispatcher
node, if, for example, two operations are performed, and the second requires the results
from the first. Without any mechanisms to ensure the result to the first operation is re-
ceived, before the execution of the second operation, unknown results can be computed.
Therefore there is a need for a syncing mechanism.
4.3.2 Syncing
A syncing mechanism will cause the dispatcher node to wait for all pending requests to
be completed before continuing execution. In this way, it can be ensured that a single op-
eration is fully completed before moving onto other operations. To keep track of these
requests, a queue is used. When a new request is created, through either a call to send
or receive, it is added to the end of the queue. When the sync function is called, each
of these is tested to see if they have completed. If a request has been completed, then
it is removed from the queue, if not, the function moves onto the next in the queue, and
checks it. The function only returns when all the pending requests have been filled, when
the queue is empty. The sync function can be found in Appendix C.2. The sync function
is performed after every operation(addition, subtraction, or multiplication), to ensure the
operation has completed, before moving onto the next operation.
40
CHAPTER 5
EVALUATION
The evaluation of this work has two objectives. First, to make sure that the modified de-
signs still produce correct output. Meaning that the result of an operation when decrypted
is the same for both the modified and unmodified versions of the library. Secondly, to pro-
file each design and compare run times of the modified libraries to the unmodified library.
These time comparisons will occur at multiple levels to best understand each design, and
how they compare to the serial version.
Distributed systems achieve the best efficiency when working on large inputs, not
small ones. This is because there is some overhead associated with setting up and dis-
tributing work. For GPU designs, this overhead time happens when transferring the data
to and from the GPU. For distributed computing designs, the overhead time is also in the
memory transfers like the GPU, but instead of transferring to the GPU, the memory trans-
fers are between machines. This overhead costs time, that when working with small in-
puts, usually takes even longer than the operation to complete. Thus the distributed design
causes a run time slow down compared to the serial version for small input sizes. How-
41
ever, when working with large amounts of data, the time saved to complete the operation
is so great, that the overhead costs are worth it. This characterization of distributed sys-
tems is prevalent for these designs, which will be seen later. For these encryption systems,
large input sizes occur when size o f row is large. For any distributed system, large has
a variable meaning that is relative to the system being examined. It could be anywhere
from a few hundred thousand to a few billion. For these libraries, large is defined to be
above a few hundred thousand. As discussed in the design chapters, size o f row is the
length of the vectors in map. To best demonstrate the effect size o f row has on the sys-
tem, size o f row is steadily increased during the testing, so its effect on each design can
be seen and compared to the original design.
To evaluate these designs a few profiling tools were used, discussed in Section 5.1.
The test environments used for each design are detailed in Section 5.2. The results for
GPUHElib are examined in Section 5.3, followed by the results for DistributedHElib in
Section 5.4. Finally some conclusions are drawn regrading both of these designs in Sec-
tion 5.5
5.1 Evaluation Tools
The following tools were used to evaluate the correctness of the modified libraries and
record run times at various levels in the library.
42
5.1.1 Test Program
A test program was created based on testing programs provided with the original unmodi-
fied version of HElib. This new test program first sets up the ciphertexts that are used dur-
ing computation. Two ciphertexts are created, both are the integers, 0 to num slots in plaintext.
Three operations are reported on in this document, of the six implemented. They are
addition, subtraction, and multiplication of one ciphertext with another ciphertext. The
other three operations supported, addition, subtraction, and multiplication of a ciphertext
and a number, were not reported, as preliminary tests showed that they displayed similar
results to their ciphertext-only counterparts. It was decided that for conciseness and to
avoid repetition, that they would be left out of this document.
The program requires that the user pass in the size o f row they would like to use,
which as discussed above, will be incremented during testing. The test program then per-
forms each operation, checking the decrypted result after each operation to make sure that
the results are correct, before moving on. Timing blocks were placed around each oper-
ation to record the overall time it took to perform the operation. The timers were printed
out after each operation. Lower level timers, discussed below, were reset after each oper-
ation, so the lower level times reported with each operation were only times recorded for
that operation.
This test program was compiled twice for both designs, first linking against the un-
43
modified library, and second against the modified library. This produced two executables,
that were then run to generate the results described below.
5.1.2 HElib Timing Functions
The standard version of HElib provides fine grained timing functions that can be placed
anywhere throughout the library. To utilize these functions is simple. First a timer is cre-
ated. Upon the creation of the timer, it is started. A call to stop is made in code when
the timer should stop. A timer can be started and stopped multiple times, and the aver-
age time will be recorded. The timers are stored in a map, and can be reset if needed.
This setup allows for fine grained measurement of functions and detailed profiling of run
times.
Each design, serial HElib, GPUHElib, and DistributedHElib required the timers be
placed at some similar places, for comparison purposes, and some distinct places, in order
to assess the efficiency of each unique design.
Serial HElib Timer Placement
There were two levels at which timers were placed, with each successive level more fine
grained. The first level was in the test program, described above. This is the circuit level.
The other timer was placed at the function level, inside the function that performed the
operations in DoubleCRT. This function was where the double for loop was located that
44
both of the distributed designs are trying to improve upon.
GPUHElib Timer Placement
For GPUHElib, there are three levels at which timers were placed. As described above,
the first timer was placed at the circuit level in the test program. This records the time that
the entire operation took. The second level of timing was the function level. At this level
a timer was placed in the function that is performing the operations in DoubleCRT. These
first two timer levels allow for comparison against the serial version, as the serial version
also has timers placed at these levels. The third and lowest level is the phase level. Four
timers are placed at this level to record the setup (vector and stream creation), phase one
(host to device memory transfer), phase two (GPU computation), and phase three (device
to host memory transfer) times. The timers at this level allow for the times gathered at
level two to be broken down even further, and allow each phase of level two to be exam-
ined closer.
DistributedHElib Timer Placement
For DistributedHElib, there are three levels at which timers were placed. The first at the
circuit level in the test program. The second at the function level, in DoubleCRT, where
the operation is being performed. These first two timer levels are necessary for the com-
parison against the serial version, as the serial version also has timers at these two levels.
The third consists of two timers: the distribute (where the job assignment and data parti-
45
tioning happens) and the wait (where the sync function is called, and the dispatcher node
is waiting for the compute nodes to finish and send back their results). This third level
allows for a breakdown of the function level for further analysis.
5.2 Testing Environment
Both systems required unique testing environments that had the capabilities needed for
each design. Both variants used machines with 64-bit Intel Core 2 Duo CPUs running at
3.0 Ghz with about 4 Gb of DDR2, 800 Mhz RAM.
5.2.1 GPU Testing Environment
GPUHElib was tested on a machine with a NVIDIA Quadro NVS 290 GPU, which has
256 MB of RAM. Also this particular GPU only has one copy engine. Thus the tests re-
ported here are using the 2-Way Pipelining design discussed in Section 3.3. CUDA ver-
sion 6.5 was used.
5.2.2 Distributed Computing Testing Environment
DistributedHElib was tested on a cluster of machines all connected through Ethernet.
OpenMPI version 1.8.5 was used. Three cluster configurations were used, one with 4
nodes, one with 8 nodes, and one with 16 nodes.
46
5.3 GPUHElib Evaluation Results
As discussed earlier in this chapter, GPUHElib has three levels of timing information be-
gin recorded. The first, and highest level, is the circuit level, where the high level opera-
tion is being computed. The second level is the function level, inside DoubleCRT where
the parts are being operated on. And the lowest level is the phase level, where timing re-
sults are recorded for all four phases of the operation. The timing results for each of these
levels is discussed in more detail in the following sections.
5.3.1 GPUHElib Circuit Level Run Times
Table 5.1 and Table 5.2 display the run times for serial HElib and GPUHElib tests respec-
tively. Both tests were run with inputs sizes starting at 1,000 and increasing until 400,000.
Figure 5.1 visualizes these times as the slowdown of GPUHElib over serial HElib.
A value of 1 means that the serial version and the GPU version had the same run time.
Above 1 means that the GPU design has a slower run time, and below 1 means that the
GPU has a faster run time.
One can see in Figure 5.1 that for the smaller input sizes, the run times for the GPU
are much larger than the serial version for addition and subtraction. They started off tak-
ing about 72 times as long to complete, compared to the serial version. The multiplica-
tion operation also starts off taking longer, however only about 2.2 times as long. The run
47
size o f row
1,000 10,000 100,000 200,000 300,000 400,000
Add 1.400E-05 1.080E-04 2.545E-03 5.396E-03 5.354E-03 1.053E-02
Sub 1.400E-05 1.300E-04 2.532E-03 5.304E-03 5.366E-03 1.075E-02
Mul 4.962E-03 1.036E-01 1.157 2.648 7.217 1.232E+01
Table 5.1: Serial HElib circuit level run times (in seconds)
size o f row
1,000 10,000 100,000 200,000 300,000 400,000
Add 1.050E-03 1.675E-03 8.972E-03 1.646E-02 1.671E-02 3.704E-02
Sub 1.012E-03 1.805E-03 9.023E-03 1.683E-02 1.682E-02 3.700E-02
Mul 1.106E-02 1.187E-01 1.309 2.868 7.393 9.333
Table 5.2: GPUHElib circuit level run times (in seconds)
1,0
00
10
,00
0
10
0,0
00
20
0,0
00
30
0,0
00
40
0,0
00
80
70
60
50
40
30
20
10
0
Sl
ow
do
w
n
Add
Sub
Mul
Figure 5.1: Run Time Comparison at Circuit Level
48
times for serial HElib and GPUHElib get closer and closer, as the inputs sizes approach
400,000. The addition and subtraction circuit run times minimize at about 3 times as long,
for the 300,000 size input. However for the 400,000 size input, the times go in the oppo-
site direction desired, becoming about 3.5 times as long. The multiplication circuit actu-
ally has the best results, with the 400,000 test taking about .75 times the serial version.
The multiplication circuit took about 3/4 the time to complete in GPUHElib compared to
the serial version. While this result might look good, further analysis of the lower level
tests show that this was probably not caused by the usage of the GPU, but by other oper-
ations computed during the multiplication operation being faster. The function level run
times will be examined next.
5.3.2 GPUHElib Function Level Run Times
Table 5.3 and Table 5.4 display the run times for the serial HElib and GPUHElib tests
respectively. As noted before, both tests were run with input sizes ranging from 1,000 to
400,000.
Figure 5.2, Figure 5.3, and Figure 5.4 show the comparisons between the run times for
each of the operations at the function level. Also displayed in the figures is the run time
slow down of the GPU variant compared to the serial version. For example, in Figure 5.2,
the 104.2x above 1,000 means that the GPU variant took 104.2 times longer to complete
compared to the serial version.
49
size o f row
1,000 10,000 100,000 200,000 300,000 400,000
Add 5.000E-06 4.825E-05 1.007E-03 2.648E-03 2.653E-03 5.466E-03
Sub 5.000E-06 6.150E-05 1.259E-03 2.644E-03 2.674E-03 5.366E-03
Mul 2.900E-05 2.830E-04 2.879E-03 5.863E-03 5.856E-03 1.176E-02
Table 5.3: Serial HElib function level run times (in seconds)
size o f row
1,000 10,000 100,000 200,000 300,000 400,000
Add 5.210E-04 8.298E-04 4.392E-03 8.257E-03 8.346E-03 1.864E-02
Sub 5.020E-04 8.950E-04 4.498E-03 8.398E-03 8.395E-03 1.848E-02
Mul 5.302E-04 1.006E-03 6.599E-03 1.273E-02 1.276E-02 2.687E-02
Table 5.4: GPUHElib function level run times (in seconds)
1,0
00
10
,00
0
10
0,0
00
20
0,0
00
30
0,0
00
40
0,0
00
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
2
·10−2
104.2x 17.2x
4.4x
3.1x 3.1x
3.4x
R
un
Ti
m
es
(i
n
se
co
nd
s)
Serial HElib GPUHELib
Figure 5.2: Add Run Times Comparison at Function Level
50
1,0
00
10
,00
0
10
0,0
00
20
0,0
00
30
0,0
00
40
0,0
00
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
2
·10−2
100.4x 14.6x
3.6x
3.2x 3.1x
3.4x
R
un
Ti
m
es
(i
n
se
co
nd
s)
Serial HElib GPUHELib
Figure 5.3: Sub Run Times Comparison at Function Level
51
1,0
00
10
,00
0
10
0,0
00
20
0,0
00
30
0,0
00
40
0,0
00
0
0.5
1
1.5
2
2.5
·10−2
18.3x 3.6x
2.3x
2.2x 2.2x
2.3x
R
un
Ti
m
es
(i
n
se
co
nd
s)
Serial HElib GPUHELib
Figure 5.4: Mul Run Times Comparison at Function Level
52
All these times again reiterate that for the smaller input sizes, the run times for the
GPU variant are vastly larger, almost 100x for addition and subtraction, and about 18x for
multiplication. As the input sizes increase, the run times get closer and closer, but mini-
mize at about 3.1x for addition and subtraction and at about 2x for multiplication. Again
one can see that for the 400,000 input size, the slow downs increase for all the operations
compared to the previous input size, 300,000, going from 3.2x to 3.4x and 3.5x respec-
tively and from 2.2x to 2.3x. The results for the multiplication times make it clear that
the speedup seen at the circuit level must be caused by other factors than the GPU imple-
mentation of multiplication. These times show that multiplication behaves exactly like the
other operations in terms of run time patterns. The cause for these slow downs is evident
after examining the recorded times at the phase level.
5.3.3 GPUHElib Phase Level Run Times
Table 5.5, Table 5.6, and Table 5.7 all display the phase level run times for each operation
respectively. The four phases are as follows: setup (vector and stream creation), phase 1
(host to device memory copy), phase 2 (operation on GPU), and phase 3 (device to host
memory copy). These times have been split between two plots for each operation. One
group of plots focuses on the overall run time of serial HElib compared to the setup and
phase 2 times recorded. These are the “Operation” plots, Figure 5.5, Figure 5.7, and Fig-
ure 5.9. The other group of plots focus on the overhead phases, phase 1 and phase 3, of
53
size o f row
1,000 10,000 100,000 200,000 300,000 400,000
Setup 2.420E-04 2.465E-04 2.848E-04 3.013E-04 3.015E-04 3.133E-04
Phase 1 9.500E-05 2.568E-04 2.137E-03 4.289E-03 4.345E-03 1.164E-02
Phase 2 4.425E-05 3.350E-05 1.283E-04 2.423E-04 2.420E-04 3.728E-04
Phase 3 1.373E-04 2.900E-04 1.837E-03 3.420E-03 3.454E-03 6.308E-03
Table 5.5: GPUHElib Add phase level run times (in seconds)
1,0
00
10
,00
0
10
0,0
00
20
0,0
00
30
0,0
00
40
0,0
00
0
1
2
3
4
5
6
·10−3
R
un
Ti
m
es
(i
n
se
co
nd
s)
Serial HElib Setup Phase 2
Figure 5.5: Add Phase Level Run Times Comparison - Operation
54
1,0
00
10
,00
0
10
0,0
00
20
0,0
00
30
0,0
00
40
0,0
00
0
0.2
0.4
0.6
0.8
1
1.2
·10−2
R
un
Ti
m
es
(i
n
se
co
nd
s)
Serial HElib Phase 1 Phase 3
Figure 5.6: Add Phase Level Run Times Comparison - Memory
size o f row
1,000 10,000 100,000 200,000 300,000 400,000
Setup 2.455E-04 2.830E-04 3.035E-04 3.065E-04 3.045E-04 3.050E-04
Phase 1 9.000E-05 2.535E-04 2.262E-03 4.377E-03 4.377E-03 1.152E-02
Phase 2 3.400E-05 4.200E-05 1.280E-04 2.420E-04 2.425E-04 3.745E-04
Phase 3 1.275E-04 3.120E-04 1.799E-03 3.466E-03 3.465E-03 6.275E-03
Table 5.6: GPUHElib Sub phase level run times (in seconds)
55
1,0
00
10
,00
0
10
0,0
00
20
0,0
00
30
0,0
00
40
0,0
00
0
1
2
3
4
5
6
·10−3
R
un
Ti
m
es
(i
n
se
co
nd
s)
Serial HElib Setup Phase 2
Figure 5.7: Sub Phase Level Run Times Comparison - Operation
size o f row
1,000 10,000 100,000 200,000 300,000 400,000
Setup 2.405E-04 2.447E-04 3.027E-04 3.200E-04 3.157E-04 3.207E-04
Phase 1 7.933E-05 1.853E-04 1.845E-03 3.955E-03 3.997E-03 1.029E-02
Phase 2 3.000E-05 3.183E-05 1.247E-04 2.398E-04 2.353E-04 3.508E-04
Phase 3 1.772E-04 5.418E-04 4.323E-03 8.213E-03 8.212E-03 1.591E-02
Table 5.7: GPUHElib Mul phase level run times (in seconds)
56
1,0
00
10
,00
0
10
0,0
00
20
0,0
00
30
0,0
00
40
0,0
00
0
0.2
0.4
0.6
0.8
1
1.2
·10−2
R
un
Ti
m
es
(i
n
se
co
nd
s)
Serial HElib Phase 1 Phase 3
Figure 5.8: Sub Phase Level Run Times Comparison - Memory
57
1,0
00
10
,00
0
10
0,0
00
20
0,0
00
30
0,0
00
40
0,0
00
0
0.2
0.4
0.6
0.8
1
1.2
·10−2
R
un
Ti
m
es
(i
n
se
co
nd
s)
Serial HElib Setup Phase 2
Figure 5.9: Mul Phase Level Run Times Comparison - Operation
58
1,0
00
10
,00
0
10
0,0
00
20
0,0
00
30
0,0
00
40
0,0
00
−0.2
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
·10−2
R
un
Ti
m
es
(i
n
se
co
nd
s)
Serial HElib Phase 1 Phase 3
Figure 5.10: Mul Phase Level Run Times Comparison - Memory
59
GPUHElib compared to the overall run time of serial HElib for each operation and are
denoted “Memory”. These are Figure 5.6, Figure 5.8, and Figure 5.10.
The “Operation” plots show the design is working as intended and as all GPU de-
signs are ideally suppose to work. The reliance on the GPU to perform the computation
drastically reduces the run time for those phases. Furthermore, as the input size increases
largely, the run times for setup and phase 2 only increase slightly.
The setup phase always took about the same amount of time no matter the input size,
only varying by about 8E-05 from the smallest input to the largest. Phase 2 times steadily
increased, however not at the rapid pace of the serial HElib versions. This is what is ex-
pected when using a GPU to perform operations.
These are the desired results when working with a GPU. The offloading of work onto
the GPU allows for the operation portions of the work to drastically decrease in run time.
Of course, these results do not characterize the overall recorded times for GPUHElib
compared to serial HElib. Therefore something else must be going wrong, causing the
run times to be longer than the their serial counterparts.
The “Memory” plots show where this design fails. The amount of time needed to
move the data back and forth from the GPU is immense. The times for phase 1 and phase
3 are always larger than the entire run time of the serial version for every operation across
all input sizes, except for the multiplication operation, where the phase 1 times are actu-
ally less than the overall run time of the serial version, but not by much. These times are
60
very disappointing, as they are the reason this design is performing so poorly. Luckily,
these times could be lower, given better hardware and possible future work, which could
make GPUHElib a viable option. If the problem was in the design or if the run times for
the setup or phase 1 were worse, then the total design would not have any hope of being
used. But because they are in the memory transfer phases, there is still hope that this de-
sign could become viable with further work.
5.4 DistrubtedHElib Evaluation Results
As discussed earlier in this chapter, DistributedHElib has three levels of timing informa-
tion begin recorded. The highest level is the circuit level in the test program. The next is
at the function level, inside DoubleCRT. The third and lowest level is the distribute and
wait level. This level has two timers which measure the distribute time, the time necessary
for the dispatcher node to assign work and partition the data, and the wait time, the time
the dispatcher node is waiting for the compute nodes to finish their work and return the
results. Additionally three cluster sizes, 4, 8, and 16 nodes, were used during testing The
timing results for each of these levels is discussed in more detail below.
5.4.1 DistributedHElib Circuit Level Run Times
Table 5.8, Table 5.9, Table 5.10, and Table 5.11 display the run times for serial HElib and
DistributedHElib tests. Both tests were run with inputs sizes starting at 1,000 and increas-
61
size o f row
1,000 10,000 100,000 200,000 300,000 400,000
Add 1.400E-05 1.070E-04 2.525E-03 5.347E-03 5.313E-03 1.048E-02
Sub 1.400E-05 1.250E-04 2.491E-03 5.308E-03 5.244E-03 1.080E-02
Mul 4.996E-03 1.030E-01 1.151 2.644 7.286 1.202E+01
Table 5.8: Serial HElib circuit level run times (in seconds)
size o f row
1,000 10,000 100,000 200,000 300,000 400,000
Add 1.457E-03 7.496E-03 8.933E-02 1.783E-01 1.550E-01 3.455E-01
Sub 1.544E-03 7.479E-03 7.773E-02 1.666E-01 1.647E-01 3.203E-01
Mul 1.254E-02 1.371E-01 1.559 3.444 8.350 1.009E+01
Table 5.9: DistributedHElib circuit level run times (in seconds) on 4 nodes
1,0
00
10
,00
0
10
0,0
00
20
0,0
00
30
0,0
00
40
0,0
00
120
100
80
60
40
20
0
Sl
ow
do
w
n
Add
Sub
Mul
Figure 5.11: Run Time Comparison at Circuit Level on 4 Nodes
62
size o f row
1,000 10,000 100,000 200,000 300,000 400,000
Add 1.712E-03 9.161E-03 1.076E-01 2.228E-01 2.266E-01 4.540E-01
Sub 1.791E-03 9.219E-03 1.117E-01 2.113E-01 2.139E-01 4.560E-01
Mul 1.350E-02 1.454E-01 1.666 3.689 8.589 1.136E+01
Table 5.10: DistributedHElib circuit level run times (in seconds) on 8 nodes
1,0
00
10
,00
0
10
0,0
00
20
0,0
00
30
0,0
00
40
0,0
00
140
120
100
80
60
40
20
0
Sl
ow
do
w
n
Add
Sub
Mul
Figure 5.12: Run Time Comparison at Circuit Level on 8 Nodes
size o f row
1,000 10,000 100,000 200,000 300,000 400,000
Add 1.772E-03 9.960E-03 1.187E-01 2.391E-01 2.390E-01 4.911E-01
Sub 1.850E-03 9.741E-03 1.194E-01 2.398E-01 2.364E-01 4.744E-01
Mul 2.484E-02 1.623E-01 1.717 3.811 8.664 1.135E+01
Table 5.11: DistributedHElib circuit level run times (in seconds) on 16 nodes
63
1,0
00
10
,00
0
10
0,0
00
20
0,0
00
30
0,0
00
40
0,0
00
140
120
100
80
60
40
20
0
Sl
ow
do
w
n
Add
Sub
Mul
Figure 5.13: Run Time Comparison at Circuit Level on 16 Nodes
64
ing until 400,000. The clusters sizes were 4, 8, and 16 nodes.
Figure 5.11, Figure 5.12, and Figure 5.13 visualize these times as the slowdown of
DistributedHElib over serial HElib. A value of 1 means that the serial version and the
distributed version had the same run time. Above 1 means that the distributed design has
a slower run time, and below 1 means that the distributed version has a faster run time.
One can see in these figures that for the smaller input sizes, the run times for addition
and subtraction across all cluster sizes are much larger for the distributed design com-
pared to the serial design. These operations take over 100 times as long to complete as
their serial counterparts. The multiplication operation also takes longer, however only
about 2.5 to 5 times as long depending on the number of nodes in the cluster.
As the input sizes increase, the addition and subtraction operation slowdowns de-
crease, until they plateau at around 35, 40, and 45 times as long for the 4, 8, and 16 node
clusters respectively. Once they reach these slowdowns, they bounce around, but never
continue on the downward trajectory they had for the first few input size increases. The
multiplication operation on the other hand always decreases as the input size increases.
As the input sizes increase, the multiplication slowdowns approach 1, and at the 400,000
size input, all three cluster sizes dive below 1. The cluster size of 4 has the best results,
having about a .84 slowdown, with the other two sizes, 8 and 16, having about a .94.
These results mean, for the 400,000 input size, the distributed variant of HElib had faster
run times than the serial version across all cluster sizes. Similar to the GPU results, while
65
these times look good, further examination of the lower level tests show that this speed up
is probably not happening because of the distributed design, but because of other factors.
Next the function level times are examined.
5.4.2 DistributedHElib Function Level Run Times
Table 5.12, Table 5.13, Table 5.14, and Table 5.15 display the run times at the function
level for serial HElib and DistributedHElib on 4, 8, and 16 nodes.
Figure 5.14, Figure 5.15, and Figure 5.16 show the comparisons between the run
times for each of the operations at the function level across all variants and cluster sizes.
Also displayed in the figures is the average run time slow down, across all cluster sizes, of
the distributed variant compared to the serial version. So, for example, in Figure 5.14, the
168.7x above 1,000 means that the distributed variant took 168.7 times longer to complete
compared to the serial version.
Again for the smaller inputs, these figures show that the distributed variant takes much
longer to complete compared to the serial version. For addition and subtraction, the op-
erations take about 170 times as long, and for multiplication, about 25 times as long. As
the input sizes increase, the slow downs do decline, however level out around the 200,000
size input. The addition and subtraction operations level out at about 40x, and the multi-
plication operation levels out at about 16x. Again one can see that the run times plateau
for the addition and subtraction operations, just as they did at the circuit level. The results
66
size o f row
1,000 10,000 100,000 200,000 300,000 400,000
Add 5.000E-06 4.800E-05 9.873E-04 2.588E-03 2.611E-03 5.457E-03
Sub 5.000E-06 5.850E-05 1.238E-03 2.646E-03 2.614E-03 5.391E-03
Mul 2.967E-05 2.838E-04 2.887E-03 5.845E-03 5.850E-03 1.183E-02
Table 5.12: Serial HElib function level run times (in seconds)
size o f row
1,000 10,000 100,000 200,000 300,000 400,000
Add 7.423E-04 3.604E-03 4.898E-02 8.503E-02 8.048E-02 1.764E-01
Sub 7.695E-04 3.735E-03 3.886E-02 8.329E-02 8.233E-02 1.601E-01
Mul 7.148E-04 3.298E-03 3.517E-02 7.759E-02 7.771E-02 1.482E-01
Table 5.13: DistributedHElib function level run times (in seconds) on 4 nodes
size o f row
1,000 10,000 100,000 200,000 300,000 400,000
Add 8.698E-04 4.427E-03 5.489E-02 1.124E-01 1.125E-01 2.186E-01
Sub 8.930E-04 4.427E-03 5.583E-02 1.056E-01 1.069E-01 2.280E-01
Mul 8.055E-04 4.032E-03 4.830E-02 9.798E-02 9.804E-02 1.951E-01
Table 5.14: DistributedHElib function level run times (in seconds) on 8 nodes
size o f row
1,000 10,000 100,000 200,000 300,000 400,000
Add 9.180E-04 7.439E-03 5.794E-02 1.166E-01 1.152E-01 2.316E-01
Sub 9.220E-04 4.866E-03 5.971E-02 1.199E-01 1.182E-01 2.372E-01
Mul 8.373E-04 4.498E-03 5.036E-02 1.025E-01 1.054E-01 2.126E-01
Table 5.15: DistributedHElib function level run times (in seconds) on 16 nodes
67
1,0
00
10
,00
0
10
0,0
00
20
0,0
00
30
0,0
00
40
0,0
00
0
5 ·10−2
0.1
0.15
0.2
0.25
168.7x 107.4x
54.6x
40.5x 39.4x
38.3x
R
un
Ti
m
es
(i
n
se
co
nd
s)
Serial HElib 4 Nodes 8 Nodes 16 Nodes
Figure 5.14: Add Run Times Comparison at Function Level
68
1,0
00
10
,00
0
10
0,0
00
20
0,0
00
30
0,0
00
40
0,0
00
0
5 ·10−2
0.1
0.15
0.2
0.25
172.3x 74.2x
41.6x
38.9x 39.2x
38.7x
R
un
Ti
m
es
(i
n
se
co
nd
s)
Serial HElib 4 Nodes 8 Nodes 16 Nodes
Figure 5.15: Sub Run Times Comparison at Function Level
69
1,0
00
10
,00
0
10
0,0
00
20
0,0
00
30
0,0
00
40
0,0
00
0
5 ·10−2
0.1
0.15
0.2
26.5x 13.9x
15.5x
15.9x 16.0x
15.7x
R
un
Ti
m
es
(i
n
se
co
nd
s)
Serial HElib 4 Nodes 8 Nodes 16 Nodes
Figure 5.16: Mul Run Times Comparison at Function Level
70
for the multiplication operation at this level show it also has this characteristic, whereas
at the circuit level, this characteristic was not observed. Thus the results observed at the
circuit level must not be the result of the distributed design, but something else. By look-
ing at the distribute and wait times at level 3, one can understand why these results are
occurring.
5.4.3 DistributedHElib Distribute and Wait Run Times
Table 5.16, Table 5.18, and Table 5.20 display the distribute run times for each operation
across the three cluster sizes. Table 5.17, Table 5.19, and Table 5.21 display the sync run
times for each operation across the three cluster sizes. These times have been split into
two plots for each operation. One group of plots focuses on the distribute times (the time
it took to partition the data and assign the work) and compares them to the overall run
time of the serial version. These are the “Distribute” plots, Figure 5.17, Figure 5.19, and
Figure 5.21. The second group of plots display the sync time (the time the compute nodes
took to receive the data, compute the results, and send the data back to the dispatcher
node) compared to the overall run time for the serial design. Figure 5.18, Figure 5.20,
and Figure 5.22 display these results, and are the “Sync” plots.
By looking at the “Distribute” plots, one can see that the partitioning of data and as-
signment of work times across all clusters sizes and operations remains constant even
when the input sizes are increased. This looks good, but remember that this part of the
71
size o f row
1,000 10,000 100,000 200,000 300,000 400,000
Add 2.240E-04 4.638E-04 2.373E-04 2.500E-04 2.438E-04 2.698E-04
Sub 2.405E-04 5.700E-04 2.470E-04 2.760E-04 2.995E-04 2.340E-04
Mul 2.015E-04 4.190E-04 2.150E-04 2.585E-04 2.668E-04 2.700E-04
Table 5.16: DistributedHElib distribute run times (in seconds) on 4 nodes
size o f row
1,000 10,000 100,000 200,000 300,000 400,000
Add 5.158E-04 3.137E-03 4.874E-02 8.478E-02 8.023E-02 1.761E-01
Sub 5.260E-04 3.162E-03 3.861E-02 8.301E-02 8.203E-02 1.599E-01
Mul 5.110E-04 2.875E-03 3.495E-02 7.733E-02 7.744E-02 1.479E-01
Table 5.17: DistributedHElib sync run times (in seconds) on 4 nodes
size o f row
1,000 10,000 100,000 200,000 300,000 400,000
Add 2.773E-04 6.288E-04 2.740E-04 2.978E-04 2.913E-04 3.055E-04
Sub 2.915E-04 7.465E-04 3.015E-04 2.950E-04 2.955E-04 3.085E-04
Mul 2.427E-04 5.405E-04 2.872E-04 3.077E-04 2.953E-04 3.198E-04
Table 5.18: DistributedHElib distribute run times (in seconds) on 8 nodes
size o f row
1,000 10,000 100,000 200,000 300,000 400,000
Add 5.903E-04 3.796E-03 5.461E-02 1.121E-01 1.122E-01 2.183E-01
Sub 5.940E-04 3.855E-03 5.553E-02 1.053E-01 1.066E-01 2.277E-01
Mul 5.608E-04 3.490E-03 4.801E-02 9.767E-02 9.774E-02 1.948E-01
Table 5.19: DistributedHElib sync run times (in seconds) on 8 nodes
72
size o f row
1,000 10,000 100,000 200,000 300,000 400,000
Add 2.985E-04 7.418E-04 2.923E-04 3.083E-04 3.515E-04 3.235E-04
Sub 3.130E-04 8.175E-04 3.270E-04 3.190E-04 3.270E-04 3.445E-04
Mul 2.548E-04 6.265E-04 2.853E-04 3.130E-04 3.157E-04 3.338E-04
Table 5.20: DistributedHElib distribute run times (in seconds) on 16 nodes
size o f row
1,000 10,000 100,000 200,000 300,000 400,000
Add 6.178E-04 6.692E-03 5.765E-02 1.163E-01 1.149E-01 2.313E-01
Sub 6.060E-04 4.045E-03 5.938E-02 1.196E-01 1.178E-01 2.368E-01
Mul 5.807E-04 3.869E-03 5.007E-02 1.022E-01 1.051E-01 2.123E-01
Table 5.21: DistributedHElib sync run times (in seconds) on 16 nodes
1,0
00
10
,00
0
10
0,0
00
20
0,0
00
30
0,0
00
40
0,0
00
0
1
2
3
4
5
6
·10−3
R
un
Ti
m
es
(i
n
se
co
nd
s)
Serial HElib 4 Nodes 8 Nodes 16 Nodes
Figure 5.17: Add Third Level Run Times Comparison - Distribute
73
1,0
00
10
,00
0
10
0,0
00
20
0,0
00
30
0,0
00
40
0,0
00
0
5 ·10−2
0.1
0.15
0.2
0.25
R
un
Ti
m
es
(i
n
se
co
nd
s)
Serial HElib 4 Nodes 8 Nodes 16 Nodes
Figure 5.18: Add Third Level Run Times Comparison - Sync
74
1,0
00
10
,00
0
10
0,0
00
20
0,0
00
30
0,0
00
40
0,0
00
0
1
2
3
4
5
6
·10−3
R
un
Ti
m
es
(i
n
se
co
nd
s)
Serial HElib 4 Nodes 8 Nodes 16 Nodes
Figure 5.19: Sub Third Level Run Times Comparison - Distribute
75
1,0
00
10
,00
0
10
0,0
00
20
0,0
00
30
0,0
00
40
0,0
00
0
5 ·10−2
0.1
0.15
0.2
0.25
R
un
Ti
m
es
(i
n
se
co
nd
s)
Serial HElib 4 Nodes 8 Nodes 16 Nodes
Figure 5.20: Sub Third Level Run Times Comparison - Sync
76
1,0
00
10
,00
0
10
0,0
00
20
0,0
00
30
0,0
00
40
0,0
00
0
0.2
0.4
0.6
0.8
1
1.2
·10−2
R
un
Ti
m
es
(i
n
se
co
nd
s)
Serial HElib 4 Nodes 8 Nodes 16 Nodes
Figure 5.21: Mul Third Level Run Times Comparison - Distribute
77
1,0
00
10
,00
0
10
0,0
00
20
0,0
00
30
0,0
00
40
0,0
00
0
5 ·10−2
0.1
0.15
0.2
R
un
Ti
m
es
(i
n
se
co
nd
s)
Serial HElib 4 Nodes 8 Nodes 16 Nodes
Figure 5.22: Mul Third Level Run Times Comparison - Sync
78
work only records the times it takes to partition the data and assign the work, not send the
data to the compute nodes. Only non-blocking sends and receives are scheduled during
this portion of the recording. The actual sending and receiving between the dispatcher and
compute nodes happens mostly during the sync portion of the recording. This is because
the non-blocking send and receive functions only schedule requests, which are later ful-
filled in the background, during the sync part. So it is to be expected that these results are
seen.
The “Sync” plots show where this design is failing. The amount of time needed to
send the data over the network to the compute nodes, have them perform the operation
and then send the data back is much greater than the serial times. This has to do with the
network speed, which causes a bottleneck, and which is responsible for the plateau char-
acteristic seen in the circuit and function level times. The network speeds are much too
slow in this environment to allow for this design to be viable. With better speeds (ones
which would not cause this bottleneck to happen) this design might provide better results.
Additionally, future work to try and minimize the amount of data transfers between the
dispatcher and compute nodes could also lead to better run times.
5.5 Evaluation Conclusions
As mentioned in the introduction of this chapter, when working with distributed systems,
the overhead time, which comes from the memory transfer phases of each design, usu-
79
ally is the downfall of distributed systems. The same is true for these designs. GPUHElib
suffered from slow transfer times between the CPU and GPU, which caused slow downs
compared to the serial version. Similarly, DistributedHElib suffered from slow network
speeds, which caused a bottleneck when transferring data between the dispatcher node
and the compute nodes. With better transfer speeds, both of these designs might be viable,
however under these circumstances, they are not.
80
CHAPTER 6
FUTURE WORK
6.1 GPUHElib Future Work
As seen in Chapter 5, GPUHElib failed to provide any speed up over serial HElib. This
was because the memory transfer times between the CPU and GPU where much too great.
Therefore all of the future work for GPUHElib is designed around trying to reduce these
times.
6.1.1 Persistent Memory in GPU
To cut down on memory transfer times, it would be useful to keep all the memory that
needs to be transferred in the current design, in the GPUs memory permanently. Thus
when the operations need to be performed, the data does not need to be transferred from
the CPU to the GPU, as the GPU already contains all the data that needs to be operated
on. This would reduce the memory transfer times down to almost zero, thus making the
run times only dependent on the operation times, which as seen in Section 5.3, were much
81
lower than the serial HElib run times.
6.1.2 Full Operation Implementation
Only the most used operations were implemented on the GPU. Because of this, the results
from the GPU operations needed to be copied back to the CPU, so that other unsupported
operation could take place. It would be beneficial then to implement all of the operations
supported by serial HElib using the GPU. Thus the results would not need to be copied
back, as all the operations that a user could perform would be supported on the GPU.
6.2 DistributedHElib Future Work
As seen in Chapter 5, DistributedHElib failed to provide any run time improvement over
serial HElib. This was a result of the network speed being a bottleneck. Thus all future
work for DistributedHElib would try to reduce amount of data sent over the network.
6.2.1 Distributed Memory on Compute Nodes
In order to rely less on the network to transfer data, one could design the system so that
the data is partitioned onto compute nodes. Thus each compute node would be respon-
sible for a piece that only they were responsible for. Then when operations occur, these
compute nodes can perform the operation on their piece of the data, without needing to
receive the data from the dispatcher node, as the data will already be present on them.
82
Thus data transfers would only happen during initial setup, and finalization, reducing the
network traffic, and removing the network bottleneck.
6.2.2 Full Operation Implementation
Similar to the future work for GPUHElib, it would be useful to have all the operations
supported on the compute nodes. Because they are not supported in the library as it is
designed now, the results of operations must be sent back to the dispatcher node, so that it
can perform one of these unsupported operations. With all operations supported however,
the data would not need to be sent back to the dispatcher node to perform the operation,
because the operation could be run on the compute nodes.
83
CHAPTER 7
CONCLUSIONS
This work was attempting to improve the run time of the homomorphic encryption library,
HElib. By applying distribute system techniques, which would suit the intended deploy-
ment environment for HElib, we tried to improve the run times of operations being per-
formed by HElib. Two libraries were designed: GPUHElib and DistributedHElib.
GPUHElib attempted to add GPU functionality to HElib, in order to cut down on the
run times for the operations. The design of these operations on the GPU required mem-
ory mapping from the CPU to the GPU, overflow considerations for GPU operations, and
pipeline techniques be applied. Unfortunately, as these tests showed, this design did not
perform better than the serial version. This was due to the memory transfer times from
CPU to GPU being much too large to facilitate a speedup, even though the operation
times were much lower. With further work however, this design might become viable.
DistributedHElib applied distributed computing techniques in an attempt to speed up
the run time of HElib. This design utilized a master-slave cluster architecture and non-
block send and receive function to facilitate concurrent computation. Again as the tests
84
showed, this design failed to perform better than the serial version. This was because the
network speed caused a bottleneck, and made the run times drastically slower than the
serial version. Similarly, with future work, this design might become a viable option.
While both of these variants were unsuccessful, they have promise, and in the future
might be useful in the design of faster homomorphic encryption libraries.
85
BIBLIOGRAPHY
[1] Zvika Brakerski. Fully homomorphic encryption without modulus switching from classical
gapsvp. Cryptology ePrint Archive, Report 2012/078, 2012. http://eprint.iacr.
org/.
[2] Zvika Brakerski, Craig Gentry, and Shai Halevi. Packed ciphertexts in lwe-based homomor-
phic encryption. In Public-Key Cryptography–PKC 2013, pages 1–13. Springer, 2013.
[3] Zvika Brakerski, Craig Gentry, and Vinod Vaikuntanathan. Fully homomorphic encryption
without bootstrapping. Cryptology ePrint Archive, Report 2011/277, 2011. http://
eprint.iacr.org/.
[4] Zvika Brakerski and Vinod Vaikuntanathan. Efficient fully homomorphic encryption from
(standard) lwe. Cryptology ePrint Archive, Report 2011/344, 2011. http://eprint.
iacr.org/.
[5] Jean-Se´bastien Coron, David Naccache, and Mehdi Tibouchi. Public key compression and
modulus switching for fully homomorphic encryption over the integers. In Advances in
Cryptology–EUROCRYPT 2012, pages 446–464. Springer, 2012.
[6] Craig Gentry. A fully homomorphic encryption scheme. PhD thesis, Stanford University,
2009.
[7] Craig Gentry and Shai Halevi. Implementing gentry’s fully-homomorphic encryption
scheme. Cryptology ePrint Archive, Report 2010/520, 2010. http://eprint.iacr.
org/.
[8] Craig Gentry and Shai Halevi. Fully homomorphic encryption without squashing using
depth-3 arithmetic circuits. Cryptology ePrint Archive, Report 2011/279, 2011. http:
//eprint.iacr.org/.
86
[9] Craig Gentry, Amit Sahai, and Brent Waters. Homomorphic encryption from learning with
errors: Conceptually-simpler, asymptotically-faster, attribute-based. Cryptology ePrint
Archive, Report 2013/340, 2013. http://eprint.iacr.org/.
[10] Shai Halevi and Victor Shoup. Design and implementation of a homomorphic-encryption
library. 2013.
[11] Shai Halevi and Victor Shoup. Algorithms in helib. Cryptology ePrint Archive, Report
2014/106, 2014. http://eprint.iacr.org/.
[12] Shai Halevi and Victor Shoup. Bootstrapping for helib. Cryptology ePrint Archive, Report
2014/873, 2014. http://eprint.iacr.org/.
[13] Adriana Lopez-Alt, Eran Tromer, and Vinod Vaikuntanathan. On-the-fly multiparty com-
putation on the cloud via multikey fully homomorphic encryption. Cryptology ePrint
Archive, Report 2013/094, 2013. http://eprint.iacr.org/.
[14] Ronald L Rivest, Len Adleman, and Michael L Dertouzos. On data banks and privacy
homomorphisms. Foundations of secure computation, 4(11):169–180, 1978.
[15] Nigel P Smart and Frederik Vercauteren. Fully homomorphic encryption with relatively
small key and ciphertext sizes. In Public Key Cryptography–PKC 2010, pages 420–443.
Springer, 2010.
[16] N.P. Smart and F. Vercauteren. Fully homomorphic simd operations. Cryptology ePrint
Archive, Report 2011/133, 2011. http://eprint.iacr.org/.
[17] Damien Stehle and Ron Steinfeld. Faster fully homomorphic encryption. Cryptology
ePrint Archive, Report 2010/299, 2010. http://eprint.iacr.org/.
87
APPENDICES
88
APPENDIX A
KERNELS
A.1 Addition
A.1.1 Addition of two DoubleCRT objects
Listing A.1: Addition Kernel for Two DoubleCRT
g l o b a l void vectorAddMod ( long ∗ vec to r A ,
long ∗ v e c t o r B ,
long width ,
long sub wid th ,
long o f f s e t ) {
long t i d ;
t i d = o f f s e t + b l o c k I d x . x ∗ blockDim . x + t h r e a d I d x . x ;
whi le ( ( t i d−o f f s e t ) < wid th ) {
/ / e x t r a c t modulus from ar ray
u i n t 6 4 t modulus =
( u i n t 6 4 t ) v e c t o r m o d u l i [ t i d / s u b w i d t h ] ;
/ / compute sum
u i n t 6 4 t sum =
( u i n t 6 4 t ) v e c t o r A [ t i d ] +
( u i n t 6 4 t ) v e c t o r B [ t i d ] ;
/ / pe r fo rm modulus ope ra t i o n , s t o r e r e s u l t
v e c t o r A [ t i d ] =
( long ) ( sum − modulus ∗ ( sum / modulus ) ) ;
t i d += blockDim . x ∗ gridDim . x ;
89
}
re turn ;
}
A.1.2 Addition of a DoubleCRT object and a constant
Listing A.2: Addition Kernel for a DoubleCRT and a constant
g l o b a l void vectorAddMod ( long ∗ vec to r A ,
long width ,
long sub wid th ,
long o f f s e t ) {
long t i d ;
t i d = o f f s e t + b l o c k I d x . x ∗ blockDim . x + t h r e a d I d x . x ;
whi le ( ( t i d−o f f s e t ) < wid th ) {
/ / e x t r a c t modulus from ar ray
u i n t 6 4 t modulus =
( u i n t 6 4 t ) v e c t o r m o d u l i [ t i d / s u b w i d t h ] ;
/ / compute sum
u i n t 6 4 t sum =
( u i n t 6 4 t ) v e c t o r A [ t i d ] +
( u i n t 6 4 t ) v e c t o r n s [ t i d / s u b w i d t h ] ;
/ / pe r fo rm modulus ope ra t i o n , s t o r e r e s u l t
v e c t o r A [ t i d ] =
( long ) ( sum − modulus ∗ ( sum / modulus ) ) ;
t i d += blockDim . x ∗ gridDim . x ;
}
re turn ;
}
A.2 Subtraction
A.2.1 Subtraction of two DoubleCRT objects
90
Listing A.3: Subtraction Kernel for Two DoubleCRT
g l o b a l void vectorSubMod ( long ∗ vec to r A ,
long ∗ v e c t o r B ,
long width ,
long sub wid th ,
long o f f s e t ) {
long t i d ;
t i d = o f f s e t + b l o c k I d x . x ∗ blockDim . x + t h r e a d I d x . x ;
whi le ( ( t i d−o f f s e t ) < wid th ) {
/ / e x t r a c t modulus from ar ray
u i n t 6 4 t modulus =
( u i n t 6 4 t ) v e c t o r m o d u l i [ t i d / s u b w i d t h ] ;
/ / compute d i f f e r e n c e
/ / add modulus t o en su r e r e s u l t i s > 0 ,
/ / w i l l be removed when mod by modulus
u i n t 6 4 t d i f f =
( u i n t 6 4 t ) v e c t o r A [ t i d ] +
modulus −
( u i n t 6 4 t ) v e c t o r B [ t i d ] ;
/ / pe r fo rm modulus ope ra t i o n , s t o r e r e s u l t
v e c t o r A [ t i d ] =
( long ) ( d i f f − modulus ∗ ( d i f f / modulus ) ) ;
t i d += blockDim . x ∗ gridDim . x ;
}
re turn ;
}
A.2.2 Subtraction of a DoubleCRT object and a constant
Listing A.4: Subtraction Kernel for a DoubleCRT and a constant
g l o b a l void vectorSubMod ( long ∗ vec to r A ,
long width ,
91
long sub wid th ,
long o f f s e t ) {
long t i d ;
t i d = o f f s e t + b l o c k I d x . x ∗ blockDim . x + t h r e a d I d x . x ;
whi le ( ( t i d−o f f s e t ) < wid th ) {
/ / e x t r a c t modulus from ar ray
u i n t 6 4 t modulus =
( u i n t 6 4 t ) v e c t o r m o d u l i [ t i d / s u b w i d t h ] ;
/ / compuate d i f f e r e n c e
/ / add modulus t o en su r e r e s u l t i s > 0 ,
/ / w i l l be removed when mod by modulus
u i n t 6 4 t d i f f =
( u i n t 6 4 t ) v e c t o r A [ t i d ] +
modulus −
( u i n t 6 4 t ) v e c t o r n s [ t i d / s u b w i d t h ] ;
/ / pe r fo rm modulus ope ra t i o n , s t o r e r e s u l t
v e c t o r A [ t i d ] =
( long ) ( d i f f − modulus ∗ ( d i f f / modulus ) ) ;
t i d += blockDim . x ∗ gridDim . x ;
}
re turn ;
}
A.3 Multiplication
A.3.1 Multiplication of two DoubleCRT objects
Listing A.5: Multiplication Kernel for Two DoubleCRT
g l o b a l void vectorMultMod ( long ∗ vec to r A ,
long ∗ v e c t o r B ,
long width ,
long sub wid th ,
long o f f s e t ) {
92
long t i d ;
t i d = o f f s e t + b l o c k I d x . x ∗ blockDim . x + t h r e a d I d x . x ;
whi le ( ( t i d−o f f s e t ) < wid th ) {
/ / e x t r a c t modulus from ar ray
u i n t 6 4 t modulus =
( u i n t 6 4 t ) v e c t o r m o d u l i [ t i d / s u b w i d t h ] ;
/ / compute i n t e rm e d i a t e v a l u e s
u i n t 6 4 t a1 =
( u i n t 6 4 t ) v e c t o r A [ t i d ] / 4294967296;
u i n t 6 4 t a2 =
( u i n t 6 4 t ) v e c t o r A [ t i d ] − 4294967296 ∗ a1 ;
u i n t 6 4 t b1 =
( u i n t 6 4 t ) v e c t o r B [ t i d ] / 4294967296;
u i n t 6 4 t b2 =
( u i n t 6 4 t ) v e c t o r B [ t i d ] − 4294967296 ∗ b1 ;
/ / compute z0 , per fo rm modulus
u i n t 6 4 t z0 = a2 ∗ b2 ;
z0 = z0 − modulus ∗ ( z0 / modulus ) ;
/ / compute p a r t o f z1
u i n t 6 4 t p12 = a1 ∗ b2 ;
/ / compute p a r t o f z1
u i n t 6 4 t p21 = a2 ∗ b1 ;
/ / compute z1 , per fo rm modulus
u i n t 6 4 t z1 = p12 + p21 ;
z1 = z1 − modulus ∗ ( z1 / modulus ) ;
/ / compute z2
u i n t 6 4 t z2 = a1 ∗ b1 ;
/ / l oop t o g e t z2 ∗2ˆ32 and z1 ∗2ˆ32
i n t i ;
f o r ( i =0 ; i <32; i ++) {
z1 = z1 ∗ 2 ;
i f ( z1 >= modulus ) {
93
z1 =
z1 − modulus ∗
( z1 / modulus ) ;
}
z2 = z2 ∗ 2 ;
i f ( z2 >= modulus ) {
z2 =
z2 − modulus ∗
( z2 / modulus ) ;
}
}
/ / l oop t o g e t z2 ∗2ˆ32∗2ˆ32= z2 ∗2ˆ64
f o r ( i = i ; i <64; i ++) {
z2 = z2 ∗ 2 ;
i f ( z2 >= modulus ) {
z2 =
z2 − modulus ∗
( z2 / modulus ) ;
}
}
/ / compute f i n a l v a l u e
u i n t 6 4 t z = z0 + z1 + z2 ;
/ / pe r fo rm modulus ope ra t i o n , s t o r e r e s u l t
v e c t o r A [ t i d ] =
( long ) ( z − modulus ∗ ( z / modulus ) ) ;
t i d += blockDim . x ∗ gridDim . x ;
}
re turn ;
}
A.3.2 Multiplication of a DoubleCRT object and a constant
94
Listing A.6: Multiplication Kernel for a DoubleCRT and a constant
g l o b a l void vectorMultMod ( long ∗ vec to r A ,
long width ,
long sub wid th ,
long o f f s e t ) {
long t i d ;
t i d = o f f s e t + b l o c k I d x . x ∗ blockDim . x + t h r e a d I d x . x ;
whi le ( ( t i d−o f f s e t ) < wid th ) {
/ / e x t r a c t modulus from ar ray
u i n t 6 4 t modulus =
( u i n t 6 4 t ) v e c t o r m o d u l i [ t i d / s u b w i d t h ] ;
/ / compute i n t e rm e d i a t e v a l u e s
u i n t 6 4 t a1 =
( u i n t 6 4 t ) v e c t o r A [ t i d ] / 4294967296;
u i n t 6 4 t a2 =
( u i n t 6 4 t ) v e c t o r A [ t i d ] −
4294967296 ∗ a1 ;
u i n t 6 4 t b1 =
( u i n t 6 4 t ) v e c t o r n s [ t i d / s u b w i d t h ] /
4294967296;
u i n t 6 4 t b2 =
( u i n t 6 4 t ) v e c t o r n s [ t i d / s u b w i d t h ] −
4294967296 ∗ b1 ;
/ / compute z0 , per fo rm modulus
u i n t 6 4 t z0 = a2 ∗ b2 ;
z0 = z0 − modulus ∗ ( z0 / modulus ) ;
/ / compute p a r t o f z1
u i n t 6 4 t p12 = a1 ∗ b2 ;
/ / compute p a r t o f z1
u i n t 6 4 t p21 = a2 ∗ b1 ;
/ / compute z1 , per fo rm modulus
u i n t 6 4 t z1 = p12 + p21 ;
95
z1 = z1 − modulus ∗ ( z1 / modulus ) ;
/ / compute z2
u i n t 6 4 t z2 = a1 ∗ b1 ;
/ / l oop t o g e t z2 ∗2ˆ32 and z1 ∗2ˆ32
i n t i ;
f o r ( i =0 ; i <32; i ++) {
z1 = z1 ∗ 2 ;
i f ( z1 > modulus ) {
z1 = z1 − modulus ∗
( z1 / modulus ) ;
}
z2 = z2 ∗ 2 ;
i f ( z2 > modulus ) {
z2 = z2 − modulus ∗
( z2 / modulus ) ;
}
}
/ / l oop t o g e t z2 ∗2ˆ32∗2ˆ32= z2 ∗2ˆ64
f o r ( i = i ; i <64; i ++) {
z2 = z2 ∗ 2 ;
i f ( z2 > modulus ) {
z2 = z2 − modulus ∗
( z2 / modulus ) ;
}
}
/ / compute f i n a l v a l u e
u i n t 6 4 t z = z0 + z1 + z2 ;
/ / pe r fo rm modulus ope ra t i o n , s t o r e r e s u l t
v e c t o r A [ t i d ] =
( long ) ( z − modulus ∗ ( z / modulus ) ) ;
t i d += blockDim . x ∗ gridDim . x ;
}
96
re turn ;
}
97
APPENDIX B
VECTOR MANAGEMENT
B.1 Device Vector Management
Listing B.1: Device Vector Management
void i n i t v e c t o r ( long ∗∗ v e c t o r ,
long s i z e ,
long num elements ,
long ∗ l e n g t h ) {
/ / pe r fo rm i n i t i a l a l l o c a t i o n
i f (∗ v e c t o r == NULL) {
i f ( c u d a S u c c e s s !=
cudaMal loc ( ( void ∗∗ ) v e c t o r , s i z e ) )
{
GPU error ( F ILE ,
LINE ,
c u d a G e t L a s t E r r o r ( ) ) ;
}
∗ l e n g t h = num elements ;
/ / r e s i z e i f v e c t o r i s t oo sma l l
} e l s e i f (∗ l e n g t h < num elements ) {
c u d a F r e e (∗ v e c t o r ) ;
i f ( c u d a S u c c e s s !=
cudaMal loc ( ( void ∗∗ ) v e c t o r , s i z e ) )
{
98
GPU error ( F ILE ,
LINE ,
c u d a G e t L a s t E r r o r ( ) ) ;
}
∗ l e n g t h = num elements ;
}
}
B.2 Compute Node Buffer Management
Listing B.2: Compute Node Buffer Management
void i n i t v e c t o r ( long ∗∗ v e c t o r ,
long ∗ v e c t o r l e n g t h ,
long n e e d l e n g t h ) {
/ / pe r fo rm t h e i n i t i a l a l l o c a t i o n
i f (∗ v e c t o r == NULL) {
∗ v e c t o r =
( long ∗ ) ma l l oc ( s i z e o f ( long ) ∗
n e e d l e n g t h ) ;
∗ v e c t o r l e n g t h = n e e d l e n g t h ;
/ / r e s i z e i f b u f f e r i s t oo sma l l
} e l s e i f (∗ v e c t o r l e n g t h < n e e d l e n g t h ) {
∗ v e c t o r =
( long ∗ ) r e a l l o c ( v e c t o r ,
s i z e o f ( long ) ∗ n e e d l e n g t h ) ;
∗ v e c t o r l e n g t h = n e e d l e n g t h ;
}
}
99
APPENDIX C
CONCURRENCY MANAGEMENT
C.1 Device Stream Management
Listing C.1: Device Stream Management
void c r e a t e s t r e a m s ( long n e e d n u m s t r e a m s ) {
f o r ( long i = num st reams ;
i<n e e d n u m s t r e a m s ; i ++) {
i f ( c u d a S u c c e s s !=
c u d a S t r e a m C r e a t e (& s t r e a m [ i ] ) )
{
GPU error ( F ILE ,
LINE ,
c u d a G e t L a s t E r r o r ( ) ) ;
}
}
num st reams = n e e d n u m s t r e a m s ;
}
void i n i t s t r e a m s ( long n e e d n u m s t r e a m s ) {
i f ( num st reams == 0) {
s t r e a m = ( c u d a S t r e a m t ∗ )
ma l l oc ( n e e d n u m s t r e a m s
∗ s i z e o f ( c u d a S t r e a m t ) ) ;
100
c r e a t e s t r e a m s ( n e e d n u m s t r e a m s ) ;
} e l s e i f ( num st reams <
n e e d n u m s t r e a m s ) {
s t r e a m =
( c u d a S t r e a m t ∗ ) r e a l l o c ( s t ream ,
n e e d n u m s t r e a m s ∗
s i z e o f ( c u d a S t r e a m t ) ) ;
c r e a t e s t r e a m s ( n e e d n u m s t r e a m s ) ;
}
}
C.2 Synchronization Management
Listing C.2: Synchronization Management
void sync ( ) {
/ / u n t i l t h e queue i s empty
whi le ( ! r e q u e s t q u e u e . empty ( ) ) {
/ / g e t t h e f i r s t r e q u e s t
/ / ou t o f t h e queue
MPI : : Reques t r e q u e s t =
r e q u e s t q u e u e . f r o n t ( ) ;
r e q u e s t q u e u e . pop ( ) ;
/ / i f u n f i n i s h e d , pu t back
i f ( ! r e q u e s t . T e s t ( ) ) {
r e q u e s t q u e u e . push ( r e q u e s t ) ;
}
}
}
101
