Parallel and in-process compilation of individuals for genetic
  programming on GPU by Ayral, Hakan & Albayrak, Songül
Hakan Ayral
hayral@gmail.com
Songül Albayrak
songul@ce.yildiz.edu.tr
April 2017
arXiv:1705.07492v1 [cs.NE], 21 May 2017
Abstract
Three approaches to implement genetic programming on GPU hard-
ware are compilation, interpretation and direct generation of machine
code. The compiled approach is known to have a prohibitive overhead
compared to the other two.
This paper investigates methods to accelerate compilation of individu-
als for genetic programming on GPU hardware. We apply in-process com-
pilation to minimize the compilation overhead at each generation; and we
investigate ways to parallelize in-process compilation. In-process
compilation does not lend itself to trivial parallelization with threads; we
propose a multiprocess parallelization using memory sharing and the operating
system’s interprocess communication primitives. With parallelized compilation
we achieve further reductions in compilation overhead. Another contribution
of this work is the code framework we built in C# for the experiments.
The framework makes it possible to build arbitrary grammatical genetic
programming experiments that run on GPU with minimal extra coding
effort, and is available as open source.
Introduction
Genetic programming is an evolutionary computation technique, where the
objective is to find a program (i.e. a simple expression, a sequence of
statements, or a full-scale function) that satisfies a behavioral
specification expressed as test cases along with expected results.
Grammatical genetic programming is a subfield of genetic programming, where
the search space is restricted to a language defined as a BNF grammar, thus
ensuring that all individuals are syntactically valid.
The processing power provided by graphics processing units (GPUs) makes them
an attractive platform for evolutionary computation due to the inherently
parallelizable nature of the latter. The first genetic programming
implementations shown to run on GPUs were [2] and [5].
Just like in the CPU case, genetic programming on GPU requires the code
represented by individuals to be rendered to an executable form; this can be
achieved by compilation to an executable binary object, by conversion to an
intermediate representation of a custom interpreter developed to run on GPU,
or by directly generating machine code for the GPU architecture. Compilation
of individuals’ code for GPU is known to have a prohibitive overhead that is
hard to offset with the gains from GPU acceleration.
The compiled approach for genetic programming on GPU is especially important
for grammatical genetic programming; the representations of individuals for
linear and cartesian genetic programming are inherently suitable for simple
interpreters and circuit simulators implementable on a GPU. Grammatical
genetic programming, on the other hand, aims to make higher level constructs
and structures representable, using individuals that represent strings of
tokens belonging to a language defined by a grammar; unfortunately, executing
such a representation sooner or later requires some form of compilation or
complex interpretation.
In this paper we first present three benchmark problems we implemented
to measure compilation times with. We use grammatical genetic programming
for the experiments, therefore we define the benchmark problems with their
grammars, test cases and fitness functions.
Then we set a baseline by measuring the compilation time of individuals for
those three problems, using the conventional CUDA compiler Nvcc. Afterwards
we measure the speedup obtained by the in-process compilation using the same
benchmark problem setups. We proceed by presenting the obstacles encountered
when parallelizing in-process compilation. Finally we propose a parallelization
scheme for in-process compilation, and measure the extra speedup achieved.
Prior Work
[6] deals with the compilation overhead of individuals for genetic programming
on GPU using CUDA. The article proposes a distributed compilation scheme where
a cluster of around 16 computers compiles different individuals in parallel,
and states the need for a large number of fitness cases to offset the
compilation overhead. It correctly predicts that this mismatch will get worse
with an increasing number of cores on GPUs, but also states that “a large
number of classic benchmark GP problems fit into this category”. Based on
figure 5 of the article it can be computed that for a population size of 256,
the authors required 25 ms/individual in total¹.
¹ This number includes network traffic, XO, mutation and processing time on
GPU, in addition to compilation times. In our case the difference between
compilation time and total time has constantly been at sub-millisecond level
per population on all problems; thus for comparison purposes the compile times
we present can also be taken as total time with an error margin of
1 ms / pop. size.
[9] presents the first use of grammatical genetic programming on the GPU,
applied to a string matching problem to improve gzip compression, with a
grammar constructed from fragments of an existing string matching CUDA code.
Based on figure 11 of the accompanying technical report [10], a population of
1000 individuals (10 kernels of 100 individuals each) takes around 50 seconds
to compile using nvcc from the CUDA v2.3 SDK, which puts the average
compilation time at approximately 50 ms/individual.
In [11] an overview of genetic programming on GPU hardware is provided,
along with a brief presentation and comparison of compiled and interpreted
approaches. As part of the comparison it underlines the trade-off between the
speed of compiled code versus the overhead of compilation, and states that
the command line CUDA compiler was especially slow, which is why the
interpreted approach is usually preferred.
[15] investigates the acceleration of grammatical evolution by use of GPUs,
considering the performance impact of different design decisions such as
thread/block granularity, different types of memory on GPU, and host-device
memory transactions. As part of the article, compilation to PTX form and
loading to the GPU with JIT compilation at driver level is compared with
compiling directly to a CUBIN object and loading to the GPU without further
JIT compilation. A kernel containing 90 individuals takes 540 ms to compile to
CUBIN with sub-millisecond upload time to GPU, versus 450 ms for compilation
to PTX plus 80 ms for JIT compilation and upload to GPU, using the nvcc
compiler from the CUDA v3.2 SDK. Thus the PTX+JIT case, which is the faster of
the two, achieves an average compilation time of approximately
5.89 ms/individual.
[12] proposes an approach for improving compilation times of individuals for
genetic programming on GPU, where common statements on similar locations
are aligned as much as possible across individuals. After alignment individuals
with overlaps are merged to common kernels such that aligned statements be-
come a single statement, and diverging statements are enclosed with conditionals
to make them part of the code path only if the value of the individual ID
parameter matches an individual having that divergent statement. The authors
state that in exchange for faster compilation times, they get slightly slower
GPU runtime with merged kernels, as all individuals need to evaluate every
condition at the entry of each divergent code block coming from different
individuals. In the results it is stated that for individuals with 300
instructions, compile time is 347 ms/individual if unaligned, and
72 ms/individual if aligned (time for alignment itself not included) with the
nvcc compiler from the CUDA v3.2 SDK.
[3] provides a comparison of compilation, interpretation and direct generation
of machine code methods for genetic programming on GPUs. Five benchmark
problems consisting of Mexican Hat and Salutowicz regressions, Mackey-Glass
time series forecast, Sobel Filter and 20-bit Multiplexer are used to measure
the comparative speed of the three mentioned methods. It is stated that the
compilation method uses the nvcc compiler from the CUDA v5.5 SDK. A
compilation time breakdown is only provided for the Mexican Hat regression
benchmark in Table 6, where it is stated that total nvcc compilation took
135,027 seconds and total JIT compilation took 106,458 seconds. Table 5 states
that the Mexican Hat problem uses 400K generations and a population size
of 36. Therefore we can say that an average compilation time of
$(135{,}027 + 106{,}458) / (36 \times 400{,}000)$ s $\approx 16.77$
ms/individual is achieved.
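The per-individual figures derived in this section are simple divisions of a total compile time by the number of individuals compiled. The sketch below redoes that arithmetic for three of the cited works, using only the numbers quoted above; the function and variable names are illustrative, and the printed values agree with the quoted ones up to rounding in the last digit.

```python
# Per-individual compile times implied by the figures quoted in this section.
# All inputs are numbers cited from prior work; nothing is measured here.

def ms_per_individual(total_seconds: float, individuals: int) -> float:
    """Convert a total compile time in seconds to milliseconds per individual."""
    return total_seconds * 1000.0 / individuals

# [9]/[10]: 1000 individuals compiled in about 50 seconds.
t_ref9 = ms_per_individual(50.0, 1000)

# [15]: 450 ms (PTX) + 80 ms (JIT and upload) for one kernel of 90 individuals.
t_ref15 = ms_per_individual((450.0 + 80.0) / 1000.0, 90)

# [3]: nvcc plus JIT totals over 400K generations of 36 individuals each.
t_ref3 = ms_per_individual(135_027 + 106_458, 36 * 400_000)

print(f"[9]: {t_ref9:.2f}  [15]: {t_ref15:.2f}  [3]: {t_ref3:.2f} ms/individual")
```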
Implemented Problems for Measurement
We implemented three problems as benchmarks to compare compilation speed.
They consist of a general program synthesis problem, Keijzer-6 as a regression
problem [8], and 5-bit Multiplier as a multi-output boolean problem. The latter
two are included in the “Alternatives to blacklisted problems” table in [16].
We use grammatical genetic programming as our representation and phenotype
production method; therefore all problems are defined with a BNF grammar that
defines a search space of syntactically valid programs, along with some
test cases and a fitness function specific to the problem. For all three
problems, a genotype, which is a list of (initially random) integers, derives
to a phenotype which is a valid CUDA C expression, or a code block in the form
of a list of statements. All individuals are prepended and appended with
initialization and finalization code, which serves to set up the input state
and to write the output to GPU memory afterwards. See the Appendix for the BNF
grammars and the code used to surround the individuals.
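To make the genotype-to-phenotype derivation concrete, the following is a minimal sketch of a grammatical evolution style mapper. The toy grammar is hypothetical and is not one of the paper's grammars. The production choice rule (codon modulo number of productions) and the cyclic reuse of codons are the usual grammatical evolution conventions, assumed here rather than taken from the text; as a simplification, a codon is consumed even when a non-terminal has a single production.

```python
from typing import Dict, List

# Hypothetical toy grammar (NOT the paper's): each non-terminal maps to a list
# of productions, and each production is a list of symbols.
GRAMMAR: Dict[str, List[List[str]]] = {
    "<expr>": [["<expr>", "<op>", "<expr>"], ["<var>"], ["<const>"]],
    "<op>": [["+"], ["-"], ["*"]],
    "<var>": [["x"]],
    "<const>": [["1.0f"], ["2.0f"]],
}

def derive(genotype: List[int], start: str = "<expr>", max_steps: int = 100) -> str:
    """Expand the leftmost non-terminal repeatedly, choosing each production
    with codon % number_of_productions. Codons are reused cyclically."""
    symbols = [start]
    used = 0
    for _ in range(max_steps):
        idx = next((i for i, s in enumerate(symbols) if s in GRAMMAR), None)
        if idx is None:
            return " ".join(symbols)  # only terminals remain
        options = GRAMMAR[symbols[idx]]
        codon = genotype[used % len(genotype)]
        used += 1
        symbols[idx:idx + 1] = options[codon % len(options)]
    raise ValueError("derivation did not terminate within max_steps")

print(derive([0, 1, 7, 2, 5]))
```

Because every expansion follows a production of the grammar, any derivation that terminates is syntactically valid by construction, which is the property the text relies on.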
Search Problem
Search Problem is designed to evolve a function which can identify whether a
search value is present in an integer list, and return its position if present or
return -1 otherwise.
We first proposed this problem as a general program synthesis benchmark
in [1]. The grammar for the problem is inspired by [14]; we designed it to be
a subproblem of the more general integer sort problem, along with some
others. It also bears some similarity to the problems presented in [7], in the
generality of its use case combined with the simplicity of its implementation.
Test cases consist of unordered lists of random integers in the range [0, 50],
and list lengths vary between 3 and 20. Test cases are randomly generated,
but half of them are ensured to contain the searched value, and the others are
ensured not to contain it. We employed a binary fitness function, which
returns 1 if the returned result is correct (the position of the searched
value, or -1 if it is not present in the list) or 0 if it is not correct;
hence the fitness of an individual is the sum of its fitnesses over all test
cases, which the evolutionary engine tries to maximize.
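The test case generation and binary fitness described above can be sketched on the CPU as follows. This is an illustration only: the function names are hypothetical, the alternation between present and absent cases is one simple way to get the half-and-half split, and accepting any index that holds the searched value (when the list contains duplicates) is an assumption the text does not settle.

```python
import random
from typing import Callable, List, Tuple

TestCase = Tuple[List[int], int]  # (unordered list, searched value)

def make_test_cases(count: int, rng: random.Random) -> List[TestCase]:
    """Even-indexed cases contain the searched value, odd-indexed ones do not."""
    cases = []
    for i in range(count):
        values = [rng.randint(0, 50) for _ in range(rng.randint(3, 20))]
        if i % 2 == 0:
            target = rng.choice(values)
        else:
            # At most 20 of the 51 possible values are used, so one is always free.
            target = rng.choice([v for v in range(51) if v not in values])
        cases.append((values, target))
    return cases

def is_correct(result: int, values: List[int], target: int) -> bool:
    """Assumption: any index holding the target counts as a correct position."""
    if target not in values:
        return result == -1
    return 0 <= result < len(values) and values[result] == target

def fitness(candidate: Callable[[List[int], int], int], cases: List[TestCase]) -> int:
    """Sum of the binary (0 or 1) scores over all test cases; higher is better."""
    return sum(1 for values, target in cases
               if is_correct(candidate(values, target), values, target))

def linear_search(values: List[int], target: int) -> int:
    """Reference solution used only to exercise the fitness function."""
    for i, v in enumerate(values):
        if v == target:
            return i
    return -1

cases = make_test_cases(32, random.Random(0))
print(fitness(linear_search, cases), fitness(lambda values, target: -1, cases))
```

A candidate that always answers -1 is correct only on the half of the cases where the value is absent, which is why ensuring the half-and-half split matters for the fitness landscape.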
Keijzer-6
The Keijzer-6 function, introduced in [8], is the function
$K6(x) = \sum_{n=1}^{x} \frac{1}{n}$, which maps a single integer parameter to
the partial sum of the harmonic series with the number of terms indicated by
its parameter. Regression of the Keijzer-6 function is one of the recommended
alternatives to replace simpler symbolic regression problems like the quartic
polynomial [16].
For this problem we used a modified version of the grammar given in [13]
and [4], the only modification being an increased ratio of constant and
variable tokens as the expression nesting gets deeper. We used the root mean
squared error as the fitness function, which is the accepted practice for this
problem.
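As a small illustration of the target function and of the fitness measure, the sketch below evaluates Keijzer-6 and the root mean squared error of a candidate expression. The sample points 1..64 and the logarithmic candidate are assumptions made for the example (the text only states that 64 test cases were used); the names are illustrative.

```python
import math
from typing import Callable, Iterable

def keijzer6(x: int) -> float:
    """Partial sum of the harmonic series: 1/1 + 1/2 + ... + 1/x."""
    return sum(1.0 / n for n in range(1, x + 1))

def rmse(candidate: Callable[[int], float], xs: Iterable[int]) -> float:
    """Root mean squared error of a candidate against Keijzer-6 (lower is better)."""
    xs = list(xs)
    squared = [(candidate(x) - keijzer6(x)) ** 2 for x in xs]
    return math.sqrt(sum(squared) / len(xs))

# Assumed sample points; the paper does not list the 64 inputs it used.
SAMPLES = range(1, 65)

# A hand-written candidate: the classic ln(x) + gamma approximation.
def log_candidate(x: int) -> float:
    return math.log(x) + 0.5772

print(rmse(keijzer6, SAMPLES), rmse(log_candidate, SAMPLES))
```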
5-bit multiplier
The 5-bit multiplier problem consists of finding a boolean relation that maps
10 binary inputs to 10 binary outputs, where two groups of 5 inputs each
represent an integer up to $2^5 - 1$ in binary, and the output represents a
single integer up to $2^{10} - 1$, such that the output is the product of the
two input numbers. This problem is generally attacked as 10 independent binary
regression problems, with each bit of the output separately evolved as a
circuit or boolean function.
It is easy to show that the number of n-bit input, m-bit output binary
relations is $2^{m 2^n}$, which grows super-exponentially. The multiple output
multiplier is the recommended alternative to the Multiplexer and Parity
problems in [16].
We transfer input to and output from the GPU with bits packed as a single
32-bit integer; hence there is a code preamble before the first individual to
unpack the input bits, and a postamble after each individual to pack the 10
bits computed by the evolved expressions into an integer.
The fitness function for the 5-bit multiplier computes the number of bits that
differ between the individual’s response and the correct answer, by computing
the pop count of the two XORed.
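The packing scheme and the XOR plus pop count fitness can be sketched as below. The bit layout (first operand in the upper five input bits, second operand in the lower five) is an assumption for illustration; the actual layout is defined by the preamble and postamble code in the Appendix. All names are illustrative.

```python
from typing import List

def pack_inputs(a: int, b: int) -> int:
    """Pack two 5-bit operands into one integer (assumed layout: a high, b low)."""
    return ((a & 0x1F) << 5) | (b & 0x1F)

def unpack_bits(word: int, count: int) -> List[int]:
    """Split the lowest `count` bits of word into a list, least significant first."""
    return [(word >> i) & 1 for i in range(count)]

def pack_bits(bits: List[int]) -> int:
    """Inverse of unpack_bits: combine bits (least significant first) into an integer."""
    return sum(bit << i for i, bit in enumerate(bits))

def bit_errors(response: int, a: int, b: int) -> int:
    """Number of output bits that differ from a * b: pop count of the XOR."""
    return bin((response ^ (a * b)) & 0x3FF).count("1")

# All 2^(5+5) = 1024 input combinations, evaluated for a perfect individual.
total_errors = sum(bit_errors(a * b, a, b) for a in range(32) for b in range(32))
print(total_errors, bit_errors(0, 31, 31))
```

With this formulation a fitness of 0 over all 1024 test cases means every output bit is correct, so the evolutionary engine minimizes the value.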
Development and Experiment Setup
Hardware Platform
All experiments have been conducted on a dual Xeon E5-2670 platform (8
physical, 16 logical cores per CPU; 32 logical cores in total) running at
2.6 GHz, equipped with 60 GB RAM, along with dual SSD storage and four NVidia
GRID K520 GPUs. Each GPU consists of 1536 cores spread over 8 multiprocessors
running at 800 MHz, along with 4 GB GDDR5 RAM², and is able to sustain 2
teraflops of single precision operations (in total 6144 cores and 16 GB GDDR5
VRAM, which can theoretically sustain 8 teraflops of single precision
computation assuming no other bottlenecks). GPUs are accessed for computation
through the NVidia CUDA v8 API and libraries, running on top of the Windows
Server 2012 R2 operating system.
Development Environment
Code related to grammar generation, parsing, derivation, genetic programming,
evolution, fitness computation and GPU access has been implemented in C#,
using managedCuda³ for the CUDA API bindings and NVRTC interface, along with
CUDAfy.NET⁴ for interfacing to the NVCC command line compiler. The grammars
for the problems have been prepared such that the languages defined are valid
subsets of the CUDA C language, specialized towards the respective problems.
² See validation of the hardware used in the experiments: http://www.techpowerup.com/gpuz/details/7u5xd/
³ https://kunzmi.github.io/managedCuda/
⁴ https://cudafy.codeplex.com/
Experiment Parameters
We ran each experiment with population sizes starting from 20 individuals per
population, going up to 300 in increments of 20. As the subject of interest is
compilation time and not fitness, we measured the following three parameters
to evaluate compilation speed:
(i) ptx : CUDA source code to PTX compilation time per individual
(ii) jit : PTX to CUBIN object compilation time per individual
(iii) other : All remaining operations a GP cycle requires (i.e. running the
compiled individuals on GPU, downloading produced results, computing fitness
values, evolutionary selection, crossover, mutation, etc.)
The value of other is measured to be always at sub-millisecond level, in all
experiments, for all problems and for all population sizes. Therefore it does
not appear on plots. For all practical purposes ptx + jit can be considered as
the total time cost of a complete cycle for a generation, with an error margin
of 1 ms / pop. size.
Each data point on the plots corresponds to the average of one of those
measurements for the corresponding (population size, measurement type,
experiment) triple. Each average is computed over the measurement values
obtained for the first 10 generations of 15 different populations of the given
size (thus effectively the compile times of 150 generations are averaged). The
reason for not using 150 generations of a single population directly is that a
population becomes biased towards a certain type of individual after a certain
number of generations, and stops representing the inherent unbiased
distribution of the grammar.
The number of test cases used depends on the nature of the problem; on
the other hand, as each test case is run as a GPU thread, it is desirable that
the number of test cases is a multiple of 32 on any problem, as the finest
granularity for task scheduling on modern GPUs is a group of 32 threads, which
is called a warp. For a number of test cases that is not a multiple of 32, the
GPU transparently rounds the number up to the nearest multiple of 32 and
allocates cores accordingly, with some threads from the last warp working on
cores with output disabled. The number of test cases we used during the
experiments was 32 for the Search Problem, 64 for the regression of the
Keijzer-6 function and 1024 ($= 2^{5+5}$) for the 5-bit Binary Multiplier
Problem. For all experiments both the mutation and crossover rates were set to
0.7; these rates do not affect the compilation times.
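The warp rounding described above is a ceiling division by 32. The short sketch below shows how many warps a given number of test cases occupies and how many lanes of the last warp stay idle; the helper names are illustrative.

```python
WARP_SIZE = 32  # threads per warp, as stated in the text

def warps_needed(test_cases: int) -> int:
    """Number of warps occupied when each test case runs as one GPU thread."""
    return -(-test_cases // WARP_SIZE)  # ceiling division

def idle_lanes(test_cases: int) -> int:
    """Threads of the last warp that do no useful work."""
    return warps_needed(test_cases) * WARP_SIZE - test_cases

for count in (32, 64, 1024, 33):
    print(count, warps_needed(count), idle_lanes(count))
```

The three test case counts used in the experiments (32, 64 and 1024) leave no idle lanes, whereas a count such as 33 would occupy two warps and waste 31 lanes.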
Experiment Results
Conventional Compilation as Baseline
NVCC is the default compiler of the CUDA platform; it is distributed as a
command line application. In addition to compilation of CUDA C source code, it
performs tasks such as the separation of source code into host code and device
code, calling the underlying host compiler (GCC or the Visual C compiler) for
the host part of the source code, and linking the compiled host and device
object files.
Figure 1: Nvcc compilation times by population size. (a) Per individual
compile time (ms), (b) total compile time (sec), both plotted against
population size (#individuals) for the Search Problem, Keijzer-6 Regression
and 5-bit Multiplier.
Fig.1(a) shows that compilation times level out at 11.2 ms/individual for the
Search Problem, at 7.62 ms/individual for Keijzer-6 regression, and at 17.2
ms/individual for the 5-bit multiplier problem. It can be seen in Fig.1(b)
that, even though it is not obvious, the total compilation time does not
increase linearly, which is most observable in the trace of the 5-bit
multiplier problem. As Nvcc is a separate process, it isn’t possible to
measure the distribution of compilation time between source to PTX, PTX to
CUBIN, and all other setup work (i.e. process launch overhead, disk I/O);
therefore it is not possible to pinpoint the source of the nonlinearity in the
total compilation time.
The need for successive invocations of the Nvcc application, and all data
transfers being handled over disk files, are the main drawbacks of using Nvcc
in a real time⁵ context, which is the case in genetic programming. Even though
the repeated creation and teardown of the NVCC process most probably
guarantees that the application stays in the disk cache, it still prevents the
application from staying cached in the processor L1/L2 caches.
In-process Compilation
NVRTC is a runtime compilation library for CUDA C; it was first released as
part of v7 of the CUDA platform in 2015. NVRTC accepts CUDA source code and
compiles it to PTX in memory. The PTX string generated by NVRTC can be
further compiled to a device dependent CUBIN object file and loaded with the
CUDA Driver API, still without persisting it to a disk file. This provides
optimizations and performance not possible in off-line static compilation.
Without NVRTC, a separate process needs to be spawned for each compilation
to execute nvcc at runtime, which has a significant overhead. NVRTC
addresses these issues by providing a library interface that eliminates the
overhead of spawning separate processes and the extra disk I/O.
⁵ Not as in hard real time, but as in prolonged, successive and throughput
sensitive use.
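The difference between the two compilation styles can be felt without a GPU. The sketch below is only an analogy: it uses Python's built-in compile() as a stand-in for NVRTC and a freshly spawned Python interpreter as a stand-in for the nvcc command line process. It does not call nvcc or NVRTC, and the timings it prints say nothing about CUDA compile times; it only illustrates that process launch adds a fixed cost per compilation that an in-process call avoids.

```python
import subprocess
import sys
import time

# Stand-in "individual"; any small piece of source would do.
SOURCE = "result = sum(i * i for i in range(10))"

def compile_out_of_process(source: str) -> float:
    """Spawn a fresh interpreter that only compiles the source; return seconds."""
    child_code = "import sys; compile(sys.argv[1], '<individual>', 'exec')"
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", child_code, source], check=True)
    return time.perf_counter() - start

def compile_in_process(source: str) -> float:
    """Compile the same source inside the current process; return seconds."""
    start = time.perf_counter()
    compile(source, "<individual>", "exec")
    return time.perf_counter() - start

out_of_process = compile_out_of_process(SOURCE)
in_process = compile_in_process(SOURCE)
print(f"out of process: {out_of_process * 1000:.3f} ms, "
      f"in process: {in_process * 1000:.3f} ms")
```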
Figure 2: In-process and out of process compilation times by population size,
for Search Problem. (a) Per individual compile time (ms), (b) total population
compile time (sec), both plotted against population size (#individuals).
Figure 3: In-process and out of process compilation times by population size,
for Keijzer-6 Regression. (a) Per individual compile time (ms), (b) total
population compile time (sec), both plotted against population size
(#individuals).
On figures 2, 3 and 4 it can be seen that in-process compilation of
individuals not only provides reduced compilation times for all problems at
all population sizes, it also reaches the asymptotically optimal per
individual compilation time with much smaller populations. The fastest
compilation times achieved with in-process compilation are 4.14 ms/individual
for Keijzer-6 regression (at 300 individuals per population), 10.88
ms/individual for the 5-bit multiplier problem (at 100 individuals per
population⁶), and 6.89 ms/individual for the Search Problem (at 280
individuals per population⁷). The total compilation time speedups are measured
to be in the order of 261% to 176% for the K6 regression problem, 288% to 124%
for the 5-bit multiplier problem, and 272% to 143% for the Search Problem,
depending on population size (see Fig.5).
⁶ Compilation speed at 300 individuals per population is 13.29 ms/individual.
⁷ Compilation speed at 300 individuals per population is 7.76 ms/individual.
Figure 4: In-process and out of process compilation times by population size,
for 5-bit Multiplier. (a) Per individual compile time (ms), (b) total
population compile time (sec), both plotted against population size
(#individuals).
Figure 5: Compile time speedup ratios between conventional and in-process
compilation by problem, plotted against population size (#individuals) for the
Search Problem, K6 Regression and 5-bit Multiplier.
Parallelizing In-process Compilation
Infeasibility of parallelization with threads
The first approach that comes to mind for parallelizing in-process
compilation is to partition the individuals and spawn multiple threads that
compile each partition in parallel through the NVRTC library. Unfortunately it
turns out that the NVRTC library is not designed for multi-threaded use; we
noticed that when multiple compilation calls are made from different threads
at the same time, the execution is automatically serialized.
The stack trace in Fig.6 shows nvrtc64_80.dll calling the OS kernel’s
EnterCriticalSection function to block for exclusive execution of a code
block, and getting unblocked 853 ms later by another thread, which also runs a
block from the same library, via the release of the related lock. The pattern
of green blocks on three threads in addition to the main thread in Fig.6 shows
that the calls are perfectly serialized one after another, despite being made
at the same time, which is hinted at by the red synchronization blocks
preceding them.
Figure 6: NVRTC library serializes calls from multiple threads
Although NVRTC compiles CUDA source to PTX with a single call, the
presence of a compiler options setup function which affects the following
compilation call, and the use of critical sections at function entries,
suggest that this is a stateful API. Furthermore, unlike in the design of the
CUDA APIs, the mentioned state is most likely not stored in thread local
storage (TLS), but on the private heap of the dynamically loaded library. This
makes it impossible for us to trivially parallelize this closed source library
using threads, as moving the kept state to TLS requires source level
modifications.
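The effect of a library-wide critical section on threaded callers can be reproduced with a small simulation. In the sketch below a global lock stands in for the critical section observed in the NVRTC library and a sleep stands in for the compilation work; both are assumptions for illustration, and no NVRTC call is made. Four threads that start together still take roughly four times the duration of a single call.

```python
import threading
import time

# Stands in for the critical section the library enters on each call.
_library_lock = threading.Lock()

def compile_stub(duration: float) -> None:
    """Pretend to compile: hold the library-wide lock for the whole call."""
    with _library_lock:
        time.sleep(duration)

def timed_parallel_calls(threads: int, duration: float) -> float:
    """Start all threads at once and return the wall-clock time until all finish."""
    workers = [threading.Thread(target=compile_stub, args=(duration,))
               for _ in range(threads)]
    start = time.perf_counter()
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return time.perf_counter() - start

elapsed = timed_parallel_calls(threads=4, duration=0.05)
print(f"4 threads x 50 ms took {elapsed * 1000:.0f} ms in total")
```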
Parallelization with daemon processes
Therefore, as a second approach, we implemented a daemon process which stays
resident. It is launched from the command line with a unique ID as a command
line parameter to allow multiple instances. The daemon is launched as many
times as the desired level of parallelism, and each instance identifies itself
with the ID received as parameter. Each launched process registers two named
synchronization events with the operating system, for signaling the state
transitions of a simple state machine consisting of {starting, available,
processing} states which represent the state of that instance. The main
process also keeps a copy of the same state machine for each instance, to
track the states of the daemons. Thus both processes (main and daemon) keep a
consistent view of the mirrored state machine by monitoring the named events,
which allows state transitions to be performed in lock step. A state
transition can be initiated by either process; specifically (starting →
available) and (processing → available) are triggered by the daemon, and
(available → processing) is triggered by the main process.
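The three-state machine and the ownership of each transition can be written down compactly. The sketch below is a plain in-memory model with hypothetical class and function names; it does not involve processes or operating system events, and only encodes which side is allowed to trigger which transition according to the description above.

```python
from enum import Enum

class State(Enum):
    STARTING = "starting"
    AVAILABLE = "available"
    PROCESSING = "processing"

# Allowed transitions and the side that triggers each of them.
TRANSITIONS = {
    (State.STARTING, State.AVAILABLE): "daemon",
    (State.PROCESSING, State.AVAILABLE): "daemon",
    (State.AVAILABLE, State.PROCESSING): "main",
}

class DaemonStateMachine:
    """One instance per daemon; the main process keeps a mirrored copy."""

    def __init__(self) -> None:
        self.state = State.STARTING

    def transition(self, target: State, initiator: str) -> None:
        owner = TRANSITIONS.get((self.state, target))
        if owner is None:
            raise ValueError(f"no transition {self.state.value} -> {target.value}")
        if owner != initiator:
            raise ValueError(f"{initiator} may not trigger this transition")
        self.state = target

machine = DaemonStateMachine()
machine.transition(State.AVAILABLE, "daemon")   # daemon finished starting
machine.transition(State.PROCESSING, "main")    # main handed over a batch
machine.transition(State.AVAILABLE, "daemon")   # daemon finished compiling
print(machine.state.value)
```

Note that no transition leads back to the starting state, so a daemon that has become available can only alternate between available and processing.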
Figure 7: Sequence diagram for the creation of a compilation daemon process
and the related interprocess communication primitives. Participants: Main
Process, Compilation Daemon, OS. Steps in order:
  create synchronization events %ID%+"1" and %ID%+"2"
  launch process with command line parameter %ID%
  wait for event %ID%+"1"
  create process
  create named memory map "%MMAP%+%UID%"
  create view to memory map
  open synchronization event %UID%+"1" and %UID%+"2"
  signal event %UID%+"1"
  wait for event %UID%+"2"
  unblock as event %UID%+"1" signaled
The communication between the main process and the compilation daemons is
handled via shared views to memory maps. Each daemon registers a named
memory map and creates a memory view, to which the main process also creates
a view after the daemon signals the state transition from starting to
available (see Fig.7). The CUDA source is passed through this shared memory,
and the compiled device dependent CUBIN object file is returned through the
same memory. To signal the state transition (starting → available), the daemon
process signals the first event and starts waiting for the second event at the
same time. Once a daemon leaves the starting state, it never returns to it.
When the main process generates a new population to be compiled, it
partitions the individuals in a balanced way, such that the difference in the
number of individuals between any pair of partitions is never more than one.
Once the individuals are partitioned, the generated CUDA code for each
partition is passed to the daemon processes. Each daemon waits in the blocked
state until the main process wakes that specific daemon for a new batch of
source to compile, by signaling the second named event of that process (see
Fig.8). The main process signals all daemons asynchronously to start
compiling, then starts waiting for the completion of the daemon processes’
work. To prevent the UI thread of the main process from getting blocked too,
the main process maintains a separate thread for each daemon process it
communicates with; therefore, while waiting for the daemon processes to finish
their jobs, only those threads of the main process are blocked. The main
process signaling the second event, and the daemon process unblocking as a
result, corresponds to the state transition (available → processing).
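The balanced partitioning requirement (partition sizes differ by at most one) can be met with a single division with remainder. The sketch below is one straightforward way to do it and is not taken from the paper's framework; the function name is illustrative.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")

def partition_balanced(items: Sequence[T], parts: int) -> List[List[T]]:
    """Split items into `parts` contiguous partitions whose sizes differ by at most one."""
    if parts <= 0:
        raise ValueError("parts must be positive")
    base, extra = divmod(len(items), parts)
    result = []
    start = 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)  # the first `extra` partitions get one more
        result.append(list(items[start:start + size]))
        start += size
    return result

population = [f"individual_{i}" for i in range(300)]
print([len(p) for p in partition_balanced(population, 8)])
```

For a population of 300 individuals and 8 daemons this yields four partitions of 38 and four of 37 individuals.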
Figure 8: Sequence diagram for compilation on a daemon process and the related
interprocess communication. Participants: Main Process, Compilation Daemon,
OS. Steps in order:
  write CUDA code to shared memory
  signal event %ID%+"2"
  wait for event %ID%+"1"
  unblock as event %ID%+"2" is signaled
  read CUDA code from shared memory,
  compile CUDA code to PTX with NVRTC,
  compile PTX to CUBIN with Driver API,
  write CUBIN object to shared memory
  signal event %ID%+"1"
  wait for event %ID%+"2"
  unblock as event %ID%+"1" is signaled
  read CUBIN object from shared memory
When a daemon process enters the processing state, it reads the CUDA
source code from the shared view of the memory map related to its ID, and
compiles the code using the NVRTC library.
Once a daemon finishes compiling and writes the CUBIN object to shared
memory, it signals the first event to unblock the related thread in the main
process, and starts to wait for the second event once again. This signaling
and blocking pair corresponds to the state transition (processing →
available).
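The two-event handshake can be sketched with threads standing in for processes. This is a deliberate simplification for illustration: a thread plays the daemon, a Python attribute plays the shared memory view, threading.Event objects play the two named events, and fake_compile is a placeholder that performs no real compilation. The events here are reset manually after each wait, which is an implementation detail of this sketch.

```python
import threading
from typing import Tuple

class Channel:
    """Stand-in for one daemon's shared memory view and its two named events."""

    def __init__(self) -> None:
        self.buffer = b""                 # plays the role of the shared memory
        self.event1 = threading.Event()   # signaled by the daemon
        self.event2 = threading.Event()   # signaled by the main side

def fake_compile(source: bytes) -> bytes:
    """Placeholder for the NVRTC + Driver API steps; not a real compiler."""
    return b"CUBIN:" + source

def daemon_loop(ch: Channel, jobs: int) -> None:
    ch.event1.set()                       # starting -> available
    for _ in range(jobs):
        ch.event2.wait()                  # blocked until main hands over work
        ch.event2.clear()                 # available -> processing
        ch.buffer = fake_compile(ch.buffer)
        ch.event1.set()                   # processing -> available

def start_daemon(jobs: int) -> Tuple[Channel, threading.Thread]:
    ch = Channel()
    worker = threading.Thread(target=daemon_loop, args=(ch, jobs))
    worker.start()
    ch.event1.wait()                      # wait until the daemon is available
    ch.event1.clear()
    return ch, worker

def submit(ch: Channel, source: bytes) -> bytes:
    ch.buffer = source                    # write source to "shared memory"
    ch.event2.set()                       # wake the daemon
    ch.event1.wait()                      # block until the result is written
    ch.event1.clear()
    return ch.buffer                      # read the compiled object back

channel, worker = start_daemon(jobs=2)
print(submit(channel, b"kernel A"))
print(submit(channel, b"kernel B"))
worker.join()
```

Since each side only writes the buffer while the other side is blocked on an event, the buffer itself needs no additional lock in this scheme.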
Cost of Parallelization
The parallelization approach we propose is virtually overhead free when
compared to a hypothetical parallelization scenario using threads. As the
daemon processes are already resident and waiting in memory along with the
loaded NVRTC library, the overhead of both parallelization approaches is
limited to the time cost of memory moves from/to shared memory and
synchronization by named events⁸. The only difference between the two is that
in a context switch between threads of the same process the processor keeps
the Translation Lookaside Buffer (TLB), whereas in a context switch to another
process the TLB is flushed as the processor transitions to a new virtual
address space; we conjecture that the impact would be negligible.
⁸ On the Windows operating system, named events are the fastest IPC primitive,
upon which all others (i.e. mutex, semaphore) are implemented.
Table 1: Compilation times by compilation method for the Search Problem with
300 individuals
              Compilation time           Speedup ratio
Method        Per individual  Total      vs. in-process  vs. Nvcc
Nvcc          11.20 ms        3.36 sec   -               1.00
In-process     7.76 ms        2.33 sec   1.00            1.44
2 daemons      3.81 ms        1.14 sec   2.04            2.93
4 daemons      2.53 ms        0.76 sec   3.07            4.41
6 daemons      2.23 ms        0.67 sec   3.48            5.01
8 daemons      2.13 ms        0.64 sec   3.65            5.26
Table 2: Compilation times by compilation method for Keijzer-6 Regression
with 300 individuals
              Compilation time           Speedup ratio
Method        Per individual  Total      vs. in-process  vs. Nvcc
Nvcc           7.63 ms        2.29 sec   -               1.00
In-process     4.14 ms        1.24 sec   1.00            1.83
2 daemons      2.92 ms        0.88 sec   1.42            2.60
4 daemons      2.45 ms        0.73 sec   1.69            3.10
6 daemons      2.20 ms        0.66 sec   1.88            3.45
8 daemons      2.25 ms        0.67 sec   1.84            3.37
As for the memory cost, all modern operating systems recognize when an
executable binary or shared library gets loaded multiple times; the OS keeps a
single copy of the related memory pages in physical memory, and separately
maps those to the virtual address spaces of each process using them. This not
only saves physical RAM, but also allows better spatial locality for L2/L3
processor caches. Hence the memory consumption of multiple instances of our
daemon process, each loading the NVRTC library (nvrtc64_80.dll is almost 15
MB) into its own address space, is almost the same as the consumption of a
single instance.
Speedup Achieved with Parallel Compilation
At the end of each batch of experiments the main application dumps the
collected raw measurements to a file. We imported this data into Matlab,
filtered by experiment and measurement types, and aggregated the experiment
values for each population size to produce Tables 1, 2 and 3, and to create
Figures 9, 10, 11, 12, 13 and 14.
It can be seen that parallelized in-process compilation of genetic programming
individuals is faster for all problems and population sizes when compared
to in-process compilation without parallelization; furthermore, in-process
compilation without parallelization was itself shown to be faster than regular
command line nvcc compilation in the previous section.
Table 3: Compilation times by compilation method for the 5-bit Multiplier
Problem with 300 individuals
              Compilation time           Speedup ratio
Method        Per individual  Total      vs. in-process  vs. Nvcc
Nvcc          17.20 ms        5.16 sec   -               1.00
In-process    13.29 ms        3.99 sec   1.00            1.24
2 daemons      6.15 ms        1.85 sec   2.16            2.69
4 daemons      3.23 ms        0.97 sec   4.12            5.12
6 daemons      2.42 ms        0.73 sec   5.49            6.82
8 daemons      2.17 ms        0.65 sec   6.11            7.60
Parallel compilation brought the per individual compilation time to 2.17
ms/individual for 5-bit Multiplier, to 2.20 ms/individual for Keijzer-6 regres-
sion and to 2.13 milliseconds for the Search Problem; these are almost an order
of magnitude faster than previous published results. Also we measured a com-
pilation speedup of ×3.45 for regression problem, ×5.26 for search problem,
and ×7.60 for multiplication problem, when compared to the latest Nvcc V8
compiler, without requiring any code modification, and without any runtime
performance penalty.
Notice that our experiment platform consisted of dual Xeon E5-2670 processors
running at 2.6 GHz; for compute bound tasks, an increase in processor frequency
translates almost directly to a performance improvement at an equal rate^9.
Therefore we can conjecture that, to be able to compile a population of 300
individuals at sub-millisecond durations per individual, the required processor
frequency is around 2.6 × 2.13 = 5.54 GHz^10, which is currently available.
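The frequency estimate above can be reproduced with a short Python sketch. It
assumes, as the text does, that compilation time is inversely proportional to
processor frequency with all other things being equal; the function name
required_frequency_ghz is a hypothetical helper introduced only for this
illustration.

```python
def required_frequency_ghz(base_ghz, measured_ms, target_ms):
    """Frequency needed to reach target_ms per individual.

    Assumes compilation time scales as 1 / frequency (a compute bound task,
    all other things being equal), which is an idealization.
    """
    return base_ghz * measured_ms / target_ms


if __name__ == "__main__":
    # 2.6 GHz platform, 2.13 ms/individual (Search Problem, 8 daemons),
    # target of 1 ms/individual.
    print(f"{required_frequency_ghz(2.6, 2.13, 1.0):.2f} GHz")
```

The same helper can be used with the 2.17 or 2.20 ms/individual figures of the
other two problems to obtain slightly higher estimates.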
Conclusion
In this paper we present a new method to accelerate the compilation of genetic
programming individuals, in order to keep the compiled approach as a viable
option for genetic programming on GPU.
By using an in-process GPU compiler, we replaced disk file based data transfer
to and from the compiler with memory accesses, and we mitigated the overhead
of repeatedly launching and tearing down the command line compiler. We also
investigated ways to parallelize this method of compilation, and identified
that the in-process compilation function automatically serializes concurrent
calls from different threads. We implemented a daemon process that can have
multiple running instances and serve another application requesting CUDA code
compilation. The daemon processes use the same in-process compilation method
and communicate through the operating system's Inter Process Communication
primitives.

9 Assuming all other things being equal.
10 Once again, under the assumption of all other things being equal; 2.13 is
the compilation time of the Search Problem with 8 daemons.
We measured compilation times just above 2.1 ms/individual for all three
benchmark problems, and observed compilation speedups ranging from ×3.45 to
×7.60 depending on the problem, when compared to repeated command line
compilation with the latest Nvcc V8 compiler.
All data and source code of the software presented in this paper are available at
https://github.com/hayral/Parallel-and-in-process-compilation-of-individuals-for-genetic-programming-on-GPU
Acknowledgments
Dedicated to the memory of Professor Ahmet Coşkun Sönmez.
The first author was partially supported by Turkcell Academy.
Appendix
Search Problem
Grammar Listing
<expr>        ::= <expr2> <bi-op> <expr2> | <expr2>
<expr2>       ::= <int> | <var-read> | <var-indexed>
<var-read>    ::= tmp | i | OUTPUT | SEARCH
<var-indexed> ::= INPUT[<var-read> % LENINPUT]
<var-write>   ::= tmp | OUTPUT
<bi-op>       ::= + | -
<int>         ::= 1 | 2 | (-1)
<statement>   ::= <assignment> | <if> | <loop>
<statement2>  ::= <assignment> | <if2>
<statement3>  ::= <assignment>
<loop>        ::= for(i=0; i #lesser# LENINPUT; i++){<c-block2>}
<if>          ::= if (<cond-expr>) {<c-block2>}
<if2>         ::= if (<cond-expr>) {<c-block3>}
<cond-expr>   ::= <expr> <comp-op> <expr>
<comp-op>     ::= #lesser# | #greater# | == | !=
<assignment>  ::= <var-write> = <expr>;
<c-block>     ::= <statements>
<c-block2>    ::= <statements2>
<c-block3>    ::= <statements3>
<statements>  ::= <statement>
                | <statement><statement>
                | <statement><statement><statement>
<statements2> ::= <statement2>
                | <statement2><statement2>
                | <statement2><statement2><statement2>
<statements3> ::= <statement3>
                | <statement3><statement3>
                | <statement3><statement3><statement3>
Listing 1: Grammar for Search Problem
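To illustrate how a grammar such as Listing 1 turns a list of integers into
source text, the following Python sketch applies the usual grammatical
evolution mapping: the leftmost nonterminal is expanded with the production
selected by codon modulo the number of productions. This is a minimal sketch
under stated assumptions, not the mapping code of our C# framework: only a
subset of Listing 1 is included, and the name derive, the regular expression
and the codon values are hypothetical choices made for the example.

```python
import re

# Subset of the Search Problem grammar (Listing 1); terminals are kept verbatim.
GRAMMAR = {
    "<assignment>": ["<var-write> = <expr>;"],
    "<expr>": ["<expr2> <bi-op> <expr2>", "<expr2>"],
    "<expr2>": ["<int>", "<var-read>", "<var-indexed>"],
    "<var-read>": ["tmp", "i", "OUTPUT", "SEARCH"],
    "<var-indexed>": ["INPUT[<var-read> % LENINPUT]"],
    "<var-write>": ["tmp", "OUTPUT"],
    "<bi-op>": ["+", "-"],
    "<int>": ["1", "2", "(-1)"],
}

# Nonterminals look like <name>; this pattern is an assumption of the sketch.
NONTERMINAL = re.compile(r"<[a-z0-9-]+>")


def derive(start, codons, max_steps=100):
    """Expand the leftmost nonterminal until only terminals remain."""
    text = start
    used = 0  # number of codons consumed so far
    for _ in range(max_steps):
        match = NONTERMINAL.search(text)
        if match is None:
            return text
        choices = GRAMMAR[match.group(0)]
        if len(choices) == 1:
            pick = choices[0]  # no decision needed, no codon consumed
        else:
            pick = choices[codons[used % len(codons)] % len(choices)]
            used += 1
        text = text[:match.start()] + pick + text[match.end():]
    raise ValueError("derivation did not terminate within max_steps")


if __name__ == "__main__":
    print(derive("<assignment>", [1, 0, 2, 1, 1, 3, 4]))
```

With the codons shown, the derivation picks OUTPUT, the binary expression, the
indexed read with index i, the minus operator and the integer 2, in that
order.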
Code Preamble for Whole Population
// Test case data for the whole population, one entry per test case.
__constant__ int _INPUT[NUMTESTCASE][MAX_TESTCASE_LEN];
__constant__ int _LENINPUT[NUMTESTCASE];
__constant__ int _SEARCH[NUMTESTCASE];
__constant__ int CORRECTANSWER[NUMTESTCASE];
__global__ void createdFunc(int* _OUTPUT)
{
    // Each thread works on the test case selected by its thread index.
    int* INPUT = _INPUT[threadIdx.x];
    int LENINPUT = _LENINPUT[threadIdx.x];
    int SEARCH = _SEARCH[threadIdx.x];
    int i;
    int tmp;
    int OUTPUT;
Listing 2: Code preamble for whole population on Search Problem
Keijzer-6 Regression Problem
Grammar Listing
<e>  ::= <e2> + <e2> | <e2> - <e2> | <e2> * <e2> | <e2> / <e2>
       | sqrtf(fabsf(<e2>)) | sinf(<e2>) | tanhf(<e2>)
       | expf(<e2>) | logf(fabsf(<e2>)+1)
       | x | x | x | x
       | <c><c>.<c><c> | <c><c>.<c><c> | <c><c>.<c><c> | <c><c>.<c><c>
<e2> ::= <e3> + <e3> | <e3> - <e3> | <e3> * <e3> | <e3> / <e3>
       | sqrtf(fabsf(<e3>)) | sinf(<e3>) | tanhf(<e3>)
       | expf(<e3>) | logf(fabsf(<e3>)+1)
       | x | x | x | x | x | x
       | <c><c>.<c><c> | <c><c>.<c><c> | <c><c>.<c><c>
       | <c><c>.<c><c> | <c><c>.<c><c> | <c><c>.<c><c>
<e3> ::= <e3> + <e3> | <e3> - <e3> | <e3> * <e3> | <e3> / <e3>
       | sqrtf(fabsf(<e3>)) | sinf(<e3>) | tanhf(<e3>)
       | expf(<e3>) | logf(fabsf(<e3>)+1)
       | x | x | x | x | x | x | x | x
       | <c><c>.<c><c> | <c><c>.<c><c> | <c><c>.<c><c>
       | <c><c>.<c><c> | <c><c>.<c><c> | <c><c>.<c><c>
       | <c><c>.<c><c> | <c><c>.<c><c>
<c>  ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Listing 3: Grammar for Keijzer-6 Regression
5-bit Multiplier Problem
Grammar Listing
<start> ::= o0=<expr>;o1=<expr>;o2=<expr>;o3=<expr>;o4=<expr>;
            o5=<expr>;o6=<expr>;o7=<expr>;o8=<expr>;o9=<expr>;
<expr>  ::= (<expr2> <bi-op> <expr2>) | <var> | (~ <var>)
<expr2> ::= (<expr2> <bi-op> <expr2>) | <var> | (~ <var>)
          | <var> | (~ <var>)
<var>   ::= a0 | a1 | a2 | a3 | a4 | b0 | b1 | b2 | b3 | b4
<bi-op> ::= & | #or#
Listing 4: Grammar for 5-bit Multiplier Problem
Code Preamble for each Individual
__global__ void createdFunc0(int* OUTPUT)
{
    // Global thread index across all blocks.
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // Bits 0-4 of tid are the inputs a0..a4, bits 5-9 are the inputs b0..b4.
    int a0 = tid & 0x1;
    int a1 = (tid & 0x2) >> 1;
    int a2 = (tid & 0x4) >> 2;
    int a3 = (tid & 0x8) >> 3;
    int a4 = (tid & 0x10) >> 4;
    int b0 = (tid & 0x20) >> 5;
    int b1 = (tid & 0x40) >> 6;
    int b2 = (tid & 0x80) >> 7;
    int b3 = (tid & 0x100) >> 8;
    int b4 = (tid & 0x200) >> 9;
    int o0, o1, o2, o3, o4, o5, o6, o7, o8, o9;
Listing 5: Code preamble for 5-bit Multiplier Problem
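The bit decoding in Listing 5 can be mirrored on the host side to obtain the
outputs a correct multiplier should produce for a given thread index. The
Python sketch below is a reference model for illustration only. It assumes
that a0 and b0 are the least significant bits of the two operands and that o0
is the least significant bit of the 10-bit product; the function names are
hypothetical and do not appear in our framework.

```python
def decode_operands(tid):
    """Split a 10-bit thread index into bit lists a0..a4 and b0..b4."""
    a_bits = [(tid >> k) & 1 for k in range(5)]      # bits 0-4
    b_bits = [(tid >> (k + 5)) & 1 for k in range(5)]  # bits 5-9
    return a_bits, b_bits


def bits_to_int(bits):
    """Interpret a list of bits, least significant first, as an integer."""
    return sum(bit << k for k, bit in enumerate(bits))


def expected_outputs(tid):
    """Bits o0..o9 of the product a * b (assumed least significant first)."""
    a_bits, b_bits = decode_operands(tid)
    product = bits_to_int(a_bits) * bits_to_int(b_bits)
    return [(product >> k) & 1 for k in range(10)]


if __name__ == "__main__":
    tid = 3 | (31 << 5)  # a = 3, b = 31
    print(decode_operands(tid))
    print(expected_outputs(tid))
```

Since both operands are at most 31, the product is at most 961 and always fits
in the ten output bits.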
Compilation Time and Speedup Ratio Plots
[Plots: (a) Per individual compile time, time (ms) against population size
(#individuals); (b) Total compile time, time (sec) against population size
(#individuals). Series: inprocess; 2, 4, 6 and 8 service processes.]
Figure 9: Nvcc compilation times for Search Problem by number of servicing
resident processes
[Plots: (a) Speedup against conventional compilation; (b) Speedup against
in-process compilation; both show speed up ratio against population size
(#individuals). Series: inprocess; 2, 4, 6 and 8 service processes.]
Figure 10: Parallelization speedup on Search problem
[Plots: (a) Per individual compile time, time (ms) against population size
(#individuals); (b) Total compile time, time (sec) against population size
(#individuals). Series: inprocess; 2, 4, 6 and 8 service processes.]
Figure 11: Nvcc compilation times for Keijzer-6 regression by number of
servicing resident processes
[Plots: (a) Speedup against conventional compilation; (b) Speedup against
in-process compilation; both show speed up ratio against population size
(#individuals). Series: inprocess; 2, 4, 6 and 8 service processes.]
Figure 12: Parallelization speedup on Keijzer-6 regression
[Plots: (a) Per individual compile time, time (ms) against population size
(#individuals); (b) Total compile time, time (sec) against population size
(#individuals). Series: inprocess; 2, 4, 6 and 8 service processes.]
Figure 13: Nvcc compilation times for 5-bit Multiplier by number of servicing
resident processes
[Plots: (a) Speedup against conventional compilation; (b) Speedup against
in-process compilation; both show speed up ratio against population size
(#individuals). Series: inprocess; 2, 4, 6 and 8 service processes.]
Figure 14: Parallelization speedup on 5-Bit multiplier
References
[1] Hakan Ayral and Songül Albayrak. Effects of Population, Generation and
Test Case Count on Grammatical Genetic Programming for Integer Lists.
In press.
[2] D M Chitty. A data parallel approach to genetic programming using pro-
grammable graphics hardware. Proc. of the Conference on Genetic and
Evolutionary Computation (GECCO), 2:1566–1573, 2007.
[3] Cleomar Pereira da Silva, Douglas Mota Dias, Cristiana Bentes, Marco
Aurélio Cavalcanti Pacheco, and Leandro Fontoura Cupertino. Evolving
GPU machine code. The Journal of Machine Learning Research, 16(1):673–
712, 2015.
[4] David Fagan, Michael Fenton, and Michael O’Neill. Exploring Position
Independent Initialisation in Grammatical Evolution. Proceedings of 2016
IEEE Congress on Evolutionary Computation (CEC 2016), pages 5060–
5067, 2016.
[5] Simon Harding and Wolfgang Banzhaf. Fast Genetic Programming on
GPUs. In Proceedings of the 10th European Conference on Genetic Pro-
gramming, volume 4445, pages 90–101. Springer, 2007.
[6] Simon L. Harding and Wolfgang Banzhaf. Distributed genetic pro-
gramming on GPUs using CUDA. In Workshop on Parallel Architectures
and Bioinspired Algorithms, pages 1–10, 2009.
[7] Thomas Helmuth and Lee Spector. General Program Synthesis Benchmark
Suite. In Proceedings of the 2015 on Genetic and Evolutionary Computation
Conference - GECCO ’15, pages 1039–1046, New York, New York, USA,
2015. ACM Press.
[8] Maarten Keijzer. Improving Symbolic Regression with Interval Arithmetic
and Linear Scaling. Genetic Programming Proceedings of EuroGP2003,
2610:70–82, 2003.
[9] W. B. Langdon and M. Harman. Evolving a CUDA kernel from an nVidia
template. IEEE Congress on Evolutionary Computation, pages 1–8, jul
2010.
[10] WB Langdon and M Harman. Evolving gzip matches Kernel from an nVidia
CUDA Template. Technical Report February, 2010.
[11] William B. Langdon. Graphics processing units and genetic programming:
an overview. Soft Computing - A Fusion of Foundations, Methodologies
and Applications, 15(8):1657–1669, mar 2011.
[12] Tony E. Lewis and George D. Magoulas. Identifying similarities in TMBL
programs with alignment to quicken their compilation for GPUs. Proceed-
ings of the 13th annual conference companion on Genetic and evolutionary
computation - GECCO ’11, page 447, 2011.
[13] Miguel Nicolau and Michael Fenton. Managing Repetition in Grammar-
Based Genetic Programming. Proceedings of the 2016 on Genetic and Evo-
lutionary Computation Conference - GECCO ’16, pages 765–772, 2016.
[14] M O’Neill, M Nicolau, and A Agapitos. Experiments in program synthesis
with grammatical evolution: A focus on Integer Sorting. In Evolutionary
Computation (CEC), 2014 IEEE Congress on, pages 1504–1511, 2014.
[15] Petr Pospichal, Eoin Murphy, Michael O’Neill, Josef Schwarz, and Jiri
Jaros. Acceleration of Grammatical Evolution Using Graphics Processing
Units. Proceedings of the 13th annual conference companion on Genetic
and evolutionary computation - GECCO ’11, pages 431–438, 2011.
[16] David R. White, James McDermott, Mauro Castelli, Luca Manzoni,
Brian W. Goldman, Gabriel Kronberger, Wojciech Jaśkowski, Una-May
O’Reilly, and Sean Luke. Better GP benchmarks: community survey results
and proposals. Genetic Programming and Evolvable Machines, 14(1):3–29,
dec 2012.
