CoreTSAR: Task Scheduling for Accelerator-aware Runtimes by Scogland, Thomas R. W. et al.
CoreTSAR: Task Scheduling for Accelerator-aware Runtimes
Thomas R. W. Scogland Wu-chun Feng
Department of Computer Science, Virginia Tech,
Blacksburg, VA 24060 USA
{tom.scogland,wfeng}@vt.edu
Barry Rountree Bronis R. de Supinski
Center for Applied Scientiﬁc Computing, Lawrence
Livermore National Laboratory,
Livermore, CA 94551 USA
{rountree,bronis}@llnl.gov
Abstract
Heterogeneous supercomputers that incorporate computational ac-
celerators such as GPUs are increasingly popular due to their high
peak performance, energy efﬁciency and comparatively low cost.
Unfortunately, the programming models and frameworks designed
to extract performance from all computational units still lack the
ﬂexibility of their CPU-only counterparts. Accelerated OpenMP
improves this situation by supporting natural migration of OpenMP
code from CPUs to a GPU. However, these implementations cur-
rently lose one of OpenMP’s best features, its ﬂexibility: typical
OpenMP applications can run on any number of CPUs. GPU implementations
do not transparently employ multiple GPUs on a node
or a mix of GPUs and CPUs. To address these shortcomings, we
present CoreTSAR, our runtime library for dynamically schedul-
ing tasks across heterogeneous resources, and propose straightfor-
ward extensions that incorporate this functionality into Accelerated
OpenMP. We show that our approach can provide nearly linear
speedup to four GPUs over only using CPUs or one GPU while
increasing the overall ﬂexibility of Accelerated OpenMP.
1. Introduction
While heterogeneous, especially GPU-accelerated, large-scale sys-
tems are becoming more popular, their programming models deter
many potential users. Unlike adding more or faster CPUs, which at
least work without code changes, programs must be explicitly updated
to use GPUs, frequently with unfamiliar programming models
and syntax. Rather than grapple with the issue, users often simply
run their legacy CPU-only code on accelerated resources, which
can leave a signiﬁcant portion of the computing resources idle. A
more familiar and consistent programming model that handles both
accelerators and CPUs efﬁciently would realize signiﬁcantly more
beneﬁts of heterogeneous systems.
Accelerated OpenMP [6] offers the desired programming model
with familiar OpenMP-style syntax. This syntax will facilitate the
adoption of accelerators in scientiﬁc computing, especially for
legacy OpenMP applications, which can be ported to the GPU with
native syntax. However, Accelerated OpenMP is not a panacea:
the model’s current design helps one move their computation to a
single GPU with straightforward adjustments to OpenMP source
code. The extensions do not support transparent use of more than
one GPU. They also do not support use of both CPU and GPU re-
sources within the same parallel region. Thus, users must manually
split work between available resources and manage the complex
rules that ensure OpenMP and Accelerated OpenMP cooperate
safely.
Our work creates a scheduling system, along with a set of pro-
posed Accelerated OpenMP clauses, that allow users to exploit
all computational resources through a consistent interface. With
our approach, a user does not have to divide their problem man-
ually across devices. Our system, the CoreTSAR (Task Scheduler
for Accelerator-aware Runtimes) library, automates the schedul-
ing, load balancing, and cross-device data management. This paper
presents our implementation of CoreTSAR as a library for use with
Accelerated OpenMP or any accelerator-aware runtime as well as
a proposal to integrate its functionality into Accelerated OpenMP.
CoreTSAR has been designed so the compiler/runtime could trans-
parently include it. Speciﬁcally we make these contributions:
• Extensions to Accelerated OpenMP to support low-overhead
stable co-scheduling of parallel loop regions across an arbitrary
number of CPUs and GPUs without code replication;
• An implementation of our extensions on top of the OpenMP
runtime which, thus, is applicable to any Accelerated OpenMP
implementation (our evaluation uses PGI Accelerator);
• An evaluation which demonstrates that our multiple GPU ex-
tensions signiﬁcantly improve performance over only using one
GPU or using CPU cores and one GPU, and that the choice of
scheduler is application-centric and, thus, portable.
The paper is composed as follows. Section 2 offers background
on GPUs, GPU programming models, and the extensions made
to OpenMP to support these architectures. Section 3 describes
the design of CoreTSAR including our task management concept,
scheduling mechanisms, and memory management. Details on our
implementation follow in Section 4. We present results in Section 5.
2. Background
This section discusses GPU programming in general, related termi-
nology, and Accelerated OpenMP.
2.1 GPU Programming Models
GPUs are the most commonly available computational accelerator.
We use the terminology from NVIDIA’s CUDA architecture to describe
their design and components. An NVIDIA Tesla C2070 GPU
contains 14 multiprocessors, each of which is a 32-wide SIMD processor.
CUDA supports using these as Multiple Instruction, Multiple
Data (MIMD) units by serializing conflicting instructions.
The memory model is the most important difference between
programming a GPU and a set of CPUs. While GPUs are node-local
resources, they use a separate memory hierarchy that consists
of a small scratch memory local to each multiprocessor called
shared memory, L1 and L2 caches, and global memory, which is
memory that all multiprocessors can access. Unlike system main
memory, cache coherence of global memory is not maintained except
for atomic instructions. This split between CPU and GPU accessible
resources complicates the inclusion of GPUs in traditional
shared memory programming models such as OpenMP.

#pragma omp parallel for \
        shared(in1,in2,out,pow)
for (i = 0; i < end; i++) {
  out[i] = in1[i]*in2[i];
  pow[i] = pow[i]*pow[i];
}

#pragma acc region for \
        copyin(in1[0:end],in2[0:end]) \
        copyout(out[0:end]) \
        copy(pow[0:end])
/* hetero(<cond>[,<scheduler>[,<ratio>\
   [,<div>][,<devices>]]]) \
   partial(out,pow)*/ /* proposed extension */
for (i = 0; i < end; i++) {
  out[i] = in1[i]*in2[i];
  pow[i] = pow[i]*pow[i];
}

Figure 1: OpenMP (top) and Accelerated OpenMP (bottom).
2.2 OpenMP Accelerator Directives
Current options for OpenMP-like accelerator directives include
HMPP [12], OpenACC [7], OpenMP for accelerators [6], and PGI
Accelerator directives [27]. Our work refers to an OpenMP Lan-
guage Committee proposal that combines and extends these op-
tions as Accelerated OpenMP. OpenACC is directly based on that
proposal, and our statements are equally applicable to it. In this pa-
per, we use the PGI Accelerator directives implementation for our
examples and evaluation.
Unlike standard OpenMP, accelerator directives must support
distributed memory. Figure 1 presents a simple example that il-
lustrates the necessary modiﬁcations to an OpenMP program. The
OpenMP version parallelizes the loop so that all threads share the
input and output arrays. The accelerator version is similar, but
more speciﬁc about memory movement. First, instead of simply
listing the arrays as shared, each array is marked as copyin(),
copyout() or copy(), which causes the runtime to allocate mem-
ory on a GPU and copy the CPU array into the device but not
out, out but not in, or both in and out, respectively. The sub-
scripts in these clauses support array shaping using the form
array[<start>:<end>:<stride>], which deﬁnes the array’s
size (which the compiler may not be able to infer), and can request
that only a subset of the array be copied into or out of a region.
Figure 1 includes a comment that shows our proposed interface
for invoking CoreTSAR. The hetero() clause speciﬁes that the
region should use CoreTSAR for coscheduling. We discuss its ar-
guments in Section 3. We also propose another clause, partial(),
to specify that only parts of certain arrays need to be copied to exe-
cute a given loop iteration, which allows the compiler and runtime
to copy only the data that the scheduled work needs.
3. Design
First and foremost, CoreTSAR is a (co)scheduling entity that takes
in a list of related tasks and distributes them across a set of (hetero-
geneous) resources. CoreTSAR is not a programming model, nor
does it parallelize serial code. It does extend an existing program-
ming model. One could use CoreTSAR with raw CUDA and C, or
any combination of programming models, and in principle for any
combination of devices. Our case study uses Accelerated OpenMP,
speciﬁcally the PGI Accelerator implementation, to compile code
for GPUs and OpenMP to generate threaded CPU code. Our exten-
sions to Accelerated OpenMP would allow CoreTSAR to become
a completely transparent part of the model. In order to schedule
across devices that may, or may not, share memory, we also in-
clude a memory management interface that allows CoreTSAR to
optimize data transfers. Speciﬁcally, it reduces transfer times by
allowing data to persist across regions and copying only those el-
ements currently necessary to complete the work assigned to the
target device.
3.1 Assigning Tasks
Evenly dividing homogeneous iterations across homogeneous re-
sources, as with the OpenMP static schedule, yields high perfor-
mance. Since iterations in programs can vary in runtime, OpenMP
supports additional types of scheduling (dynamic and guided) to
improve load balancing. These schedules for heterogeneous itera-
tions still target homogeneous resources so they work well on ho-
mogeneous, shared-memory resources with low concurrency con-
trol costs. However, they are less appropriate for heterogeneous
resources due to varying costs and synchronization requirements.
Since CoreTSAR targets heterogeneous resources with distributed
memories, we provide different schedules.
Our adaptive scheduler assigns iterations at the boundaries of
parallel regions, or sub-regions, and re-evaluates at their ends.
This approach reduces locking overhead but does not balance load
dynamically like OpenMP dynamic or guided. To provide balanced
schedules, CoreTSAR predicts the time to compute an iteration on
each resource in the next pass based on previous executions.
We base our schedules around assigning each device enough
work to ﬁnish in the same amount of time as all others. CoreTSAR
tracks the average time to complete an iteration on any given
device, which it uses to predict the amount of work that each device
can complete in the next pass. For example, assuming we have a
system with two CPU cores and one GPU, with one CPU core
controlling the GPU, if the CPU core completes 10 iterations in
the same time that the GPU takes to copy in data, to complete 40
iterations, and to copy back the results, then the CPU should be
assigned 20% of the iterations in the following pass. Alternatively,
if those numbers are reversed, the CPU would be assigned 80%.
We thus determine the relationship between compute units and can
compute the amount of work to provide each device to balance their
loads. However, we must extend this simple approach to more than
two devices and choose an initial split.
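The arithmetic in this example can be sketched in C. This is only an illustration under the assumption that a device's throughput is the reciprocal of its measured time per iteration; the function name share_for_device and its parameters are hypothetical and are not part of the CoreTSAR API.

```c
#include <stddef.h>

/* Hypothetical helper (not part of CoreTSAR): fraction of the iterations
 * that device `dev` should receive so that all devices finish together,
 * given each device's measured time per iteration.
 * Returns -1.0 on invalid input. */
double share_for_device(const double *time_per_iter, size_t n, size_t dev)
{
    double total_rate = 0.0;

    if (time_per_iter == NULL || dev >= n)
        return -1.0;
    for (size_t j = 0; j < n; j++) {
        if (time_per_iter[j] <= 0.0)
            return -1.0;
        total_rate += 1.0 / time_per_iter[j]; /* iterations per unit time */
    }
    return (1.0 / time_per_iter[dev]) / total_rate;
}
```

With a CPU core that completes 10 iterations and a GPU that completes 40 in the same interval, the times per iteration are 1/10 and 1/40 of that interval, and the CPU share evaluates to 10/(10+40) = 20%.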
3.2 Applying Ratios
We use a linear program to extend our approach to arbitrary device
counts. The linear program computes the iterations to assign to each
device based on their time per iteration. Figure 2 lists its variables
(Equation 1), objective function (Equation 2) and accompanying
constraints (Equations 3-6). The program minimizes the difference
between the predicted runtime for each device, such that all should
ﬁnish at the same time. We assume that performance of an average
iteration does not change across instances of the region. The time
for a device to ﬁnish its work in the next pass equals the time
per iteration from the previous pass multiplied by the number
of iterations that it is assigned. In practice this assumption holds
well: although iterations may have different computational cost,
the same iteration across two passes often has similar performance,
yielding predictions accurate to within a few percent for our test applications.
\[
\begin{aligned}
I &= \text{total iterations available} \\
i_j &= \text{iterations for compute unit } j \\
f_j &= \text{fraction of iterations for compute unit } j \\
p_j &= \text{recent time/iteration for compute unit } j \\
n &= \text{number of compute devices} \\
t_j^{+} \ (\text{or } t_j^{-}) &= \text{time over (or under) equal}
\end{aligned}
\tag{1}
\]
\[ \min \sum_{j=1}^{n-1} \left( t_j^{+} + t_j^{-} \right) \tag{2} \]
\[ \sum_{j=1}^{n} i_j = I \tag{3} \]
\[ i_2 p_2 - i_1 p_1 = t_1^{+} - t_1^{-} \tag{4} \]
\[ i_3 p_3 - i_1 p_1 = t_2^{+} - t_2^{-} \tag{5} \]
\[ \vdots \]
\[ i_n p_n - i_1 p_1 = t_{n-1}^{+} - t_{n-1}^{-} \tag{6} \]
Figure 2: Linear program variables, objective and constraints
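To make the objective concrete, the following C sketch evaluates it for a candidate assignment. It is not the solver itself. It assumes that, for each device after the first, at most one of the over and under terms is nonzero, so their sum equals the absolute difference in predicted runtime relative to device 1. The function name imbalance is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: value of the objective for a candidate
 * assignment. Arrays are 0-based, so index 0 holds device 1 of the
 * formulation, which serves as the reference device. */
double imbalance(const double *iters, const double *time_per_iter, size_t n)
{
    double total = 0.0;

    if (n == 0)
        return 0.0;
    double ref = iters[0] * time_per_iter[0]; /* predicted time of device 1 */
    for (size_t j = 1; j < n; j++) {
        double diff = iters[j] * time_per_iter[j] - ref;
        double t_plus  = diff > 0.0 ? diff : 0.0;   /* time over  */
        double t_minus = diff < 0.0 ? -diff : 0.0;  /* time under */
        total += t_plus + t_minus;
    }
    return total;
}
```

An assignment in which every device has the same predicted runtime yields zero, which is the minimum the linear program seeks.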
3.3 Static Scheduling
Our static schedule uses the linear program to assign iterations
based on either an input value or our computed default division.
To increase portability, we compute a default division at runtime
rather than using a precomputed static value. We assume that one
instruction cycle on any GPU takes the same time as four on a CPU,
which models a CPU with a SIMD unit and higher clock frequency.
While this assumption does not hold in general, we can portably
compute an initial time per iteration for each device. We compute
the time per iteration for a GPU as $t_{gi} = 1/c_g/4$ and for CPU
cores as $1 - t_{gi}$ (where $c_g$ is the number of multiprocessors on a
GPU; with multiple distinct GPUs, the largest value is used
for the CPU cores), which scales based on the compute resources
available. For applications that are not dominated by ﬂoating-point
computation, we have considered models that include several other
factors, including memory bandwidth and integer performance.
However, these models have proven less successful; the selection
of their appropriate static model is future work.
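The default estimate can be written directly from the two formulas above. The function names below are hypothetical, and the sketch assumes a positive multiprocessor count.

```c
/* Hypothetical helpers implementing the default estimate from the text.
 * `multiprocessors` must be positive. */
double default_gpu_time_per_iter(int multiprocessors)
{
    /* t_gi = 1 / c_g / 4 */
    return 1.0 / (double)multiprocessors / 4.0;
}

double default_cpu_time_per_iter(int largest_gpu_multiprocessors)
{
    /* 1 - t_gi, using the largest c_g when GPUs differ */
    return 1.0 - default_gpu_time_per_iter(largest_gpu_multiprocessors);
}
```

For a GPU with 14 multiprocessors, such as the Tesla C2070 described in Section 2.1, this gives roughly 0.018 for the GPU and 0.982 for each CPU core.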
3.4 Adaptive Scheduling
Our adaptive schedules (Adaptive, Split and Quick) use the static
schedule as an initial training step. From the time that each device
takes to complete its iterations in this static pass, we use our linear
program to compute the appropriate division of work for subse-
quent passes. Our design intentionally includes all recurring data
transfer and similar overheads required to execute an iteration on
a particular device. Thus, we incorporate those overheads into the
cost of the iteration and naturally account for them. The Adaptive
schedule trains on the ﬁrst instance of the region and then each
subsequent instance. The Split schedule accommodates regions that
may only run once or that may beneﬁt from scheduling more often.
It breaks each region instance into several evenly split sub-regions,
based on the div input. Each time a sub-region completes, we use
the linear program to split the next. This schedule can provide better
load balance at the cost of increased scheduling and kernel launch
overhead. Thus, it is impractical for short regions and overhead-sensitive
applications. The Quick schedule balances between the Split
and Adaptive schedules by executing a small sub-region for its ﬁrst
training phase, similarly to Split. It then immediately schedules all
remaining iterations of the ﬁrst region instance and uses the Adap-
tive schedule for any subsequent instances. This schedule suits ap-
plications that cannot tolerate a full instance using the static sched-
ule or the overhead of extra scheduling steps in every pass.
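As an illustration of how the Split schedule might divide a region instance, the following C sketch computes the bounds of each evenly split sub-region from the div input. How leftover iterations are handled is our assumption here (the first sub-regions each take one extra), and the function name subregion_bounds is hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: bounds [*start, *end) of sub-region k when
 * `total` iterations are divided into `div` near-equal sub-regions.
 * Assumption: the first (total % div) sub-regions get one extra
 * iteration. Returns 0 on success, -1 on invalid input. */
int subregion_bounds(size_t total, size_t div, size_t k,
                     size_t *start, size_t *end)
{
    if (div == 0 || k >= div || start == NULL || end == NULL)
        return -1;

    size_t base  = total / div;
    size_t extra = total % div;

    *start = k * base + (k < extra ? k : extra);
    *end   = *start + base + (k < extra ? 1 : 0);
    return 0;
}
```

After each sub-region completes, the measured times per iteration would feed the linear program to divide the next sub-region among devices.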
3.5 Memory Management
A key to efﬁcient region execution across multiple memory spaces
is accurately determining the minimum amount of data movement
necessary to complete the computation. CoreTSAR is designed to
reduce the data movement requirements of applications over vanilla
Accelerated OpenMP in two key ways. First, users can specify
that a data block should be persistently allocated on a device
and remain resident until it is explicitly removed. The Accelerated
OpenMP directives currently do not offer this; instead, any given
block of data is associated with a region, explicitly or implicitly,
and in the case of PGI Accelerator and OpenACC can only be used
on more than one GPU by having multiple threads running at the
point where that region begins. As a result of these limitations, a
user may be forced to repeatedly copy data, which is known to be
static, to avoid major code restructuring. The addition of persistent
copies removes this potential obstacle. Second, we allow the user
to specify the data that each task requires individually as a part of
the input and output data blocks. We have provided a simple mem-
ory management interface that takes the start of an array or matrix,
the size of the element that each task will use, and the number of
those elements each task should receive. Combining that with the
iteration variable produces the necessary data for a given task. This
approach supports simple determination of the regions necessary
for runs of successive tasks. While this simple pattern clearly will
not work for all applications, we have found that many applica-
tions can use this interface without modiﬁcation. Either it naturally
associates tasks to data, which is common since it is best for cache-
locality, or tasks require the entire data region, as with random or
unpredictable accesses. Future work will develop a more compre-
hensive speciﬁcation methodology.
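The mapping from a run of tasks to the data it needs can be sketched as follows. The structure and function names are hypothetical and only mirror the three inputs listed above (array start, element size, elements per task); they are not the CoreTSAR interface.

```c
#include <stddef.h>

/* Hypothetical descriptor mirroring the inputs described in the text. */
typedef struct {
    char  *base;            /* start of the host array or matrix     */
    size_t item_size;       /* bytes in the element each task uses   */
    size_t items_per_task;  /* number of those elements per task     */
} partial_region;

/* Returns a pointer to the first byte needed by the run of tasks
 * [first, last) and stores the number of bytes in *length. */
char *partial_range(const partial_region *r, size_t first, size_t last,
                    size_t *length)
{
    size_t bytes_per_task = r->item_size * r->items_per_task;
    size_t tasks = last > first ? last - first : 0;

    *length = tasks * bytes_per_task;
    return r->base + first * bytes_per_task;
}
```

Because successive tasks map to contiguous memory, a single offset and length suffice for any run of iterations assigned to one device.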
4. Implementation
We implement CoreTSAR as a library on top of Accelerated
OpenMP. We have tested it with PGI Accelerator as well as Cray’s
Accelerated OpenMP prototype. Our evaluation in this paper fo-
cuses on PGI Accelerator, so we use its directive format in our
examples. This section discusses our implementation including its
portability, its API and our automatic memory manager as well as
some necessary deviations from the abstract design discussed in
Section 3.
4.1 Shifts from Design
Several issues lead to pragmatic shifts from our high-level design.
The underlying Accelerated OpenMP implementations assume all
threads of a team participate in barriers. Some applications do
not suit GPU computing while others have ﬁne-grained parallel
regions. We now discuss how we accommodate these issues.
4.1.1 Scheduler Overhead
Since our schedules repeatedly solve a linear program, that over-
head could be an issue, especially for applications with many short
regions. We use the lp_solve [5] library, a highly optimized linear
program solver that can use the previous solution’s tableau as a par-
tial result. This incremental approach greatly reduces overhead.
Figure 3 represents the time spent in CoreTSAR over 1,900
passes and 19,000 scheduling iterations with the split scheduler.
Unfortunately, we ﬁnd that our original linear model has expo-
nential time complexity as the number of devices increases. In
the worst case the split schedule with four GPUs takes nearly 3×
longer than the 40-second compute phase (over 90% in the lp_solve
routine).

[Figure 3 plot. Panels: Adaptive, Split. x-axis: Number of GPUs (1 to 4). y-axis: Time in CoreTSAR (seconds). Series: Optimized and Original implementations.]
Figure 3: Time spent in CoreTSAR during 1,900 passes.

\[ \min \sum_{j=1}^{n-1} \left( t_j^{+} + t_j^{-} \right) \tag{7} \]
\[ \sum_{j=1}^{n} f_j = 1 \tag{8} \]
\[ f_2 p_2 - f_1 p_1 = t_1^{+} - t_1^{-} \tag{9} \]
\[ f_3 p_3 - f_1 p_1 = t_2^{+} - t_2^{-} \tag{10} \]
\[ \vdots \]
\[ f_n p_n - f_1 p_1 = t_{n-1}^{+} - t_{n-1}^{-} \tag{11} \]
Figure 4: Modified objective and constraints
Two issues reduce the solver’s performance. First, its input has
widely distributed values, which leads to poor numerical stability
that slows convergence and frequent ﬂoating-point error correc-
tions. Second, about half of the matrix values are integers, which
signiﬁcantly increases computational complexity.
To alleviate these issues, we remove integer output requirements
and keep values near the range of 0 to 1. Figure 4 shows the new
objective function (Equation 7) and constraints (Equations 8-11).
We now compute the floating-point fraction of total iterations $f_j$,
all of which must sum to 1. These changes produce the optimized
results that Figure 3 shows. Thus, the time in CoreTSAR can
actually decrease as the number of GPUs increases because of GPU
performance consistency. Despite the larger matrix, the solution
converges faster since it deviates less from the previous solution.
The runtime multiplies each fj by I and assigns the nearest
lower integer. This approximation can fail to assign up to n itera-
tions optimally. We assign these iterations round-robin to devices
so each device is at most 1 iteration above or below optimal. Since
most applications execute thousands of iterations, the small devia-
tion is within the error threshold of our measurements.
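The conversion from fractions to integer iteration counts can be sketched as below. The function name is hypothetical, and the sketch assumes non-negative fractions that sum to 1.

```c
#include <stddef.h>

/* Hypothetical helper: turn per-device fractions into integer iteration
 * counts. Each device first receives floor(f_j * I); leftover iterations
 * are then handed out round-robin starting from device 0.
 * Assumes non-negative fractions that sum to 1. */
void fractions_to_iterations(const double *frac, size_t n, size_t total,
                             size_t *iters)
{
    size_t assigned = 0;

    if (n == 0)
        return;
    for (size_t j = 0; j < n; j++) {
        /* truncation equals floor for non-negative values */
        iters[j] = (size_t)(frac[j] * (double)total);
        assigned += iters[j];
    }
    for (size_t j = 0; assigned < total; j = (j + 1) % n) {
        iters[j]++;
        assigned++;
    }
}
```

For fractions 0.25, 0.25 and 0.5 over 10 iterations, the floors are 2, 2 and 5, and the single leftover iteration goes to the first device.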
4.1.2 GPU-Averse Applications
CoreTSAR handles applications that run more efﬁciently using one
CPU core than using that core to control a GPU. Some applications
cannot run a GPU pass quickly enough to match the CPU execution
time regardless of the number of iterations. We can identify when
a given device cannot match the performance of other devices and
take action since we track the time per iteration of each device.
ctsar * ctsar_init(int size, ctsar_type sched,
                   ctsar_dev_type allowed_devs,
                   double *rat, int *div);
ctsar * ctsar_next(ctsar * c, int size);
int ctsar_loop(ctsar * c);
void * ctsar_reg_mem(ctsar * c, void *ptr,
                     size_t item_size, size_t count,
                     ctsar_mem flags);
void ctsar_unreg_mem(ctsar * c, void *ptr);
void ctsar_swap_mem(ctsar * c, void *ptr1, void *ptr2);

typedef struct ctsar{
  ctsar_type type;        //scheduler type in use
  ctsar_device devices[]; //devices by thread id
  void *internal;         //opaque internal values
}ctsar;

typedef struct ctsar_device{
  ctsar_dev_type type;    //type of device
  size_t start;           //start of device iterations
  size_t end;             //end of device iterations
  void *internal;         //opaque internal values
}ctsar_device;
Figure 5: CoreTSAR API
Thus, CoreTSAR converts a GPU ofﬂoad thread into a CPU thread
when the GPU has a higher time per iteration than the slowest CPU
core for a conﬁgurable number (default is two) of iterations. We
discuss the effects of this extension further in Section 5.
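The conversion rule can be sketched as a small state machine. The names below are hypothetical, and the sketch assumes that the slow measurements must be consecutive, which the description above does not state explicitly.

```c
#include <stddef.h>

/* Hypothetical state for one GPU offload thread. */
typedef struct {
    int slow_passes; /* consecutive measurements in which the GPU was slower */
    int threshold;   /* measurements tolerated before converting (default two) */
} gpu_averse_state;

/* Returns 1 when the GPU offload thread should become a CPU thread,
 * 0 otherwise. Assumption: the counter resets whenever the GPU is not
 * slower than the slowest CPU core. */
int gpu_should_convert(gpu_averse_state *s, double gpu_time_per_iter,
                       const double *cpu_time_per_iter, size_t n_cpu)
{
    double slowest_cpu = 0.0;

    for (size_t j = 0; j < n_cpu; j++)
        if (cpu_time_per_iter[j] > slowest_cpu)
            slowest_cpu = cpu_time_per_iter[j];

    if (n_cpu > 0 && gpu_time_per_iter > slowest_cpu)
        s->slow_passes++;
    else
        s->slow_passes = 0;

    return s->slow_passes >= s->threshold;
}
```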
4.2 CoreTSAR API and Usage
The CoreTSAR API consists of six functions and two struc-
tures (and associated values), as Figure 5 shows. Figure 6 presents
an example transformation for a real-world example. We use the
manual code in our k-means benchmark evaluation (variable names
shortened for space). A comment in the original version shows our
proposed clause to invoke CoreTSAR’s functionality through di-
rectives.
In the manually translated version, after initialization with the
adaptive scheduler, the two registration functions register a per-
sistent input and an output array that only needs to be partially
copied back. In subsequent iterations the associated device array
is retrieved by matching the input pointer’s address and returned
without reallocation; CPUs simply receive the input pointer back
immediately. The bottom of the example uses a swap to inform
CoreTSAR that the host buffer to use for copies has changed, which
allows host-side double buffering without re-registration.
We encapsulate the main kernel in a parallel region that uses all
cores. Using OpenMP thread numbers as IDs, we obtain the device
structures from CoreTSAR. The inner loop of the parallel region
supports use of the split and quick schedules. The ctsar_loop()
function tracks the status of the work and continues or exits as
needed. Inside the loop, the ctsar_next() function takes the
number of iterations that have not yet been completed and dis-
tributes iterations to devices.
Note that there is no replication of code: each thread is assigned
iterations to compute in its associated device structure and selects
whether it uses the GPU for acceleration through the
Accelerated OpenMP if() clause. In the event that the device is
a CPU, the loop is run serially on the associated core completing
its assigned iterations. If, on the other hand, it is a GPU-controlling
thread, the deviceptr() clause passes in memory regions pro-
vided by CoreTSAR rather than specifying explicit array slices in
the directive. In this code, the library handles timing and all mem-
ory allocation and movement, except that of fc, which greatly sim-
pliﬁes the design compared to our previous work. Even with the
semantic improvements, the CoreTSAR version is more verbose
than the original, but it can use all CPU cores and all GPUs without
code replication.

ctsar * c = ctsar_init(no, CTSAR_ADAPTIVE,
                       CTSAR_DEV_CPU | CTSAR_DEV_GPU,
                       NULL, NULL);
#pragma omp parallel shared(fo,fc) private(i,j,k)\
        firstprivate(no,ncl,nco)
{
  int tid = omp_get_thread_num();
  ctsar_device * dev = &c->devices[tid];
  float *cfo, *cm;
  cfo = (float *)ctsar_reg_mem(c, fo, sizeof(int),
          no*nco, CTSAR_MEM_PERSIST | CTSAR_MEM_INPUT);
  cm = (float *)ctsar_reg_mem(c, m, sizeof(int),
          no, CTSAR_MEM_PARTIAL | CTSAR_MEM_OUTPUT);
  do{
    c = ctsar_next(c, no);
    int gts = dev->start, gte = dev->end;
    #pragma acc region for independent private(i)\
            deviceptr(cm,cfo) copyin(fc[0:ncl*nco])\
            copyin(nco,no,ncl,gts,gte) \
            if(dev->type == CTSAR_DEV_GPU)
    for (i=gts; i<gte; i++) {
      cm[i] = findc(no,ncl, nco, cfo, fc,i);
    }
  }while(ctsar_loop(c));
  ctsar_swap_mem(c,m,m2);
}
swap(&m,&m2);

#pragma acc region for independent private(i) copyout(m)\
        copyin(fc[0:ncl*nco],fo[0:no*nco]) \
        copyin(nco,no,ncl,gts,gte)\
        hetero(1,quick) \
        persist(fo) \
        partial(m) /* proposed extension */
for (i=0; i<no; i++){
  m[i] = findc(no,ncl,nco,fo,fc,i);
}
swap(&m,&m2);

Figure 6: Manually transformed k-means kernel (top) and proposed
extensions for automatic transformation of the k-means kernel (bottom).
4.3 Memory Management
Our previous heterogeneous scheduler design handled memory by
sending the entire inputs to the GPU at the beginning of a region,
and copying the entire output arrays back at the end. Some bench-
marks use Accelerated OpenMP data regions to manage persistent
memory at a higher level that allows the input copy phase to be
skipped when using one GPU. Nonetheless, far more data than
necessary is copied. It used that design because existing Acceler-
ated OpenMP implementations did not support partial array copies.
With this support now available, we use it to reduce memory trans-
fer overhead and to eliminate the need to merge multiple copies of
output arrays in a separate step. Further, we found that we could not
emulate a data region for persistent data to be placed on multiple
GPUs. The model supports doing so with one GPU but the required
nesting order with multiple GPUs is complicated and frequently in-
feasible.
Our solution uses the deviceptr() clause to pass pointers to
memory that are allocated on the GPU and managed by CoreT-
SAR. We offer a straightforward syntax that supports automatic
management of partial array copies and data regions across devices.
CoreTSAR currently takes a pointer to CPU memory, the size of
an element of that array, the number of array elements, and a ﬂag
option that allows the user to control copy behavior. Flags control
whether memory is copied in or out or both. CoreTSAR also sup-
ports two special cases: persistent memory and partial copies. The
manually translated code in Figure 6 shows examples. While we
control memory copies instead of the compiler, and thus lose any
compiler optimizations, the resulting applications provide equal or
better performance than those produced by allowing the PGI Ac-
celerator compiler to handle copies, and the added ﬂexibility of of-
fering persistent memory regions across multiple GPUs improves
on that even more by avoiding unnecessary copies of static data.
Regardless of the ﬂags, ctsar_reg_mem() allocates an appro-
priate size buffer on the device associated with the calling thread.
If the region is set to persistent, we copy the data from the CPU
array into the newly allocated memory, where it resides until it is
explicitly removed with a call to ctsar_unreg_mem(). Given a
copy direction (in or out) but no special ﬂags, we copy the region
to the GPU in ctsar_next() and from it in ctsar_loop() when
the return value will cause the thread to exit the outer loop. Regions
that are marked partial indicate that the region only needs part of
that memory to execute the compute kernel.
Currently, CoreTSAR only handles simple partial copies, as
described in Section 3.5, in which it copies the elements of the
array that correspond to the iterations assigned to each thread,
to or from its memory space. This basic support only covers a
small subset of the total possible cases but covers most cases that
arise in practice, partly because OpenMP codes tend to work on
sequentially allocated memory locations in a given thread. Given
an item size, we can use this system either to copy a range of items,
such as ﬂoats, that directly correspond to loop iterations, or to
copy a run of 100 ﬂoats that correspond to each loop iteration. For
example, the Helmholtz benchmark uses a two dimensional matrix,
allocated as an array, in which each thread computes a set of rows in
that matrix. While each thread computes more than one output cell,
since they are sequential, we can implement the necessary partial
copies by specifying the item size as the size of a single element
times the length of a row. Thus, we do not require special handling
beyond what CoreTSAR offers.
The one special-purpose treatment worth mentioning is that of
reductions. Since the output of a reduction is not speciﬁc to a single
thread, we keep one version in each memory space and merge the
results at the end in a short reduction pass on the CPU cores.
4.4 Portability
Currently CoreTSAR assumes that OpenMP manages CPU threads,
and that accelerators are CUDA-enabled GPUs. Thus, we use
omp_get_thread_limit() to determine the number of CPU resources,
cudaMalloc()/cudaMemcpy() to allocate and to manage
accelerator memory and cudaGetDeviceProperties() to
read the number of cores and capabilities of the GPU for initial
performance estimation. Our design does not require these as-
sumptions so we could easily add other targets or replace these
interfaces.
5. Evaluation
This section evaluates the CoreTSAR library. We compiled all
benchmarks with the PGI Accelerator compiler, pgcc, pgCC or
pgfortran as appropriate, version 12.5. We use the optimization flags -O3
and -mp=allcores and enable compilation for NVIDIA GPUs, including
compute capabilities 1.3 and 2.0. Table 1 lists our test platforms.
Unless otherwise specified, we ran tests on escaflowne. All
tests employ all CPU cores; in tests with GPUs enabled, one thread
controls each selected GPU and thus does no computation.
We use default scheduler parameters unless otherwise specified,
with the initial split calculated at runtime based on the available
resources and a div of 10. We include all scheduling overhead,
GPU data transfer time, and synchronization time.
System name  CPU model  CPU cores/die  CPU dies  CPU RAM (MB)  GPU model        GPU cards  GPU cores  GPU RAM (MB)
amdlow3      E3300      2              1         2,012         Tesla C2050      1          448        3,071
armor1       E5405      4              2         3,964         GeForce GT 520   1          48         1,023
dna2         i5-2400    4              1         7,923         GeForce GTX 280  1          240        1,023
dna3         i5-2400    4              1         7,923         Tesla C2075      1          448        5,375
escaflowne   X5550      4              2         24,154        Tesla C2070      4          448        5,375
Table 1: Test system specifications. All CPUs and GPUs are made by Intel and NVIDIA, respectively.
We evaluate CoreTSAR with ﬁve applications that we manu-
ally accelerated using the PGI Accelerator directives, and subse-
quently enhanced with the CoreTSAR library. These applications
are GEM [1, 15, 16], k-means, CG [4], Helmholtz and CORR [17].
We chose these applications because they provide a range of ap-
plication behaviors in terms of data sizes, region lengths, and GPU
suitability as well as hitting many of the potential corner cases and
difﬁcult situations CoreTSAR would likely face in real-world use.
As our primary concern is scheduling behavior and not computa-
tional kernel performance, we ported each from OpenMP (except
CORR, which used HMPP originally) to PGI Accelerator with the
minimum change possible. This decision provides a realistic use of
Accelerated OpenMP, at least at ﬁrst, to evaluate the suitability of
GPUs, and is how we envision CoreTSAR being used if integrated
directly into the directives. Thus, we obtain computational kernels
that are not optimized for the GPU other than by the compiler.
However, CoreTSAR supports users providing an OpenMP CPU
code and a manually implemented and highly optimized CUDA
kernel, as long as the inputs and outputs of the two are compatible.
CoreTSAR would schedule across them and offer memory man-
agement with the same syntax.
During a given run, we collect various statistics and results.
We deﬁne computation time as the time to complete all compu-
tations and all initialization and ﬁnalization necessary to run the
computations. We include all time that is not required for the orig-
inal OpenMP CPU code to function, such as library initialization,
scheduling, and memory movement. We do not include application
IO or problem setup that is shared between CPU, GPU and sched-
uled versions. We also record the time for each thread to complete
its assigned iterations, from which we can compute the time that
threads wait for others to complete, the time spent to calculate the
split for the next pass and, as a subset of that, the time taken to up-
date the linear model solution to the new values. Finally, we track
the time per iteration for each thread, as described in Section 3.
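As a minimal sketch of how the derived statistics follow from the
recorded per-thread completion times, the Python fragment below
computes each thread's wait time and its time per iteration. The
function names and the sample numbers are illustrative assumptions,
not part of CoreTSAR.

```python
def wait_times(thread_seconds):
    """Time each thread idles while waiting for the slowest thread."""
    slowest = max(thread_seconds)
    return [slowest - t for t in thread_seconds]


def time_per_iteration(thread_seconds, iterations):
    """Seconds per iteration for each thread (0.0 if it ran no iterations)."""
    return [t / n if n else 0.0 for t, n in zip(thread_seconds, iterations)]


# Hypothetical pass: three threads with different loads and speeds.
seconds = [2.0, 1.5, 0.5]
iterations = [100, 50, 200]

waits = wait_times(seconds)
per_iteration = time_per_iteration(seconds, iterations)
print(waits, per_iteration)
```

The slowest thread always has a wait time of zero, so the sum of the
wait times is a direct measure of how unbalanced a pass was.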
5.1 Benchmarks
As mentioned above, we employ ﬁve benchmarks in our evaluation.
CG is a direct port of the conjugate gradient benchmark from the
NAS Parallel Benchmarks. CORR is a correlation code, originally
from the PolyBenchGPU benchmark suite, that computes an upper
triangular 2D correlation matrix. Each independent iteration
computes one row of that output matrix, so each iteration has a
different load, which produces an unbalanced execution profile. GEM
is a molecular modeling application for the study of the electro-
static potential along the surface of a macromolecule whose pro-
ﬁle has been well studied as a test case for GPU optimization [9].
Helmholtz is a discrete ﬁnite difference code that uses the Jacobi
iterative method to solve the Helmholtz equation. Finally, k-means
is a popular iterative clustering method. The ﬁve benchmarks can
generally be characterized by the number of passes through the par-
allel region that they make, the length of each of these passes, and
how suitable they are to run on the GPU. Table 2 presents values
that represent each of these properties for our ﬁve benchmarks.
Benchmark  Passes  Time/pass  CPU (static) runtime  GPU runtime  Ratio
CG         1900    0.02       50.56                 273.04       0.85
CORR       10*     6.36       1540.43               70.97        0.45
GEM        1       1098.10    1098.35               107.43       0.06
Helmholtz  100     0.08       8.61                  73.64        1.00
k-means    7       1.14       8.82                  4.79         0.41
Table 2: Benchmark characteristics, times in seconds.
The table exhibits a wide range in number of passes through
the parallel region, from 1 to 1900 passes. Our scheduler is
designed to operate primarily at the boundaries of parallel regions,
so this number can greatly affect how CoreTSAR interacts with
an application. For example, in the GEM benchmark, the adaptive
scheduler acts in a manner identical to the static scheduler because
the training pass is the only pass in the application. Conversely, CG
has many passes, which provides CoreTSAR with a wealth of op-
portunities to adjust and to correct scheduling decisions. That ben-
eﬁt comes at a cost. CG’s passes are very short, which accentuates
any scheduler overheads as well as data copy costs. CORR is also
important in this respect; it has only one pass by default, but is a
computational kernel that a real application would use repeatedly.
For our evaluation, we use 10 iterations; 5 and 20 iterations yield
similar relative results.
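To illustrate why the number of passes matters, the following sketch
recomputes a split at a pass boundary from the time each device took
in the previous pass. It uses a simple proportional rule as a
stand-in: CoreTSAR itself obtains the split from a linear model, and
the function name next_split and the sample timings are assumptions
made purely for illustration.

```python
def next_split(iterations, seconds, total_iterations):
    """Divide total_iterations in proportion to each device's measured rate.

    iterations[i] and seconds[i] describe what device i did in the
    previous pass; all times are assumed to be positive.
    """
    rates = [n / s for n, s in zip(iterations, seconds)]
    total_rate = sum(rates)
    shares = [int(total_iterations * r / total_rate) for r in rates]
    # Hand any iterations lost to rounding down to the fastest device.
    shares[rates.index(max(rates))] += total_iterations - sum(shares)
    return shares


# Hypothetical previous pass: a CPU worker and a GPU each ran 500
# iterations, and the GPU finished four times sooner.
split = next_split([500, 500], [4.0, 1.0], 1000)
print(split)
```

With a single pass, as in GEM, this update never runs, so the initial
split is the only one used. With many passes, as in CG, the update
runs often, which is why its own cost becomes visible.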
The table also shows a wide range of GPU suitability, which we
deﬁne as the quality of an algorithm or in this case an implemen-
tation for running on a particular device. Running GEM on only
one GPU finishes the problem more than 10× faster than on eight
server class Intel CPU cores. CORR also shows extreme suitabil-
ity, largely a result of the static schedule employed in the CPU tests.
Because the workload is imbalanced, each CPU core is given a dif-
ferent amount of actual work. The GPU test, because of the multi-
block design of GPU kernels, handles this variation better. If we
use the OpenMP dynamic schedule, CORR runs in approximately
150 seconds, 10× faster than the static performance. The ratios of
GEM and CORR reflect their high level of suitability: a ratio near
zero allocates most work to the GPU, while a ratio near 1 allocates
almost all of it to the CPU, as discussed in Section 3. K-means has
a similarly high level of
suitability, while Helmholtz is so unsuitable that it beneﬁts from
turning off the GPUs. Generally, the suitabilities match our expec-
tations, with the exception of CG.
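The runtimes in Table 2 make the suitability discussion easy to
check. The sketch below, with the table values copied in by hand,
computes the speedup of the GPU-only run over the static CPU run for
each benchmark. The dictionary layout and the function name are
illustrative, and the ratio column is carried along only for display,
since it is produced by the scheduler rather than derived from these
two runtimes.

```python
# (CPU static runtime, GPU runtime, ratio), times in seconds, from Table 2.
TABLE2 = {
    "CG": (50.56, 273.04, 0.85),
    "CORR": (1540.43, 70.97, 0.45),
    "GEM": (1098.35, 107.43, 0.06),
    "Helmholtz": (8.61, 73.64, 1.00),
    "k-means": (8.82, 4.79, 0.41),
}


def gpu_speedup(cpu_seconds, gpu_seconds):
    """Speedup of the GPU-only run relative to the static CPU run."""
    return cpu_seconds / gpu_seconds


speedups = {
    name: gpu_speedup(cpu, gpu) for name, (cpu, gpu, _ratio) in TABLE2.items()
}

# List the benchmarks from most to least GPU-suitable by this measure.
for name, value in sorted(speedups.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:10s} GPU speedup {value:6.2f}x  ratio {TABLE2[name][2]:.2f}")
```

Values above 1 favor the GPU. CG and Helmholtz fall below 1 on this
system, which matches the observations that follow.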
In our previous work, as well as that of others, CG has been
found suitable for GPUs. Some of our experiments on other plat-
forms showed a ratio of approximately 0.55. Here, the GPU version
takes more than 5× longer than the CPU version on our primary
test system, partly because of the overhead of running so many
small kernels, but mostly because PGI Accelerator’s default as-
signment policy underutilizes the GPU multiprocessors. Optimiza-
tion could likely overcome this issue, but we leave this for future
work since we focus on scheduling properties rather than individual
benchmark GPU performance.
5.2 CoreTSAR Performance
We begin our evaluation of CoreTSAR with an evaluation of the
overall speedup achieved for benchmarks across schedulers on one
GPU and all cores on escaflowne, as Figure 7a shows. For the
applications that suit GPUs, all but Helmholtz, one or more of
the CoreTSAR modes, Static, Adaptive, Split and Quick, improve
performance over using either the GPU or CPU alone. Performance
of k-means improves by as much as 2×, GEM and CG by about
10%, and in the case of CORR by a tiny margin over GPU only.
[Figure 7a: CoreTSAR speedup using one GPU across schedulers.
Panels: cg, corr, gem, helmholtz, kmeans. x-axis: Scheduler (CPU,
GPU, Static, Adaptive, Split, Quick). y-axis: Speedup over 8 core
OpenMP.]
[Figure 7b: CoreTSAR speedup across schedulers and number of GPUs
for GPU amenable benchmarks. Panels: corr, gem, kmeans by number of
GPUs (1 to 4). x-axis: Scheduler (CPU, GPU, Static, Adaptive, Split,
Quick). y-axis: Speedup over 8 core OpenMP.]
Figure 7: CoreTSAR evaluation on escaflowne.
[Figure 8: Helmholtz with and without GPU backoff support. Panels:
GPU, Static, Adaptive, Split, Quick. x-axis: Number of GPUs (1 to 4).
y-axis: Speedup from 8-core CPU. Series: Implementation (Optimized,
Original).]
5.2.1 GPU Averse Applications
As we discussed earlier, Helmholtz was selected speciﬁcally be-
cause the application is GPU-averse. In no circumstance that we
have found does it perform better by using a GPU for any work
at all. We include this benchmark because real applications are
likely to include sections like this, and should not suffer much per-
formance degradation as a result of using CoreTSAR. More im-
portantly, some applications may perform well using a GPU in a
particular conﬁguration but not another, and users need consistent
performance running on either hardware without being forced to
rewrite their code. We discuss this topic further in Section 5.4.
For the single GPU case, one of the lower overhead schedulers,
such as Adaptive or Quick, can reduce the performance loss of
Helmholtz by quickly converting GPU threads to CPU threads.
To evaluate the effect of our extension that converts GPU threads,
Figure 8 displays the results for Helmholtz without GPU back-off
in Original and with it in Optimized. While neither result matches
the CPU performance, Optimized is as much as 24× faster than
Original, and, having released the GPUs, uses fewer resources.
Regardless of the number of GPUs, CoreTSAR decides to stop
using all of them. While we do not achieve a speedup, we expect
that some accelerators, especially upcoming ones that share the
memory subsystem with the CPU, may be able to accelerate this
application. In the meantime we can mitigate the loss not only
for this application but for any application that runs poorly on a
given accelerator. For example, CG beneﬁts from it whenever we
assign more than one GPU, by turning off all but one GPU to avoid
detrimental data-movement overheads.
5.2.2 GPU Amenable Applications
The other three benchmarks (CORR, GEM and k-means) bene-
ﬁt from using more than one GPU, as Figure 7b shows. CORR
does not behave like the others. Moving from one to two GPUs in-
creases performance, but performance improves only slightly with
more GPUs. The Quick scheduler even loses performance as we
add more GPUs, due to an incorrect decision made based on the
imbalanced load presented by CORR in the training pass. This re-
sult shows that Quick and Split are not appropriate for applications
with highly variable in-pass workloads. Adaptive performance does
not improve because the benchmark only has enough work to uti-
lize one GPU fully and part of a second; adding more GPUs leaves
them underutilized and does not further reduce computation time.
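The imbalance that misleads a short training pass can be made
concrete with a toy cost model. The sketch below assumes that the
work for row i of an n by n upper triangular matrix is proportional
to its n - i entries, and that a static schedule hands each thread
one contiguous block of rows. Both the cost model and the function
name are illustrative assumptions, not measurements of CORR.

```python
def static_chunk_work(n_rows, n_threads):
    """Work per thread when rows are split into contiguous equal chunks.

    Row i of an upper triangular n_rows x n_rows matrix is assumed to
    cost n_rows - i units of work.
    """
    chunk = n_rows // n_threads
    work = []
    for t in range(n_threads):
        start = t * chunk
        # The last thread also takes any leftover rows.
        stop = n_rows if t == n_threads - 1 else start + chunk
        work.append(sum(n_rows - i for i in range(start, stop)))
    return work


work = static_chunk_work(8, 4)
print(work)
```

Under this model the first thread receives several times the work of
the last one even though every thread gets the same number of rows,
so a time-per-iteration estimate taken from only the first few rows
would overstate the cost of the remaining ones.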
GEM and k-means on the other hand scale almost perfectly up
to four GPUs and four CPU cores. Specifically, the runtime for the
default dataset falls from 101 seconds for GEM or 3.29 seconds for
k-means with one GPU and the Quick scheduler to 27 seconds and
1.4 seconds with four GPUs using Adaptive, which implies scaling
values of 3.74× and 2.35× respectively. While the scaling of
k-means may not appear linear, recall that we reduce the CPU
resources as we add GPU resources. Thus, 4× is an unreasonable
expectation. 8 CPU cores achieve as much performance as 80% of
one GPU, so we are effectively scaling from 1.8 GPUs to
approximately 4.4 GPUs, which corresponds to peak scaling of 2.4×,
only very slightly more than achieved. With k-means, the most
efficient scheduler changes as the number of GPUs increases.
Including the GPU copy time in the iteration time causes the Quick
scheduler to underestimate the performance of the GPUs in the
initial short training phase, then underutilize them during the rest
of the first pass. As the GPU count increases, the effect
intensifies and the Adaptive scheduler performs progressively better
despite the fact that they both make equivalent decisions in
subsequent passes.
                     Number of GPUs
Benchmark            1        2        3        4
CG persist           257.48   264.94   268.52   268.59
CG                   438.58   664.81   892.59   1109.84
CORR persist         106.39   67.05    66.98    66.58
CORR                 113.22   66.89    66.73    66.59
GEM persist          101.01   52.72    35.65    27.10
GEM                  101.01   52.72    35.65    27.10
Helmholtz persist    73.22    48.53    50.45    48.75
Helmholtz            86.51    64.25    70.81    75.15
k-means persist      4.61     2.63     1.88     1.49
k-means              4.76     2.80     2.08     1.77
Table 3: Time to run each benchmark with the static schedule with
and without persistent memory available
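The effective-scaling argument for k-means in Section 5.2.2 can be
checked with a few lines of arithmetic. The sketch below expresses a
mix of GPUs and CPU cores in units of one GPU, using the figure of
eight cores delivering 80% of one GPU. The function name and its
parameters are illustrative. To match the 1.8 figure quoted in the
text, all eight cores are counted alongside the first GPU; passing
compute_cores=7 instead would account for the thread that controls
that GPU.

```python
def effective_gpus(gpus, compute_cores, total_cores=8, all_cores_as_gpus=0.8):
    """Express a GPU plus CPU-core mix in units of one GPU.

    all_cores_as_gpus is the fraction of one GPU's performance that
    total_cores CPU cores deliver together (0.8 in the text).
    """
    return gpus + all_cores_as_gpus * compute_cores / total_cores


start = effective_gpus(gpus=1, compute_cores=8)  # about 1.8 GPUs
end = effective_gpus(gpus=4, compute_cores=4)    # about 4.4 GPUs
peak_scaling = end / start
measured_scaling = 3.29 / 1.4  # k-means runtimes quoted in the text

print(f"peak {peak_scaling:.2f}x, measured {measured_scaling:.2f}x")
```

The measured scaling sits just under the peak computed this way,
which is the sense in which k-means scales almost perfectly.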
5.3 Memory Performance
We now evaluate our memory manager and some of its optimizations
(Section 4.3). The partial array copy and persistent array options
are of particular interest. We cannot disable partial memory
copy and maintain correctness in our current design, so we do not
evaluate it directly. We can, however, manipulate persistent memory across
multiple GPUs. Without CoreTSAR, placing memory on a GPU,
or especially multiple GPUs, and retaining that memory from the
ﬁrst entry into a region through to the last is prohibitively difﬁcult.
Each of our applications has at least a small data region that does
not change, either a range of constants in the case of CORR, or the
large list of observations in k-means. Enabling persistence on these
data structures can greatly reduce the memory overhead in some
cases as Table 3 shows.
We use the Static scheduler results for the purpose of this com-
parison because the Adaptive scheduler would in some cases de-
tect the excess overhead and cloud the results by converting GPU
threads to CPU threads. The most immediately striking difference
is in CG, where the overhead caused by a lack of persistent memory
exceeds 3×. CORR, due to low data copy overhead, and GEM,
because data is only copied once, are largely unaffected. Helmholtz
has an interesting pattern: the copy overhead decreases with more
GPUs, because its copy overhead is mostly from the partial data
that must be copied every iteration, which is increasingly paral-
lelized with more GPUs. K-means initially shows little difference,
but reaches approximately 15% overhead with four GPUs due to
a lack of persistent memory. Despite the simplicity of persistence,
existing Accelerated OpenMP implementations do not include a
comparable option.
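The cost of running without persistence can be read off Table 3 as a
ratio of the two runtimes for each GPU count. The sketch below does
this for three of the benchmarks, with the table values copied in by
hand. The data layout and the function name are illustrative
assumptions.

```python
# Static-schedule runtimes in seconds for 1 to 4 GPUs, from Table 3:
# (with persistent memory, without persistent memory).
TABLE3 = {
    "CG": ([257.48, 264.94, 268.52, 268.59],
           [438.58, 664.81, 892.59, 1109.84]),
    "GEM": ([101.01, 52.72, 35.65, 27.10],
            [101.01, 52.72, 35.65, 27.10]),
    "k-means": ([4.61, 2.63, 1.88, 1.49],
                [4.76, 2.80, 2.08, 1.77]),
}


def slowdown_without_persistence(persistent, non_persistent):
    """Runtime without persistence divided by runtime with it, per GPU count."""
    return [plain / kept for kept, plain in zip(persistent, non_persistent)]


slowdowns = {
    name: slowdown_without_persistence(kept, plain)
    for name, (kept, plain) in TABLE3.items()
}

for name, values in slowdowns.items():
    print(name, " ".join(f"{v:.2f}x" for v in values))
```

A ratio of 1 means persistence makes no difference, as for GEM, whose
data is copied only once. For CG the ratio grows with every GPU
added.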
5.4 Adaptation Across Machines
As we mentioned in earlier sections, one of our primary goals
is to create a system that offers the ﬂexibility of OpenMP, auto-
matically adapting to available resources portably, for accelerator-
based codes and systems. In Section 5.2 we showed that CoreT-
SAR adapts well to different numbers of GPUs in a given system.
We now evaluate how CoreTSAR performs across a set of highly
distinct systems. We use a group of diverse systems that we boot
with the same system image to reduce software differences, and test
our benchmarks across them all. Table 1 shows the systems. Other
than our main test system evaluated elsewhere (escaﬂowne), these
systems have only one GPU. However, the GPUs are from several
generations with varying amounts of memory (from an NVIDIA
GT 520 at the low end up to an NVIDIA C2075). The CPUs are
even more diverse, ranging from a single 2-core Intel Celeron to
two quad-core Intel Xeon processors. Of particular interest are the
GPU-centric system amdlow3, which contains a dual-core Intel
Celeron processor and NVIDIA C2050 GPU, and the CPU-centric
system armor1 with two quad-core Intel Xeon processors and a
low-power NVIDIA GT 520. For this evaluation, we use all
benchmarks except Helmholtz and CORR, as the two smaller memory
GPUs cannot hold our default problem sets for these benchmarks
in memory. Figure 9a shows our results across these systems.
The most prominent feature of the results across systems is the
signiﬁcant change in overall speedup. In particular, amdlow3 ex-
hibits consistently high speedups using the GPU for anything, al-
though the Split scheduler suffers from high scheduling overhead
due to the severe imbalance between its Intel Celeron processor
and NVIDIA C2050 GPU. Even CG shows material speedups on
the system, as much as 10×. More importantly, even though the
speedup and overall performance shift across the various systems
for each benchmark, the distribution of performance by scheduler
is similar across them. Thus, ﬁnding the right scheduling algorithm
to use with CoreTSAR is more about the application than the hard-
ware. A user can spend the time to select the proper scheduling
option once per region and then apply that choice across a wide
array of machines. Further, these results show that the default
adaptive scheduler is effective across hardware configurations. The
only exception is GEM on amdlow3, whose single pass gives the
scheduler no opportunity to recover. GEM’s strong performance on the other
devices also showcases the portability of our computed default di-
vision of work, which for that application is consistently near the
optimal on systems with capable CPUs.
We also investigate the actual division of work between differ-
ent processing elements in Figure 9b. Each benchmark has its own
pattern, with CG using mostly CPU cores on most machines, ex-
cept amdlow3, and GEM using the GPU almost entirely, except
on armor1. While the distribution of overall performance in Fig-
ure 9a shows that the Adaptive scheduler often achieves high per-
formance, the work division shows how differently each machine
behaves, which CoreTSAR hides from the end user.
6. Related Work
Task scheduling as a mechanism for easing parallel programming
has a long history. Traditional applications of the approach include
dynamic loop parallelization in OpenMP [10] and Intel’s TBB [25].
These mechanisms tend to offer simpliﬁed syntax for shared mem-
ory parallelism, but little to no support for heterogeneous archi-
tectures or distributed memory. With the rise of cluster and cloud
computing platforms, frameworks like Charm++ [21] and MapRe-
duce [11] perform a similar role, dividing tasks across distributed
resources automatically. In addition to its role as a cloud computing
model, MapReduce has been the target of various GPU acceleration
projects including Mars [18], MapCG [19], and StreamMR [14].
While some GPU MapReduce implementations support multiple
GPUs, they generally do not support coscheduling of CPUs and
GPUs on the same problem.
Task scheduling policies, especially in OpenMP and Charm++,
have also been the focus of significant research. Work by Ayguadé
et al. [3] directly influenced the design of CoreTSAR. They
investigated the possibility of removing or extending the OpenMP
schedule clause by calculating the distribution of work in future
passes through a region based on times seen for each core in
previous ones. Their results showed the method was not always
optimal but the solution was efficient and stable. Our ratio-based
design works similarly, although with a different mechanism to
determine the split between units, and explicit support for
heterogeneous hardware.
[Figure 9a: CoreTSAR speedup across systems. Panels: cg, gem and
kmeans on amdlow3, armor1, dna2, dna3 and escaflowne. x-axis:
Program, Machine abbreviation. y-axis: Speedup over OpenMP across
all local cores. Series: Scheduler (CPU, GPU, Static, Adaptive,
Split, Quick).]
[Figure 9b: Work distribution across devices in each system, where a
single device is a single black box, with the Adaptive scheduler.
Panels: cg, gem, kmeans. x-axis: Machine. y-axis: Percent of work
assigned to each device. Series: Device type (CPU, GPU).]
Figure 9: CoreTSAR evaluation across machines
With the proliferation of GPUs and other computational
accelerators, task schedulers are being designed specifically for
these environments. StarPU [2] schedules heterogeneous tasks and serves as
the basis for various scheduling mechanisms. Other important work
includes Qilin [22] and the work by Jiménez et al. [20]. Two major
factors distinguish our work from these schedulers. First, they
target overall compute bandwidth by scheduling at the granular-
ity of function calls or kernels while we target compute latency by
scheduling the work inside a single parallel region. Second, they
require reimplementation of one’s code in a new programming lan-
guage (StarPU) or a new API (Qilin), or one must manually create a
function for each architecture. While CoreTSAR currently requires
one to split code manually, our Accelerated OpenMP extensions
will allow the compiler to automate this mechanical process.
Accelerated OpenMP implementations are beginning to prolif-
erate in industry, including HMPP [12], OpenACC [7], PGI Accel-
erator [27] and Cray Accelerated OpenMP [6]. Each one offers a
method for a user to target multiple GPUs and CPUs by explicitly
splitting the workload and targeting each device with a region or
codelet. They do not offer coscheduling within a single region. As
they all offer a different implementation of C or Fortran to GPU
translation, and do not offer in-region coscheduling, they are or-
thogonal to our work. Our goal is to offer an extension to this
model, using one of these as a platform for coscheduling rather
than competing with them.
OmpSs [13] offers another implementation of Accelerated
OpenMP in addition to those being slowly adopted by industry.
This programming model includes a coscheduling mechanism that
supports both CPUs and multiple GPUs. However, their evaluation
concluded that two CPUs always beat any combination of CPU
and one to two GPUs. Our approach yields signiﬁcant speedups by
using all available compute resources.
The dynamic load balancing scheme in Chen et al.’s [8] work
provides an efﬁcient work queue for load balancing and work
distribution across multiple GPUs. Their approach can schedule
individual blocks through the use of a persistent kernel on the GPU
that receives work in block size chunks from the CPU. This model
is somewhat orthogonal to our work. CoreTSAR could employ
a work-queue design much like theirs inside of an Accelerated
OpenMP implementation. That environment would offer greater
control to the scheduler, while also allowing the compiler to pre-
load all application kernels, thus avoiding the primary drawback of
Chen et al.’s work by creating the offbeat structure that it requires.
The scheduling framework presented by Ravi et al. [23] pro-
vides an interesting counterpoint to our approach. Building on their
generalized reduction framework and code generator [24], they
present scheduling mechanisms for multicore systems with a sin-
gle GPU. While we avoid the chunk scheduling scheme in order
to avoid the additional transfer overheads, they present an alterna-
tive approach using chunk based scheduling while mitigating the
overhead through runtime techniques.
Finally, HTS [26] is clearly the most closely related work.
Therein the authors investigate extensions to Accelerated OpenMP
to support coscheduling across CPUs and one GPU within a region.
It evaluates similar schedules to those that we explore in this work,
with the limitation that exactly two heterogeneous processors could
be uniquely scheduled. CoreTSAR reworks this previous approach
by extending the scheduling scheme to apply to an arbitrary number
of devices allowing independent scheduling of individual cores as
well as multiple GPUs. Most importantly, our new approach does
not need to rely on static splitting or underlying schedulers between
“homogeneous” devices, which allows us to handle some classes
of imbalanced workloads directly and to manage NUMA effects
within a class of device better.
7. Conclusion
We have presented the design and implementation of CoreTSAR,
our automated task scheduler for accelerator-aware runtimes. We
make three major contributions: the design of our scheduler for
adaptive scheduling across arbitrary numbers of heterogeneous
devices; an implementation and optimization of that design; and our
evaluation across five scientific codes and five distinct systems.
These contributions yield speedups as high as 3.74× over the best
achievable by using all cores and a single GPU in our test
system. When compared to the original CPU performance on 8 cores,
we achieve as much as 40× for one benchmark. These results,
along with the portability our schedulers displayed across systems,
clearly motivate the addition of a co-scheduling interface, such as
the hetero() clause that we propose, to Accelerated OpenMP.
As future work, we will investigate expanding CoreTSAR with
a persistent kernel work-queue for finer grained GPU scheduling. In
the main scheduler, CoreTSAR could automatically detect NUMA
issues, and the association of GPUs, through PCI-E association,
to CPUs and manage these automatically for greater performance.
Finally, the memory management interface that we present is the
ﬁrst step towards a general interface for declaring the relationship
between tasks and the portions of inputs and outputs that they
require. Given that information, many schedulers, including ours,
could automatically manage input and output, providing signiﬁcant
value especially as computers become more complex.
References
[1] Ramu Anandakrishnan, Tom R.W. Scogland, Andrew T. Fenley,
John C. Gordon, Wu-chun Feng, and Alexey V. Onufriev. Accelerating
Electrostatic Surface Potential Calculation with Multi-Scale Approxi-
mation on Graphics Processing Units. Journal of Molecular Graphics
and Modelling, 28(8):904–910, 2009.
[2] Cédric Augonnet, Samuel Thibault, Raymond Namyst, and Pierre-André
Wacrenier. StarPU: A Uniﬁed Platform for Task Scheduling on Het-
erogeneous Multicore Architectures. In Henk Sips, Dick Epema, and
Hai-Xiang Lin, editors, Euro-Par 2009 Parallel Processing, volume
5704, pages 863–874. Springer Berlin Heidelberg, Berlin, Heidelberg,
2009.
[3] Eduard Ayguadé, Bob Blainey, Alejandro Duran, Jesús Labarta,
Francisco Martínez, Xavier Martorell, and Raúl Silvera. Is the Schedule
Clause Really Necessary in OpenMP? In WOMPAT’03: Proceedings
of the Workshop on OpenMP Applications and Tools 2003. Springer-
Verlag, June 2003.
[4] D.H. Bailey, E. Barszcz, J.T. Barton, D.S. Browning, R.L. Carter,
L. Dagum, R.A. Fatoohi, P.O. Frederickson, T.A. Lasinski, R.S.
Schreiber, H.D. Simon, V. Venkatakrishnan, and S.K. Weeratunga.
The NAS Parallel Benchmarks. International Journal of High Per-
formance Computing Applications, 5(3):63 –73, 1991.
[5] M. Berkelaar, P. Notebaert, and K. Eikland.
lp_solve: (mixed integer) linear programming problem solver.
http://lpsolve.sourceforge.net/5.0/, 2003.
[6] James C. Beyer, Eric J. Stotzer, Alistair Hart, and Bronis R. de Supin-
ski. OpenMP for Accelerators. In Barbara M. Chapman, William D.
Gropp, Kalyan Kumaran, and Matthias S. Müller, editors, OpenMP in
the Petascale Era, volume 6665, pages 108–121. Springer Berlin Hei-
delberg, Berlin, Heidelberg, 2011.
[7] CAPS Enterprise, Cray Inc., NVIDIA and the Portland
Group. The openacc application programming interface, v1.0.
http://www.openacc-standard.org, Nov. 2011.
[8] Long Chen, O Villa, S Krishnamoorthy, and G.R Gao. Dynamic
Load Balancing on Single- and Multi-GPU Systems. 2010 IEEE
International Parallel & Distributed Processing Symposium (IPDPS),
pages 1–12, 2010.
[9] M. Daga, T. Scogland, and W. Feng. Architecture-aware mapping and
optimization on a 1600-core gpu. In Parallel and Distributed Systems
(ICPADS), 2011 IEEE 17th International Conference on, pages 316–
323. IEEE, 2011.
[10] L. Dagum and R. Menon. OpenMP: An Industry Standard API for
Shared-Memory Programming. IEEE Computational Science & En-
gineering, 5(1):46–55, March 1998.
[11] Jeffrey Dean and Sanjay Ghemawat. MapReduce. Communications
of the ACM, 51:107, January 2008.
[12] R. Dolbeau, S. Bihan, and F. Bodin. Hmpp: A hybrid multi-core
parallel programming environment. In Workshop on General Purpose
Processing on Graphics Processing Units (GPGPU 2007), 2007.
[13] A. Duran, E. Ayguade, R.M. Badia, J. Labarta, L. Martinell, X. Mar-
torell, and J. Planas. OmpSs: A Proposal for Programming Het-
erogeneous Multi-Core Architectures. Parallel Processing Letters,
21(2):173–193, 2011.
[14] M Elteir, Heshan Lin, Wu-chun Feng, and T Scogland. StreamMR:
An Optimized MapReduce Framework for AMD GPUs. In Parallel
and Distributed Systems (ICPADS), 2011 IEEE 17th International
Conference on, pages 364–371. IEEE Computer Society, 2011.
[15] Andrew T. Fenley, John C. Gordon, and Alexey Onufriev. An Ana-
lytical Approach to Computing Biomolecular Electrostatic Potential.
I. Derivation and Analysis. The Journal of Chemical Physics, 129,
2008.
[16] John C. Gordon, Andrew T. Fenley, and Alexey Onufriev. An Ana-
lytical Approach to Computing Biomolecular Electrostatic Potential.
II. Validation and Applications. The Journal of Chemical Physics,
129:075102, 2008.
[17] S Grauer-Gray, L Xu, R Searles, and S Ayalasomayajula. Auto-tuning
a High-Level Language Targeted to GPU Codes. cis.udel.edu.
[18] B He, W Fang, Q Luo, N K Govindaraju, and T Wang. Mars:
a MapReduce framework on graphics processors. In Proceedings
of the 17th international conference on Parallel architectures and
compilation techniques, pages 260–269. ACM New York, NY, USA,
2008.
[19] Chuntao Hong, Dehao Chen, Wenguang Chen, Weimin Zheng, and
Haibo Lin. MapCG: writing parallel program portable between CPU
and GPU. In PACT ’10: Proceedings of the 19th international
conference on Parallel architectures and compilation techniques. ACM,
September 2010.
[20] Víctor J. Jiménez, Lluís Vilanova, Isaac Gelado, Marisa Gil, Grigori
Fursin, and Nacho Navarro. Predictive Runtime Code Scheduling for
Heterogeneous Architectures. In André Seznec, Joel Emer, Michael
O'Boyle, Margaret Martonosi, and Theo Ungerer, editors, High
Performance Embedded Architectures and Compilers, volume 5409, pages
19–33. Springer Berlin Heidelberg, Berlin, Heidelberg.
[21] Laxmikant V. Kale and Sanjeev Krishnan. CHARM++. In OOPSLA
’93 Proceedings of the Eighth Annual Conference on Object-Oriented
Programming Systems, Languages, and Applications, pages 91–108.
ACM Press, 1993.
[22] Chi-Keung Luk, Sunpyo Hong, and Hyesoon Kim. Qilin: Exploiting
Parallelism on Heterogeneous Multiprocessors with Adaptive Map-
ping. In Proceedings of the 42nd Annual IEEE/ACM International
Symposium on Microarchitecture, page 45. ACM Press, 2009.
[23] V T Ravi and G Agrawal. A dynamic scheduling framework for
emerging heterogeneous systems. In High Performance Computing
(HiPC), 2011 18th International Conference on, pages 1–10, 2011.
[24] Vignesh T Ravi, Wenjing Ma, David Chiu, and Gagan Agrawal. Com-
piler and runtime support for enabling generalized reduction compu-
tations on heterogeneous parallel conﬁgurations. In ICS ’10: Proceed-
ings of the 24th ACM International Conference on Supercomputing.
ACM, June 2010.
[25] J. Reinders. Intel Threading Building Blocks. 2007.
[26] Thomas R. W. Scogland, Barry Rountree, Wu-chun Feng, and Bro-
nis R. de Supinski. Heterogeneous Task Scheduling for Accelerated
OpenMP. In 2012 IEEE International Parallel & Distributed Process-
ing Symposium (IPDPS), Shanghai, China.
[27] Michael Wolfe. Implementing the PGI Accelerator Model. In
Proceedings of the 3rd Workshop on General-Purpose Computation on
Graphics Processing Units, page 43. ACM Press, 2010.