Redesigning OP2 Compiler to Use HPX Runtime Asynchronous Techniques by Khatami, Zahra et al.
Redesigning OP2 Compiler to Use HPX Runtime
Asynchronous Techniques
Zahra Khatami1,2, Hartmut Kaiser1,2, and J. Ramanujam1
1Center for Computation and Technology, Louisiana State University
2The STE||AR Group, http://stellar-group.org
Abstract—Maximizing parallelism level in applications can be
achieved by minimizing overheads due to load imbalances and
waiting time due to memory latencies. Compiler optimization is
one of the most effective solutions to tackle this problem. The
compiler is able to detect the data dependencies in an application
and is able to analyze the specific sections of code for paral-
lelization potential. However, all of these techniques provided
with a compiler are usually applied at compile time, so they rely
on static analysis, which is insufficient for achieving maximum
parallelism and producing desired application scalability. One
solution to address this challenge is the use of runtime methods.
This strategy can be implemented by delaying a certain amount of
code analysis until runtime.
In this research, we improve the parallel application perfor-
mance generated by the OP2 compiler by leveraging HPX, a
C++ runtime system, to provide runtime optimizations. These
optimizations include asynchronous tasking, loop interleaving,
dynamic chunk sizing, and data prefetching. The results of
the research were evaluated using an Airfoil application which
showed a 40-50% improvement in parallel performance.
Index Terms—HPX, OP2, Asynchronous Task Execution, In-
terleaving Loops, Controlling Chunk Sizes, Prefetching Data.
I. INTRODUCTION
Unstructured grids are well studied and utilized in vari-
ous application domains. OP2 provides a framework for the
parallel execution of these unstructured grid applications on
different multi-core/many-core hardware architectures [1], [2].
The main goal of developing OP2 is to provide an abstraction
level for users to parallelize their applications without having
to worry about architecture-specific optimizations. This
allows scientists to invest most of their time in understanding
their domain problems, without learning details of new archi-
tectures, and still achieve efficient utilization of the available
hardware. The framework is designed to achieve the near-
optimal scaling on multi-core processors [3], [4]. However,
as the compiler only has a static and defined access pattern
[5], [6], [7], its analysis is not enough to obtain desired parallel
scalability. In order to reach this goal, OP2 needs to be able
to extract parallelism automatically at runtime.
In this research, we propose different optimization meth-
ods that provide dynamic information for code generated
by the OP2 compiler, including providing asynchronous task
execution, interleaving different loops together, dynamically
setting chunk sizes of different dependent loops based on each
other, and prefetching data. These proposed techniques are
implemented using the HPX runtime system by redesigning the
OP2 framework so that it employs both the compiler's static
analysis and dynamic runtime information. HPX is a parallel
C++ runtime system that facilitates distributed operations and
enables fine-grained task parallelism resulting in a better load
balancing [8], [9]. It provides an efficient scalable parallelism
by significantly reducing processor starvation and effective
latencies while controlling overheads [10].
A closer analysis of unstructured applications reveals that
synchronization is only required between small tasks. Preva-
lent parallelization paradigms, however, coerce users to join
all tasks together before proceeding to the next step in the
application. In HPX, we can utilize the future construct to
allow every task to proceed as long as the values it depends
on are ready [11]. This feature allows HPX to relax the
global barriers, enable flexibility, and improve the parallel
performance of applications. In this research, HPX uses
future-based techniques to develop a new task execution strategy for
codes generated by the OP2 compiler, which is the basis for
asynchronous tasking and interleaving loops.
In order to control the overheads introduced by the creation
of each task, it is important to control the amount of work
performed by each task. This amount of work is known as the
chunk size [11], [12]. In addition, to properly interleave loops it
is important for each loop to have very similar execution times
which allows the waiting time between the execution of each
loop to be minimal. We propose to address these two obstacles
by creating a new execution policy which will dynamically
control the chunk sizes during the application’s execution. In
addition, we also propose to create a new cache prefetcher
that aids in prefetching data for each time step to reduce
memory access latencies. This method is implemented in
such a way that data of the next iteration step is prefetched
into the cache memory using a prefetching iterator called in
each iteration within a loop. The main difference between
this method and the other existing methods is that the HPX
implementation combines a thread-based prefetching method
with the asynchronous task execution, which results in having
asynchronous execution while prefetching data of all the
containers within a loop.
To our knowledge, we present the first attempt at redesigning
OP2 to utilize runtime techniques for improving the performance
of parallel unstructured grid applications. The
combination of these proposed techniques should yield a more
portable and performant software stack for unstructured grid
applications and enable the applications to properly scale to
a higher level of parallelism compared to the existing OP2
implementation. The results evaluated in Section VI show
that parallel performance is improved by around
40-50% for an Airfoil application. The remainder of this
paper is structured as follows: Section II briefly introduces
OP2; Section III introduces a dataflow object in HPX; Section
IV shows the details of the dataflow implementation with
the new execution policy within OP2; Section V presents the
prefetching method implemented in one of the HPX parallel
algorithms; and Section VI evaluates the scaling speedup
of the experimental tests. The conclusions and future work
can be found in Section VII.
II. OP2
OP2 is an active library that provides a parallel execution
framework for unstructured grid applications on different
multi-core/many-core hardware architectures [1]. It utilizes a
source-to-source translator for generating code which targets
different hardware configurations [2], [3], [13]. The code can
be transformed easily into different configurations such as
serial, multi-threaded using OpenMP and CUDA, or hetero-
geneous which utilizes MPI, OpenMP, and CUDA [3]. In this
section, we first walk through a simple OP2 code to show
its implementation details and then we introduce the Airfoil
application which is used as a case study for this research.
A. Simple Code Implementation with OP2
This section generally shows how unstructured grids are
defined with OP2. The OP2 API handles the data dependencies
by providing mesh represented data layouts. The provided
framework is defined based on sets, data on sets, mapping
connectivity between the sets, and the computation on each
set [2], [14]. Sets can be nodes, edges or faces. In these
unstructured grids, the connectivity information is used to
specify different mesh topologies. Figure 1 shows a mesh
example that includes nodes and faces as sets. The value of
data associated with each set is shown below each set and the
mesh is represented by the connections between them.
Fig. 1: The mesh represented data layouts provided with OP2.
The OP2 C/C++ API for the mesh in figure 1 is shown below;
it defines 9 nodes and 12 edges:
op_set nodes;
op_decl_set(9, nodes, "nodes");
op_set edges;
op_decl_set(12, edges, "edges");
The mapping that declares the connection between 2 nodes
is defined as follows:
int edge_map[28] = {0,1, 1,2, 2,5, 5,4, 4,3, 3,6, 6,7,
                    7,8, 0,3, 1,4, 2,5, 3,6, 4,7, 5,8};
op_map pedge;
op_decl_map(edges, nodes, 2, edge_map, pedge, "pedge");
op_decl_map shows that each edge is mapped on two
different nodes. The values of each node and face are assigned
as follows:
float valueFace[4] = {0.123, 0.151, 0.420, 0.112};
float valueNode[9] = {5.3, 1.2, 0.2, 3.4, 5.4, 6.2, 3.2, 2.5, 0.9};
op_dat data_face;
op_decl_dat(face, 1, "float", valueFace,
            data_face, "data_face");
op_dat data_node;
op_decl_dat(node, 1, "float", valueNode,
            data_node, "data_node");
These sets and meshes are used to define a loop over a
given set. More details about the OP2 design and performance
analysis can be found in [1] and [14], which show that all
unstructured grid applications can be easily described with sets
and meshes as shown in the above example. These methods
place no restriction on the algorithm and they allow the
programmer to choose unique operations on each set.
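The indirect access pattern implied by such a map can be sketched in plain C++ as follows. This is only an illustration and does not use the OP2 API: the SetMap struct and the sum_over_edges function are hypothetical names introduced here, and the kernel (summing the two node values of each edge) is an assumed example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an OP2 map from one set (edges) to another (nodes).
struct SetMap {
    std::size_t from_size;   // number of elements in the source set
    std::size_t dim;         // target elements per source element (2 nodes per edge)
    std::vector<int> index;  // flattened connectivity, from_size * dim entries
};

// For every edge, gather the values of the nodes it connects and sum them.
std::vector<float> sum_over_edges(const SetMap& pedge,
                                  const std::vector<float>& node_value) {
    assert(pedge.index.size() == pedge.from_size * pedge.dim);
    std::vector<float> edge_sum(pedge.from_size, 0.0f);
    for (std::size_t e = 0; e < pedge.from_size; ++e) {
        for (std::size_t k = 0; k < pedge.dim; ++k) {
            // Indirect access: the node index is read from the map.
            edge_sum[e] += node_value[pedge.index[e * pedge.dim + k]];
        }
    }
    return edge_sum;
}
```

The data on the target set is reached only through the map, which is why the access pattern of such a loop is not a simple contiguous sweep over memory.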
B. Airfoil Application
In this research, we study an Airfoil application,
which is a standard unstructured mesh finite volume
computational fluid dynamics (CFD) code, presented
in [15], for turbomachinery simulation; it consists
of over 720K nodes and about 1.5 million edges. As
described in [15] and [16], it consists of five parallel
loops: op_par_loop_save_soln, op_par_loop_adt_calc,
op_par_loop_res_calc, op_par_loop_bres_calc, and
op_par_loop_update, shown in figure 2. All of the
computations on each set are implemented within these
loops by performing operations of the user's kernels defined
in a header file for each loop: save_soln.h, adt_calc.h,
res_calc.h, bres_calc.h, and update.h. Each argument passed
to each loop is generated based on data values used with
op_arg_dat.
Figure 3 demonstrates op_par_loop_save_soln, which applies
save_soln on cells based on the arguments generated with
op_arg_dat using the p_q and p_qold data values. The function
op_arg_dat creates an OP2 argument based on the information
passed to it. These arguments explicitly indicate how
each of the underlying data can be accessed inside a loop:
OP_READ (read only), OP_WRITE (write), or OP_INC (increment
to avoid race conditions due to indirect data access)
[1]. More details can be found in [2] and [13].
The loop parsed with OP2 in figure 4 illustrates how each
cell updates its data value by accessing blockId, offset_b,
and nelem data elements. The arguments are passed to the
save_soln user kernel subroutine, which does the computation
for each iteration of an inner loop from offset_b to
op_par_loop_save_soln("save_soln", cells,
    op_arg_dat(data_a0, ...), ...,
    op_arg_dat(data_an, ...));
op_par_loop_adt_calc("adt_calc", cells,
    op_arg_dat(data_b0, ...), ...,
    op_arg_dat(data_bn, ...));
op_par_loop_res_calc("res_calc", edges,
    op_arg_dat(data_c0, ...), ...,
    op_arg_dat(data_cn, ...));
op_par_loop_bres_calc("bres_calc", bedges,
    op_arg_dat(data_d0, ...), ...,
    op_arg_dat(data_dn, ...));
op_par_loop_update("update", cells,
    op_arg_dat(data_e0, ...), ...,
    op_arg_dat(data_en, ...));
Fig. 2: Five loops used in Airfoil.cpp for saving old data values,
applying computation, and updating each data value.
op_par_loop_save_soln("save_soln", cells,
    op_arg_dat(p_q, -1, OP_ID, 4, "double", OP_READ),
    op_arg_dat(p_qold, -1, OP_ID, 4, "double", OP_WRITE)
);
Fig. 3: op_par_loop_save_soln represents one of the loops used in
an Airfoil application.
void op_par_loop_save_soln(char const* name,
    op_set set, op_arg arg0, op_arg arg1)
{
  ...
  #pragma omp parallel for
  for (int blockIdx = 0; blockIdx < nblocks; blockIdx++)
  {
    int blockId  = // based on the blockIdx
    int nelem    = // based on the blockId
    int offset_b = // based on the blockId
    for (int n = offset_b; n < offset_b + nelem; n++)
    {
      ...
      save_soln(...); // user's kernel
    }
  }
}
Fig. 4: #pragma omp parallel for is used for a loop parallelization
for an Airfoil application.
offset_b+nelem of each iteration of an outer loop from 0 to
nblocks. It also illustrates that OpenMP is used for the parallel
processing within a node. It is important to note that the
outputs of the computations shown in figure 2 cannot be passed
outside of the loop; therefore, the current OP2 design
does not provide a method for interleaving loops. This
creates an implicit global barrier after each loop, as the threads
inside the loop must wait to synchronize before exiting the
loop [17]. Barriers, naturally, impede optimal parallelization
by causing the parallel threads and processes to wait. In order
to solve this problem, this research sets out to optimize the
performance of code generated by the OP2 compiler using the
HPX runtime. The source-to-source code translator of OP2 is
written in Matlab and Python [13]. In this research, its Python
source-to-source code translator is modified to automatically
generate the parallel loops using HPX library calls.
III. HPX
In this research, different dynamic optimizations are proposed
for improving the performance of code generated by
the OP2 compiler. They are implemented using the HPX runtime
system, which has been developed to overcome limitations
such as global barriers and poor latency hiding [9], [10]
by embracing new ways of coordinating parallel execution,
controlling synchronization, and implementing latency hiding
using Local Control Objects (LCOs) [16], [18]. These
objects have the ability to create, resume, or suspend a
thread when triggered by one or more events. LCOs provide
traditional concurrency control mechanisms such as various
types of mutexes, semaphores, spinlocks, condition variables,
and barriers in HPX. These objects improve the efficiency of
an application by permitting highly dynamic flow control, as
they organize the execution flow, omit global barriers, and
enable thread execution to proceed as far as possible without
waiting. More details about LCO design and its performance
can be found in [10], [19], [20].
The two implementations of LCOs most relevant to this
research are the future construct and the dataflow template.
HPX provides a multi-threaded, message-driven, split-phase
transaction, and distributed shared memory programming
model using futures and dataflow-based synchronization on
large distributed system architectures, which are explained
in the following sections.
A. Future
A future is a computational result that is initially unknown
but becomes available at a later time [11]. The goal of using a
future is to let every computation proceed as far as possible.
Using futures enables threads to continue their execution
without waiting for the results of the previous steps to be
completed, which eliminates the implicit global barrier at the
end of the execution of an OpenMP parallel loop. Future-based
parallelization provides rich semantics for exploiting
the higher-level parallelism available within each application, which
may significantly improve its scalability.
Fig. 5: The principle of the operation of the future in HPX.
Thread 1 is suspended only if the results from locality 2 are not
readily available. Thread 1 accesses the future value by performing
future.get(). If results are available, Thread 1 continues to complete
the execution.
Fig. 6: A dataflow object encapsulates a function
F (in1, in2, ..., inn) with n inputs from different data
resources. As soon as the last input argument has been
received, the function F is scheduled for an execution.
Figure 5 shows how the future operates with
2 localities, where a locality is a collection of processing
units (PUs) that have access to the same main memory. It
illustrates that the other threads do not stop their progress
even if the thread, which waits for the value to be computed,
is suspended. Threads access a future value by performing
future.get(). When the result becomes available, the future
resumes all HPX suspended threads waiting for that value.
It can be seen that this process eliminates the global barrier
synchronizations, as only those threads that depend on the
future value are suspended. With this scheme, HPX allows
asynchronous execution of the threads.
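The behavior of get() described above can be sketched with the C++ standard library, whose std::future and std::async are used here as stand-ins for the HPX equivalents. This is an assumption made for a self-contained example: unlike HPX, std::async with std::launch::async uses operating system threads, and the function names slow_square and overlap_work are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <thread>

// A producer that takes some time to deliver its result.
int slow_square(int x) {
    std::this_thread::sleep_for(std::chrono::milliseconds(20));
    return x * x;
}

int overlap_work(int x) {
    // Launch the producer; the future is a placeholder for its result.
    std::future<int> f = std::async(std::launch::async, slow_square, x);
    assert(f.valid());  // the future refers to the producer's shared state

    // Independent work proceeds without waiting for the producer.
    int independent = x + 1;

    // Only the consumer of the value waits, and only if it is not ready yet.
    return f.get() + independent;
}
```

Only the code that calls get() is delayed; any other thread that does not depend on the value is unaffected, which is the property the text contrasts with a global barrier.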
B. Dataflow Object
A dataflow object provides a powerful mechanism for managing
data dependencies without the use of global barriers [21],
[8]. Figure 6 shows the schematic of a dataflow object, which
encapsulates a function F (in1, in2, ..., inn) with n future or
non-future inputs from different data resources. If an input is
a future, then the invocation of the function is delayed until
that future is ready. Non-future inputs are passed through. A
dataflow object waits for a set of futures to become ready and,
as soon as the last input argument has been received, the function F is
scheduled for execution [19]. Because the dataflow object
using hpx::lcos::local::dataflow;
using hpx::util::unwrapped;
// automatically returns the argument as a future
return dataflow(unwrapped([&](data_a, ...) {
    // same as original op_arg_dat
    return arg;
}), data_a, ...);
Fig. 7: op_arg_dat is modified to create an argument as a future
that is passed to a function through op_par_loop shown in figure 2.
returns a future, its result can be fed to other objects in the
system, including other dataflows. These chained futures, by
their nature, represent a dependency tree that automatically
generates an execution graph. This graph is executed by
the runtime system as each node's dependencies are met.
As a result, dataflow minimizes the total synchronization by
scheduling new tasks as soon as they can be run instead of
waiting for entire blocks of tasks to finish computation.
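The chaining of dataflow results can be illustrated with a minimal sketch built on the C++ standard library. It is not the HPX implementation: dataflow_sketch is a hypothetical two-input helper, and it occupies an operating system thread while waiting for its inputs, whereas an HPX dataflow schedules the function only once the inputs are ready.

```cpp
#include <cassert>
#include <future>

// Hypothetical helper: invoke f once both inputs are ready and
// return the result as a future that can feed further calls.
template <typename F, typename A, typename B>
auto dataflow_sketch(F f, std::shared_future<A> a, std::shared_future<B> b)
    -> std::shared_future<decltype(f(a.get(), b.get()))>
{
    assert(a.valid() && b.valid());
    return std::async(std::launch::async, [f, a, b]() {
        // Waits only for the two inputs this task depends on.
        return f(a.get(), b.get());
    }).share();
}

// Builds a small dependency graph: five depends on three, three on one and two.
int chained_example() {
    auto one = std::async(std::launch::async, [] { return 1; }).share();
    auto two = std::async(std::launch::async, [] { return 2; }).share();
    auto add = [](int x, int y) { return x + y; };
    auto three = dataflow_sketch(add, one, two);
    auto five = dataflow_sketch(add, three, two);
    return five.get();
}
```

No explicit synchronization appears between the two additions; the order of execution follows from which futures each call consumes.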
IV. IMPLEMENTING DATAFLOW IN OP2
In this section, a new method is proposed for parallelizing
loops generated with OP2. It is based on a dataflow implementation
that solves the current challenges of OP2. In
this method, the OP2 API is modified in such a way that
op_arg_dat used in each loop in figure 2 produces an argument
as a future for dataflow object inputs. Figure 7 shows the
modified op_arg_dat, in which the function is invoked only
once all of the inputs data_a,... listed on the last line of the code
are ready. unwrapped is a helper function in HPX that
unwraps the futures and passes along the actual results.
This implementation also generates an output argument as a
future and, as a result, all of the arguments of each loop in
figure 2 are passed as futures to the kernel function through
op_par_loop.
A. Parallelizing Loops Using for each
Parallelizing loops and controlling chunk sizes are
implemented by using the for_each algorithm and
persistent_auto_chunk_size as an execution policy, respectively. In
figure 8, dataflow is implemented with for_each for the loop in
figure 4, which parallelizes the outer loop. for_each is one
of the HPX parallel algorithms that is able to automatically
control the chunk size during the execution by determining
the number of iterations to be run on each HPX thread.
Moreover, HPX is able to execute loops sequentially or
in parallel by applying execution policies, which are briefly
described in Table I [19]. The concept of the execution policy
developed in HPX is used to specify the execution restrictions
of the work items: calling an algorithm with a sequential execution
policy makes it run sequentially, and calling it
with a parallel execution policy allows it to run
in parallel [18].
Policy      Description                             Implemented by
seq         sequential execution                    Parallelism TS, HPX
par         parallel execution                      Parallelism TS, HPX
par_vec     parallel and vectorized execution       Parallelism TS
seq(task)   sequential and asynchronous execution   HPX
par(task)   parallel and asynchronous execution     HPX
TABLE I: The execution policies implemented in HPX.
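The role of the chunk size in a parallel for_each can be sketched as follows. This is a simplified illustration using the C++ standard library, not the HPX algorithm: chunked_for_each is a hypothetical name, the chunk size is passed in by the caller instead of being determined automatically, and one task is created per chunk.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Apply f to every index in [0, n), one asynchronous task per chunk.
template <typename F>
void chunked_for_each(std::size_t n, std::size_t chunk, F f) {
    assert(chunk > 0);
    std::vector<std::future<void>> tasks;
    for (std::size_t begin = 0; begin < n; begin += chunk) {
        const std::size_t end = std::min(begin + chunk, n);
        // Each task runs `end - begin` iterations; this count is the chunk size.
        tasks.push_back(std::async(std::launch::async, [=] {
            for (std::size_t i = begin; i < end; ++i) f(i);
        }));
    }
    // Wait for all chunks before returning to the caller.
    for (auto& t : tasks) t.get();
}
```

A small chunk creates many tasks and raises the scheduling overhead, while a large chunk leaves fewer tasks to balance across threads, which is the trade-off that automatic chunk sizing tries to manage.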
hpx::shared_future<op_dat>
op_par_loop_save_soln(char const* name,
    op_set set,
    hpx::future<op_arg> arg0,
    hpx::future<op_arg> arg1)
{
  using hpx::lcos::local::dataflow;
  using hpx::util::unwrapped;
  // automatically returns output as a future
  return dataflow(unwrapped([&save_soln]
      (op_arg arg0, op_arg arg1) {
    ...
    auto r = boost::irange(0, nblocks);
    hpx::parallel::for_each(policy,
        r.begin(), r.end(),
        [&](std::size_t blockIdx) {
          for (int n = offset_b; n < offset_b + nelem; n++)
          {
            ...
            save_soln(...);
          }
        });
    return arg1;
  }), arg0, arg1);
}
Fig. 8: Implementing for_each within dataflow for the loop
parallelization in OP2 for the loop in figure 4. It makes the invocation
of a loop asynchronous by returning the output as a future. dataflow
automatically creates an execution graph, which represents a
dependency tree.
p_qold =
op_par_loop_save_soln("save_soln", cells,
    op_arg_dat(p_q, -1, OP_ID,
               4, "double", OP_READ),
    op_arg_dat(p_qold, -1, OP_ID,
               4, "double", OP_WRITE));
Fig. 9: Airfoil.cpp is changed while using dataflow for the loop
parallelization in OP2. p_qold is returned as a future from each
kernel function after calling op_par_loop.
Fig. 10: The proposed method makes OP2 able to
interleave these two loops by passing the p_qold
output of op_par_loop_save_soln as an input argument for
op_par_loop_update.
Fig. 11: dataflow provides a way for interleaving execution of
different loops together by generating output as a future and
passing all inputs as futures as well.
Figure 8 also illustrates that arg0 and arg1, which are
created as futures with op_arg_dat using p_q and p_qold
respectively, are passed as futures into the loop. This loop
is executed only once these arguments are ready. Then, the
output argument, which is arg1 in this example, is passed as
a future to the outside of the loop and is stored in
p_qold, as shown in figure 9. This method is applied to all
of the loops in figure 2 and, as a result, each kernel function
returns an output argument as a future. The execution of a loop
may depend on the results of previous loops. With
this method, the results of the loops can be passed as
future inputs to the other loops, which makes OP2 able to
interleave different loops. For example, the p_qold value updated
in op_par_loop_save_soln is used as an input argument for
op_par_loop_update, as shown in figure 10, so the two loops
can be interleaved.
Figure 11 shows that, with the proposed
method, passing the future output of each loop as an input of
the other loops makes OP2 able to interleave different loops
at runtime. As a result, if the loops are not dependent
on each other, they can be executed without waiting for the
previous loops to complete their tasks; however, if they depend
on the parameters from the previous loops, they wait until
the previous loops complete their processes. This proposed
(a) Chunk sizes with different execution time
(b) Chunk sizes with the same execution time
Fig. 12: Setting chunk sizes of different dependent loops based on
each other.
method removes the unnecessary barrier synchronizations between
different loops and executes them asynchronously.
B. Controlling Chunk Sizes
As explained in section IV-A, figure 11 shows how
dataflow provides a way of interleaving the execution of different
loops. In the case of dependent loops, the
execution of each chunk in a loop depends on the execution of
the chunks in the previous loop. With par as the execution
policy, the chunks of each loop are determined regardless
of the chunk sizes of the other loops, so they have different
execution times, as shown in figure 12a, which may increase the waiting
time between them. To decrease this waiting time, the
execution time of each chunk in these dependent loops should
be the same. For this purpose, a new execution policy is
proposed in this section, named persistent_auto_chunk_size,
that gives the chunks of different loops the same
execution time, as shown in figure 12b. In this policy, the
chunk size of the first loop is determined automatically by the
for_each algorithm. Then the chunk sizes of the second and
third loops are determined based on the execution time of the
chunk in the first loop. As a result, all chunks of all
three loops have the same execution time. It should be
noted that chunk1, chunk2, and chunk3 have different sizes but
the same execution time.
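The arithmetic behind matching chunk execution times can be sketched as below. This is an assumed formulation, not the persistent_auto_chunk_size implementation: it supposes that a per-iteration time has been measured for each loop, and the function name matched_chunk_sizes and its parameters are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Given the chunk size chosen for the first loop and a measured time per
// iteration for each dependent loop, choose chunk sizes so that every
// chunk takes about as long as one chunk of the first loop.
std::vector<std::size_t> matched_chunk_sizes(
    std::size_t first_chunk, const std::vector<double>& time_per_iteration)
{
    std::vector<std::size_t> sizes;
    if (time_per_iteration.empty()) return sizes;

    // Execution time of one chunk in the first loop.
    const double target = static_cast<double>(first_chunk) * time_per_iteration[0];

    for (double t : time_per_iteration) {
        assert(t > 0.0);
        const std::size_t n = static_cast<std::size_t>(target / t);
        sizes.push_back(std::max<std::size_t>(n, 1));  // at least one iteration
    }
    return sizes;
}
```

For example, if the first loop uses chunks of 100 iterations at 2 time units each, a loop whose iterations cost 4 units gets chunks of 50 iterations, so both chunks take 200 units.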
V. HPX DATA PREFETCHER
Data prefetching is one of the methods for reducing memory
access latencies by loading data required for the next step into
the cache [22]. The simplest form of cache prefetching
can be implemented by prefetching the cache line of the next
iteration as soon as the current cache line is referenced [23],
[24]. Hardware, software, and thread prefetching are the
traditional techniques for this purpose.
Various hardware prefetching methods have been proposed;
one of them uses the one-block-lookahead (OBL) scheme
[25]. In this method, the blocks i + 1, i + 2, ..., and i + n
are prefetched whenever the block i is brought into the cache,
which reduces cache misses significantly. Creating a
reference prediction table [26], [27] is another method to limit
unnecessary prefetching and to predict future memory
references. However, one of the big challenges in most
of these hardware prefetching methods is that the prefetcher
uses the past access pattern of the data stream, so it
cannot handle an irregular access pattern.
In the software prefetcher method, data prefetching
is implemented by using prefetch directives in the code.
One problem with this method is that these prefetching
instructions are inserted by the programmer or the compiler into
the applications, which has a high probability of cache
miss occurrences. Another problem is the additional
overhead of executing these prefetch instructions. Many
developments have been proposed for optimizing this technique,
mostly obtained by prefetching pointer-based
data structures [25], [28]. Mowry's algorithm [29] is one
such prefetching optimization; it defines the affine array
references as the prefetching candidates within an innermost
loop, performs loop unrolling, and creates multiple
memory references within a loop. As a result, the exact
missing instance is prefetched, which avoids unnecessary
prefetching and reduces prefetching overheads. Jump pointer
prefetching [26], [27] is another proposed software prefetching
approach, which is implemented by inserting additional
pointers into a dynamic data structure for connecting
non-consecutive elements within a loop. This technique allows
prefetching data by creating a pointer chain and results in
overlapping the fetching of multiple elements.
However, this technique also has difficulty in handling
sequences of irregular data accesses [23].
Thread based prefetching method is usually preferred over
the software / hardware prefetching methods, since it precom-
putes the load addresses accurately and it is able to follow
more complex patterns compared to the other methods [24].
This technique executes an application in the prefetcher thread
context and brings data of the next cache line into the shared
cache before the main thread accesses it. However, the scaling
can be degraded with this method because of
1) cache misses: the prefetcher could make slower progress
than the main thread, and
2) global barriers: a global barrier is needed to synchronize
the prefetcher with the main thread [22], [24], [25].
In this section, a new prefetching method is introduced
in HPX that combines thread-based prefetching with
asynchronous task execution. The main goal of this method
is not only to reduce memory access latencies, but also
to relax the global barriers, which results in better parallel
performance.
Figure 13 shows the scheme of using future and the
proposed prefetching iterator, which lets HPX execute
asynchronously while prefetching data of all the
Fig. 13: Data of the next iteration step is prefetched into the cache
memory with the prefetching iterator called in each iteration within
for_each.
auto ctx = hpx::parallel::make_prefetcher_context(
    loop_range.begin(), loop_range.end(),
    prefetch_distance_factor,
    container_1, container_2, ..., container_n);
hpx::parallel::for_each(policy,
    ctx.begin(), ctx.end(),
    [&](std::size_t i)
    {
      container_1[i] = ...;
      container_2[i] = ...;
      ...
      container_n[i] = ...;
    });
Fig. 14: The prefetching method used in for_each. The prefetching
iterator in for_each is obtained with ctx.begin(), where ctx is the struct
that references all containers in the loop.
containers within a loop for the next step into the cache
memory in each iteration. Moreover, HPX is able to prefetch
data sequentially or in parallel by applying the execution policies
described in Table I. This method is added to the method
explained in section IV-A to decrease the memory access
latencies while parallelizing loops.
Figure 14 shows the details of the prefetching method
implementation within for_each. The program execution is
divided into several chunks within for_each and its iterator
is developed to prefetch data of the next chunk either
sequentially or in parallel. The prefetching iterator is initialized
by calling make_prefetcher_context, and it
is used through ctx.begin(); ctx is the struct that
references all containers used in a loop, and loop_range
is the range over which the loop is executed. One of the features
Fig. 15: Comparison results of the execution time between dataflow
and #pragma omp parallel for used for an Airfoil application.
of this prefetcher is that it works with any data type, even
when each container has a different type. The
distance between two consecutive prefetching operations is computed
based on the value of prefetch_distance_factor. In order to
increase the effectiveness of the prefetcher and to decrease
its relative cost, prefetch_distance_factor is designed to be
determined based on the length of the cache line. As a
result, within each prefetch distance, data of all containers
for the next time step are prefetched in each iteration by
calling this prefetching iterator. The experimental results of
optimizing OP2 performance with HPX discussed in this paper
are presented in the next section.
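The idea of prefetching all containers of a loop at a distance derived from the cache line length can be sketched with a compiler builtin. This is not the HPX prefetching iterator: it is a sequential illustration that assumes GCC or Clang (for __builtin_prefetch), a 64-byte cache line, and a hypothetical function name scale_with_prefetch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed cache line length, for illustration only.
constexpr std::size_t cache_line_bytes = 64;

// Compute b[i] = s * a[i] while requesting, once per cache line, the data
// of both containers that lies `factor` cache lines ahead.
void scale_with_prefetch(const std::vector<double>& a, std::vector<double>& b,
                         double s, std::size_t factor) {
    assert(a.size() == b.size());
    const std::size_t per_line = cache_line_bytes / sizeof(double);
    const std::size_t distance = factor * per_line;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (i % per_line == 0 && i + distance < a.size()) {
            __builtin_prefetch(&a[i + distance], 0);  // will be read
            __builtin_prefetch(&b[i + distance], 1);  // will be written
        }
        b[i] = s * a[i];
    }
}
```

Issuing the request once per cache line rather than once per element keeps the number of prefetch instructions low, which relates to the relative cost that the text ties to prefetch_distance_factor.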
VI. EXPERIMENTAL RESULTS
In this section, we evaluate the experimental results of our
work by comparing our proposed framework to OP2’s current
design. The main goal of this section is to illustrate that
dynamic information obtained at runtime and static information
obtained at compile time are both necessary to provide
sufficient optimizations for optimal performance. The proposed
methods studied in the previous sections are evaluated here.
The experiments are executed on a test machine with two
Intel Xeon E5-2630 processors, each with 8 cores clocked at
2.4 GHz, and 65 GB of memory. Hyper-threading is enabled. The
OS used by the shared memory system is 32-bit Linux Mint 17.2,
and HPX 0.9.99 is used.
A. Asynchronous Task Execution Provided with Dataflow
Figure 15 shows the execution time of an Airfoil application
using #pragma omp parallel for and dataflow, which
illustrates that HPX and OpenMP have approximately the same
performance on 1 thread. However, dataflow improves the
parallel performance for larger numbers of threads. For the
speedup analysis, we use strong scaling, in which the problem
size is kept the same as the number of cores is increased.
Figure 16 shows the strong scaling comparison, which
illustrates 33% better performance for dataflow due to the
asynchronous task execution, the use of futures, and the
interleaving of different dependent loops.
As described in section III, dataflow automatically generates
an (implicit) execution tree, which represents a dependency
graph. This removes unnecessary global barriers and improves
the scalability of parallel applications.
Fig. 16: Comparison results of the strong scaling between dataflow
and #pragma omp parallel for used for an Airfoil application. This
comparison result illustrates a better performance for dataflow for
a larger number of threads, which is due to the asynchronous task
execution. dataflow automatically generates an execution tree, which
represents a dependency graph and allows an asynchronous execution
of the functions. Hyperthreading is enabled after 16 threads.
Fig. 17: Comparison results of strong scaling using dataflow
with/without setting chunk sizes of different dependent loops based on
each other. Hyperthreading is enabled after 16 threads.
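The idea of replacing global barriers with a dependency graph can be
sketched with standard C++ futures. This is only an analogy: std::async
and std::future are used here as stand-ins for HPX futures and
dataflow, and the functions scale, dot, and run_dependency_graph are
hypothetical names introduced for the example. Two independent
"loops" run as separate tasks, and a third task waits only for the
two results it actually consumes.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <numeric>
#include <utility>
#include <vector>

// Stand-in for one parallel loop: scales every element of v by factor.
inline std::vector<double> scale(std::vector<double> v, double factor) {
    for (double& x : v) x *= factor;
    return v;
}

// Stand-in for a loop that reads the outputs of two earlier loops.
inline double dot(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    return std::inner_product(a.begin(), a.end(), b.begin(), 0.0);
}

inline double run_dependency_graph(std::size_t n) {
    // Two independent loops are launched as asynchronous tasks.
    auto fa = std::async(std::launch::async, scale,
                         std::vector<double>(n, 1.0), 2.0);
    auto fb = std::async(std::launch::async, scale,
                         std::vector<double>(n, 1.0), 3.0);

    // The dependent loop receives the futures themselves, so it blocks only
    // on its own inputs; there is no barrier across all running tasks.
    auto fc = std::async(
        std::launch::async,
        [](std::future<std::vector<double>> x,
           std::future<std::vector<double>> y) {
            return dot(x.get(), y.get());
        },
        std::move(fa), std::move(fb));

    return fc.get();
}
```

In this sketch the dependent task still occupies a thread while it
waits for its inputs. A dataflow-style construct instead defers
scheduling the task until its input futures are ready, which is the
behavior the paper attributes to dataflow.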
1) Controlling Chunk Sizes: In this section, the chunk
sizes of different loops are set by considering the chunk
sizes determined in the previous loops. Since dataflow enables
the compiler to interleave different loops, the execution of
each chunk in a loop depends on the execution of the chunks
in the previous loops. Using persistent_auto_chunk_size
therefore makes the execution times of the chunks in these
loops the same, which decreases the waiting time between
them. Figure 17 shows the improvement in the performance of
the dataflow method obtained by using
persistent_auto_chunk_size as an execution policy within the
loops. For instance, with 32 threads, the improvement is
about 40%.
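The effect of matching chunk durations across dependent loops can be
shown with a small arithmetic sketch. It does not reproduce
persistent_auto_chunk_size: the struct chunk_plan and the functions
plan_first_loop and plan_dependent_loop are hypothetical, and the
per-iteration costs are passed in as assumed inputs, whereas a runtime
system would obtain them by measurement. The point is only that a
dependent loop with a different per-iteration cost needs a different
chunk size to reach the same time per chunk.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical record of the chunking decision made for a loop.
struct chunk_plan {
    std::size_t chunk_size;  // iterations per chunk
    double chunk_time;       // estimated time per chunk (arbitrary units)
};

// Plans the first loop: choose a chunk size that gives roughly
// target_time per chunk for the given per-iteration cost.
inline chunk_plan plan_first_loop(double time_per_iteration,
                                  double target_time) {
    assert(time_per_iteration > 0.0 && target_time > 0.0);
    const std::size_t size = std::max<std::size_t>(
        1, static_cast<std::size_t>(target_time / time_per_iteration));
    return {size, size * time_per_iteration};
}

// Plans a dependent loop: reuse the chunk time of the previous loop so that
// chunks of both loops take about equally long.
inline chunk_plan plan_dependent_loop(const chunk_plan& previous,
                                      double time_per_iteration) {
    assert(time_per_iteration > 0.0);
    const std::size_t size = std::max<std::size_t>(
        1, static_cast<std::size_t>(previous.chunk_time / time_per_iteration));
    return {size, size * time_per_iteration};
}
```

For example, if an iteration of the second loop costs twice as much as
an iteration of the first, the second loop receives chunks with half as
many iterations, so a chunk of the second loop does not keep the
consumers of its result waiting longer than a chunk of the first.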
For further parallelization performance improvement, data
prefetching proposed in section V is implemented in the
dataflow method and its results are evaluated in the next
section.
Fig. 18: Performance of dataflow with and without the proposed
prefetching method. The speedup is increased by around 45% when
data is prefetched within a loop. Hyperthreading is enabled after
16 threads.
Fig. 19: The data transfer rate of hpx::for_each using a
standard random access iterator versus the prefetching iterator
within a dataflow. Hyperthreading is enabled after 16 threads.
Fig. 20: The data transfer rate of the prefetching iterator for
different prefetching distances. Hyperthreading is enabled after 16
threads.
B. Prefetching Data
The proposed prefetching method is applied to the dataflow
method, and its performance is shown in Figure 18. This
method takes advantage of the asynchronous execution while
prefetching the data of the next step within a loop into the
cache memory in each iteration step. These results illustrate
that the parallel performance of for_each is improved by an
average of 45%, which confirms that the HPX prefetching
iterator successfully avoids cache misses. The bandwidth
comparison for these results is also shown in Figure 19.
The parallel performance of the prefetching iterator measured
with different values of prefetch_distance_factor is shown in
Figure 20. It can be seen that for very large distances, data
prefetching cannot improve the parallel performance. On the
other hand, very small prefetch distances cause more data to
be prefetched, which becomes more expensive. This cost
dominates the gains from prefetching and impedes scaling. The
results show that prefetch_distance_factor = 15 improves the
parallel performance of an Airfoil application significantly.
These results show the good scalability achieved by HPX and
indicate that it has the potential to continue to scale on a
larger number of threads.
VII. CONCLUSION
In this research, we present an implementation of the OP2
compiler that employs HPX runtime techniques to efficiently
and automatically parallelize unstructured grid applications to
achieve the desired parallel scalability. The results
illustrate that both the dynamic information provided at
runtime and the static information provided at compile time
are necessary to obtain a higher parallelism level in the
applications.
In the proposed framework, OP2 is able to automatically
produce data dependencies based on arguments that are passed
into the loops at compile time and, by using HPX parallelism
methods, the generated loops can be executed asynchronously.
In this framework, we propose different optimization methods
that make OP2 execute tasks asynchronously, interleave
different loops together, efficiently control the chunk sizes of
different dependent loops based on each other, and prefetch
data into the cache before its actual access. These proposed
methods improved the overall performance of an Airfoil
application by 40-50%.
In future research, we plan to improve runtime optimizations
with information from the compiler. Since runtime information
is often speculative, solely relying on it doesn’t guarantee
maximizing parallelization performance. In general, the
parallelization performance of an application depends on the
values measured at runtime and the related transformations
such as loop skewing and loop scheduling performed at
compile time. Collecting the outcome of the static analysis
performed by the compiler could significantly improve the
runtime performance.
Acknowledgements
We would like to thank Adrian Serio from the Center for
Computation and Technology at Louisiana State University
for the invaluable and helpful comments and suggestions to
improve the quality of the paper. This work was supported
by NSF award 1447831.
REFERENCES
[1] GA Mudalige, MB Giles, I Reguly, C Bertolli, and PHJ Kelly. Op2: An
active library framework for solving unstructured mesh-based applica-
tions on multi-core and many-core architectures. In Innovative Parallel
Computing (InPar), 2012, pages 1–12. IEEE, 2012.
[2] Mike B Giles, Gihan R Mudalige, Zohirul Sharif, G Markall, and
Paul HJ Kelly. Performance analysis and optimization of the op2
framework on many-core architectures. The Computer Journal, page
bxr062, 2011.
[3] GR Mudalige, MB Giles, B Spencer, C Bertolli, and IZ Reguly.
Designing op2 for gpu architectures. Journal of Parallel and Distributed
Computing, 2012.
[4] Carlo Bertolli, Adam Betts, Paul HJ Kelly, Gihan R Mudalige, and
Mike B Giles. Mesh independent loop fusion for unstructured mesh
applications. In Proceedings of the 9th conference on Computing
Frontiers, pages 43–52. ACM, 2012.
[5] Devang Patel and Lawrence Rauchwerger. Implementation issues of
loop-level speculative run-time parallelization. In International Confer-
ence on Compiler Construction, pages 183–197. Springer, 1999.
[6] Lawrence Rauchwerger, Nancy M Amato, and David A Padua. A
scalable method for run-time loop parallelization. International Journal
of Parallel Programming, 23(6):537–576, 1995.
[7] Lawrence Rauchwerger, Nancy M Amato, and David A Padua. Run-
time methods for parallelizing partially parallel loops. In Proceedings
of the 9th international conference on Supercomputing, pages 137–146.
ACM, 1995.
[8] Thomas Heller, Hartmut Kaiser, Andreas Schäfer, and Dietmar Fey.
Using HPX and LibGeoDecomp for scaling HPC applications on het-
erogeneous supercomputers. In Proceedings of the Workshop on Latest
Advances in Scalable Algorithms for Large-Scale Systems, page 1. ACM,
2013.
[9] T Heller, H Kaiser, and Klaus Iglberger. Application of the ParalleX
execution model to stencil-based problems. Computer Science-Research
and Development, 28(2-3):253–261, 2013.
[10] Patricia Grubel, Hartmut Kaiser, Jeanine Cook, and Adrian Serio. The
Performance Implication of Task Size for Applications on the HPX
Runtime System. In Cluster Computing (CLUSTER), 2015 IEEE
International Conference on, pages 682–689. IEEE, 2015.
[11] Henry C Baker Jr and Carl Hewitt. The incremental garbage collection
of processes. In ACM Sigplan Notices, volume 12, pages 55–59. ACM,
1977.
[12] Hartmut Kaiser, Thomas Heller, Bryce Adelstein-Lelbach, Adrian Serio,
and Dietmar Fey. HPX: A Task Based Programming Model in a Global
Address Space. In Proceedings of the 8th International Conference on
Partitioned Global Address Space Programming Models, page 6. ACM,
2014.
[13] Carlo Bertolli, Adam Betts, Gihan Mudalige, Mike Giles, and Paul
Kelly. Design and performance of the op2 library for unstructured mesh
applications. In Euro-Par 2011: Parallel Processing Workshops, pages
191–200. Springer, 2011.
[14] Mike B Giles, Gihan R Mudalige, Z Sharif, G Markall, and Paul HJ
Kelly. Performance analysis of the op2 framework on many-core
architectures. ACM SIGMETRICS Performance Evaluation Review,
38(4):9–15, 2011.
[15] MB Giles, D Ghate, and MC Duta. Using automatic differentiation for
adjoint cfd code development. 2005.
[16] Zahra Khatami, Hartmut Kaiser, and J Ramanujam. Using hpx and
op2 for improving parallel scaling performance of unstructured grid
applications. In Parallel Processing Workshops (ICPPW), 2016 45th
International Conference on, pages 190–199. IEEE, 2016.
[17] LA Smith. Mixed mode MPI/OpenMP programming. UK High-End
Computing Technology Report, pages 1–25, 2000.
[18] Thomas Heller, Hartmut Kaiser, Patrick Diehl, Dietmar Fey, and
Marc Alexander Schweitzer. Closing the performance gap with modern
c++. In International Conference on High Performance Computing,
pages 18–31. Springer, 2016.
[19] Hartmut Kaiser, Thomas Heller, Daniel Bourgeois, and Dietmar Fey.
Higher-level parallelization for local and distributed asynchronous task-
based programming. In Proceedings of the First International Workshop
on Extreme Scale Programming Models and Middleware, pages 29–37.
ACM, 2015.
[20] Chirag Dekate. Extreme Scale Parallel NBody Algorithm with Event
Driven Constraint Based Execution Model. PhD thesis, Citeseer, 2011.
[21] Patricia Grubel, Hartmut Kaiser, Kevin Huck, and Jeanine Cook. Using
intrinsic performance counters to assess efficiency in task-based parallel
applications.
[22] Jamison Collins, Suleyman Sair, Brad Calder, and Dean M Tullsen.
Pointer cache assisted prefetching. In Proceedings of the 35th annual
ACM/IEEE international symposium on Microarchitecture, pages 62–73.
IEEE Computer Society Press, 2002.
[23] Ilya Ganusov and Martin Burtscher. Efficient emulation of hardware
prefetchers via event-driven helper threading. In Proceedings of the
15th international conference on Parallel architectures and compilation
techniques, pages 144–153. ACM, 2006.
[24] Jaejin Lee, Changhee Jung, Daeseob Lim, and Yan Solihin. Prefetching
with helper threads for loosely coupled multiprocessor systems. Parallel
and Distributed Systems, IEEE Transactions on, 20(9):1309–1324, 2009.
[25] Abdel-Hameed Badawy, Aneesh Aggarwal, Donald Yeung, and Chau-
Wen Tseng. The efficacy of software prefetching and locality op-
timizations on future memory systems. Journal of Instruction-Level
Parallelism, 6(7), 2004.
[26] David Callahan, Ken Kennedy, and Allan Porterfield. Software prefetch-
ing. In ACM SIGARCH Computer Architecture News, volume 19, pages
40–52. ACM, 1991.
[27] Kavita Bala, M Frans Kaashoek, and William E Weihl. Software
prefetching and caching for translation lookaside buffers. In Proceedings
of the 1st USENIX conference on Operating Systems Design and
Implementation, page 18. USENIX Association, 1994.
[28] Daniel F Zucker, Ruby B Lee, and Michael J Flynn. Hardware and
software cache prefetching techniques for mpeg benchmarks. IEEE
Transactions on Circuits and Systems for Video Technology, 10(5):782–
796, 2000.
[29] John L Hennessy and David A Patterson. Computer architecture: a
quantitative approach. Elsevier, 2011.
