Impact of the memory hierarchy on shared memory architectures in multicore programming models by Badia Sala, Rosa Maria et al.
Impact of the memory hierarchy on shared memory architectures in multicore
programming models
Rosa M. Badia, Josep M. Perez, Eduard Ayguadé and Jesus Labarta
Barcelona Supercomputing Center and Universitat Politècnica de Catalunya
Barcelona, SPAIN
{rosa.m.badia, josep.m.perez, eduard.ayguade, jesus.labarta}@bsc.es
Abstract
Many-core and multicore architectures put considerable pressure
on parallel programming, but they also give a unique opportunity
to propose new programming models that automatically exploit
the parallelism of these architectures. OpenMP is a very well
known standard that exploits parallelism in shared memory
architectures. SMPSs has recently been proposed as a task
based programming model that exploits parallelism at the
task level and takes into account data dependencies between
tasks. However, besides the parallelism expressed in the program,
the impact of the memory hierarchy in many-core and multicore
architectures is of great importance. This paper presents an
evaluation of these two programming models with regard
to the impact of the different levels of the memory hierarchy
on the duration of the application. The evaluation is based
on tracefiles with hardware counters from the execution of a
memory intensive benchmark in both programming models.
Keywords: SMP Superscalar, programming models for
multicore, task scheduling, locality exploitation
1. Introduction
Computer fabrication trends have evolved towards machines
at all levels (from consumer systems to large HPC systems)
built with homogeneous or heterogeneous multicore chips.
For this reason there is now, more than ever, a need for
parallel programming models that make it easy to exploit
the possibilities of these chips. Several programming models
have been proposed with this objective. OpenMP [1] is a very
well known standard that exploits parallelism in shared
memory architectures. The recent version 3.0 [2] extends its
functionality to include support for task level parallelism.
In a similar way, SMP Superscalar (SMPSs) [3] has recently
been proposed as a task based programming model that
exploits parallelism at the task level and takes into account
data dependencies between tasks. Both are the focus of this
paper. While previous studies [3], [4] have focused more on
the performance achieved by the corresponding runtimes, this
paper presents an evaluation of these two programming models
with regard to the impact of the different levels of the memory
hierarchy on the duration of the application tasks. The
evaluation is based on tracefiles with hardware counters from
the execution of a memory intensive benchmark (STREAM) in
both programming models.
The paper is structured as follows: section 2 outlines
the main characteristics of SMPSs and OpenMP, section 3
outlines some related work, section 4 presents experimental
results with the STREAM benchmark, section 5 outlines a
new mechanism to further improve the locality exploitation
in SMPSs, and finally section 6 concludes the paper.
2. Task Based Programming Models
Since the focus of this paper is SMPSs and OpenMP, the
next two subsections present a more detailed summary of these
two programming models.
2.1. SMP Superscalar
Star Superscalar (StarSs) is a family of programming
models (SMPSs [3], CellSs [5] and others). This programming
model is inspired by the behavior of superscalar
processors, which are able to execute more than one instruction
during a clock cycle by simultaneously dispatching multiple
instructions to redundant functional units on the processor.
For StarSs the unit of execution is not the instruction but a
task: a function in the code without side effects (only
variables and parameters are accessed) and with enough
granularity (this may depend on the final target
architecture where StarSs is implemented, and can be tuned
by applying blocking, for example). The basic idea behind
StarSs is that tasks are defined by a sequential program and
at runtime a Directed Acyclic Graph (DAG) is built, where
each node of the DAG represents a task and the edges
represent data precedences that must be respected and that
are automatically detected by the runtime library.
Tasks are identified by pragma annotations of the type:
#pragma css task input (in_var1, in_var2)
output (out_var) inout (in_out_var)
where input, output and inout denote the direction of the
parameters of the task. The case study in this paper is
SMP Superscalar (SMPSs), which targets homogeneous multicore
processors or shared memory machines. The current
runtime is implemented in such a way that, for an application
that runs with N threads, the first thread (main thread)
executes both the main sequential program and tasks (if
time is left), and the other threads (worker threads) only
execute tasks. The scheduling of each task is determined
partially by the order in which the tasks are called, partially
by the data dependences, and partially by the thread that
executes the last predecessor of the task, with the objective
of scheduling dependent tasks on the same core. This favors
locality exploitation, since a dependence denotes that the
predecessor task writes a piece of data that is going to
be read by the successor task.
Parallel, Distributed and Network-based Processing
1066-6192/09 $25.00 © 2009 IEEE
DOI 10.1109/PDP.2009.56
There are N+1 ready lists: one main ready list and one
additional list for each of the threads (including one for the
main thread). Whenever a task is generated, the runtime looks
for data dependences between the new task and former tasks. If
such data dependences exist, the task is inserted in the DAG.
However, if the task does not have any data dependence, then
it is ready for execution and it is inserted in the main ready
list. Worker threads always consume tasks from their own
ready list unless it is empty; in that case they consume tasks
from the main ready list, and if this is also empty they try
to steal tasks from the ready lists of other workers. Threads
consume tasks from their own ready list in LIFO order, and
from the main ready list and other threads' lists in FIFO
order. Whenever a task finishes, the thread updates the
corresponding data structures in the DAG, and if the task
completion has released all the remaining data dependences
of one or more tasks, those tasks are inserted in the ready
list of that thread.
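The consumption order described above can be sketched in C as follows. This is only an illustration under stated assumptions, not the SMPSs runtime code: the ready_list type, its fixed capacity LIST_CAP, the function names and the round-robin choice of the victim thread are all hypothetical, and the locking that a real runtime needs is omitted.

```c
#include <assert.h>
#include <stddef.h>

#define LIST_CAP 64   /* hypothetical fixed capacity, for illustration only */

/* A ready list holding task identifiers in positions [head, tail). */
typedef struct {
    int tasks[LIST_CAP];
    size_t head;   /* index of the oldest task */
    size_t tail;   /* one past the newest task */
} ready_list;

/* Append a task as the newest entry of the list. */
void list_push(ready_list *l, int task)
{
    assert(l->tail < LIST_CAP);
    l->tasks[l->tail++] = task;
}

/* Remove the newest task (LIFO); returns -1 if the list is empty. */
int list_pop_newest(ready_list *l)
{
    if (l->head == l->tail)
        return -1;
    return l->tasks[--l->tail];
}

/* Remove the oldest task (FIFO); returns -1 if the list is empty. */
int list_pop_oldest(ready_list *l)
{
    if (l->head == l->tail)
        return -1;
    return l->tasks[l->head++];
}

/* Next task for thread `self`: own list in LIFO order, then the main
   list in FIFO order, then the other threads' lists in FIFO order. */
int next_task(ready_list *main_list, ready_list *per_thread,
              int nthreads, int self)
{
    int t = list_pop_newest(&per_thread[self]);
    if (t >= 0)
        return t;
    t = list_pop_oldest(main_list);
    if (t >= 0)
        return t;
    for (int i = 1; i < nthreads; i++) {
        int victim = (self + i) % nthreads;   /* assumed victim order */
        t = list_pop_oldest(&per_thread[victim]);
        if (t >= 0)
            return t;
    }
    return -1;   /* nothing ready anywhere */
}
```

Taking the newest task from the thread's own list favors the data most recently produced by that thread, while taking the oldest task from the shared or stolen lists leaves the owner its most recently released work.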
2.2. OpenMP
OpenMP was born in the 1990s with the objective of
standardizing the different directive languages
defined by a community of vendors. Thanks to a set of
characteristics (simplicity of the interface, use of a shared
memory model, and use of loosely-coupled directives to
express the parallelism of a program), it is very well accepted
today. OpenMP is based on the insertion of directives in
the sequential source code that give hints to the runtime
library about the existing parallelism in the application. The
OpenMP pragma annotation denoting a parallel loop is as
follows:
#pragma omp parallel for
Version 3.0 of OpenMP includes a tasking model that
fills a gap with regard to the ways of expressing parallelism
in an application. With the new OpenMP directives,
programmers can identify units of independent work (tasks),
leaving the decision of how and when to execute them to
the runtime system. This gives programmers a way
of expressing patterns of concurrency that do not match
the worksharing constructs defined in the OpenMP 2.5
specification.
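As a minimal sketch of the OpenMP 3.0 tasking style, the function below creates one task per block of an array and lets the runtime decide which thread executes each one. The function name blocked_sum and its parameters are hypothetical and do not come from the benchmark. If the compiler is run without OpenMP support the pragmas are ignored and the code runs sequentially with the same result.

```c
#include <assert.h>

/* Sum an array by creating one OpenMP task per block of bsize elements. */
double blocked_sum(const double *v, int n, int bsize)
{
    assert(bsize > 0);
    double total = 0.0;

    #pragma omp parallel
    #pragma omp single
    {
        /* A single thread generates the tasks; any thread may run them. */
        for (int start = 0; start < n; start += bsize) {
            int end = (start + bsize < n) ? start + bsize : n;

            #pragma omp task firstprivate(start, end) shared(total)
            {
                double partial = 0.0;
                for (int j = start; j < end; j++)
                    partial += v[j];

                /* Tasks are independent except for this accumulation. */
                #pragma omp atomic
                total += partial;
            }
        }
        /* Wait for all tasks generated above before leaving the block. */
        #pragma omp taskwait
    }
    return total;
}
```

Note that the tasks here are independent units of work; OpenMP 3.0 tasks carry no dependence information, so any ordering between tasks has to be enforced explicitly, for instance with taskwait.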
3. Related work
One of the task-based programming models is Cilk [6], a
general-purpose programming language designed for
multithreaded parallel programming. In Cilk, the programmer is
responsible for exposing the application parallelism by
identifying sections of the code (tasks) that can safely be
executed in parallel. Tasks are invoked with the spawn keyword,
and the sync keyword is used to wait until all previously
spawned tasks have completed. Cilk supports recursion
at the task level (tasks generate new tasks) but does not
support automatic detection of data dependences between them.
Therefore, data dependences have to be controlled by the
programmer with the help of the sync keyword. The runtime,
in particular the scheduler, decides how to actually divide
the work between processors. The work-stealing approach
followed by the Cilk scheduler has been designed in such
a way that it naturally exploits the existing data locality, in
particular for recursive tasks.
Cilk initially supported only parallel tasks; however,
Cilk++ also supports parallel loops. The evolution of OpenMP
is just the opposite: it initially supported parallel loops,
while the latest version 3.0 also supports parallel tasks. Both
systems also support recursion at the task level. SMPSs does
not support parallel loops; it supports task parallelism,
but not recursion at the task level.
The main difference between SMPSs and the previous two
approaches is that SMPSs automatically detects the task
dependences, building a task DAG.
Besides, there have been steps towards the integration of
task precedence [7] and task dependence [8] in OpenMP.
With regard to related work that studies the behavior of
applications in ccNUMA shared memory systems, [9] presents
the results of cache and memory performance studies on
an SGI Altix 350. In [10] the authors present a study of
the performance obtained (in relation to the ccNUMA
memory) on a Sun Fire server. The paper also proposes some
performance tunings that improve the application
performance by up to 30%. In [11] the authors present the
evaluation of the SARC programming model on the Cell/BE
architecture using the benchmarks STREAM and RandomAccess.
4. Experiments
The experiments focus on the STREAM benchmark from the
HPC Challenge collection [12]. We present alternative
implementations to the original OpenMP one, an SMPSs
implementation of this benchmark, and an analysis of the results.
4.1. STREAM implementation in OpenMP
STREAM [12] is a simple synthetic benchmark program
that measures sustainable memory bandwidth and the corre-
sponding computation rate for a simple vector kernel.
void tuned_STREAM_Copy()
{
    int j;
#pragma omp parallel for
    for (j = 0; j < N; j++)
        c[j] = a[j];
}
Figure 1: Original copy function in STREAM
void tuned_STREAM_Copy()
{
    int j;
    int i;
#pragma omp for schedule(static, BSIZE)
    for (j = 0; j < N; j++)
        c[j] = a[j];
}
Figure 2: Modified OpenMP version of copy function in
STREAM
The original STREAM implementation already includes
an OpenMP parallelization of four routines (Copy, Scale,
Add and Triad), as shown in figure 1 for the original
Copy function in STREAM.
For the purposes of this paper we have considered an
alternative implementation, shown in figure 2, that mimics
the behavior of the SMPSs implementation explained in the
next section. This version statically distributes chunks of
BSIZE iterations to each of the threads.
Additionally, a parallel pragma is inserted in the main
program, as shown in figure 3. A dynamically
scheduled version, shown in figure 4, is also considered.
#pragma omp parallel private(k)
for (k = 0; k < NTIMES; k++)
{
    tuned_STREAM_Copy();
    tuned_STREAM_Scale(scalar);
    tuned_STREAM_Add();
    tuned_STREAM_Triad(scalar);
}
Figure 3: Modified OpenMP version of main program in
STREAM
void tuned_STREAM_Copy()
{
    int j;
    int i;
#pragma omp for schedule(dynamic, BSIZE)
    for (j = 0; j < N; j++)
        c[j] = a[j];
}
Figure 4: OpenMP version of copy function in STREAM
with dynamic scheduling
The original benchmark includes in its prelude a parallelized
loop where the data arrays are initialized. This
initialization has been slightly modified to also use
the static and dynamic scheduling of chunks of BSIZE
iterations. To compare with SMPSs, we also include
results where the initialization is done sequentially.
#pragma css task input(a) output(c)
void copy_task(double a[BSIZE], double c[BSIZE])
{
    int j;
    for (j = 0; j < BSIZE; j++)
        c[j] = a[j];
}
void tuned_STREAM_Copy()
{
    int j;
    for (j = 0; j < N; j += BSIZE)
        copy_task(&a[j], &c[j]);
}
Figure 5: Copy function in STREAM with SMPSs
4.2. STREAM implementation in SMPSs
For the SMPSs version, we encapsulated chunks of con-
secutive iterations of the loops of functions Copy, Scale,
Add and Triad into SMPSs tasks. Figure 5 shows the code
changes for function Copy.
The STREAM main code is a loop that calls the different
functions. To mimic the behavior of the original
benchmark, we initially inserted explicit barrier
synchronizations between the calls to Copy, Scale, etc., as
shown in figure 6. In this first version, all Copy tasks of one
iteration are performed first, then all Scale tasks, etc. This
not only mimics the original STREAM benchmark, but also
models how a naive task scheduling algorithm would schedule
tasks in SMPSs. Such a naive scheduler would have a single
ready list; ready tasks would be inserted in this list as
soon as all their data dependences are released, and consumed
by the threads in FIFO order.
for (k = 0; k < NTIMES; k++)
{
    tuned_STREAM_Copy();
#pragma css barrier
    tuned_STREAM_Scale(scalar);
#pragma css barrier
    tuned_STREAM_Add();
#pragma css barrier
    tuned_STREAM_Triad(scalar);
#pragma css barrier
}
Figure 6: Main loop of STREAM for SMPSs version
c[j] = a[j];               // Copy
b[j] = scalar * c[j];      // Scale
c[j] = a[j] + b[j];        // Add
a[j] = b[j] + scalar * c[j];   // Triad
Figure 7: Actual operations in STREAM for each element
of the array
However, since SMPSs automatically detects the dependences
between the tasks, another version is possible that
eliminates all the barrier synchronizations. This departs
somewhat from the original objectives of the benchmark, but
enables us to measure how the current scheduling implemented
in the SMPSs runtime exploits the locality of the application.
If we look at the basic code executed for each element of
the vector, we have the code listed in figure 7.
In this code we can see that the data dependences define
the following task precedence inside each iteration: Copy,
Scale, Add, Triad. Additionally, between iterations, the
Triad task of one iteration precedes the Copy task of the
next iteration (Triad writes the array a that Copy reads).
Therefore, with the scheduling strategy currently implemented
in SMPSs, for a chunk of the array all the tasks operating on
this chunk can be executed in the same core, exploiting the
locality of the benchmark.
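The effect of these precedences can be illustrated with a small sequential C sketch. It compares the kernel-by-kernel order that the barriers enforce with a chunk-by-chunk order in which the whole Copy, Scale, Add, Triad chain of one chunk runs back to back, which is one of the orders the dependences allow. All function names are hypothetical, and the sketch models only the execution order, not the SMPSs runtime.

```c
#include <assert.h>

/* End of the chunk starting at lo, clipped to the array size n. */
int chunk_end(int lo, int bsize, int n)
{
    return (lo + bsize < n) ? lo + bsize : n;
}

/* The four STREAM kernels applied to the elements [lo, hi). */
void copy_chunk(const double *a, double *c, int lo, int hi)
{
    for (int j = lo; j < hi; j++) c[j] = a[j];
}
void scale_chunk(double *b, const double *c, double s, int lo, int hi)
{
    for (int j = lo; j < hi; j++) b[j] = s * c[j];
}
void add_chunk(const double *a, const double *b, double *c, int lo, int hi)
{
    for (int j = lo; j < hi; j++) c[j] = a[j] + b[j];
}
void triad_chunk(double *a, const double *b, const double *c, double s,
                 int lo, int hi)
{
    for (int j = lo; j < hi; j++) a[j] = b[j] + s * c[j];
}

/* Kernel-major order: every chunk finishes one kernel before the next
   kernel starts, as with a barrier after each kernel. */
void run_with_barriers(double *a, double *b, double *c, int n, int bsize,
                       double s, int ntimes)
{
    assert(bsize > 0);
    for (int k = 0; k < ntimes; k++) {
        for (int lo = 0; lo < n; lo += bsize)
            copy_chunk(a, c, lo, chunk_end(lo, bsize, n));
        for (int lo = 0; lo < n; lo += bsize)
            scale_chunk(b, c, s, lo, chunk_end(lo, bsize, n));
        for (int lo = 0; lo < n; lo += bsize)
            add_chunk(a, b, c, lo, chunk_end(lo, bsize, n));
        for (int lo = 0; lo < n; lo += bsize)
            triad_chunk(a, b, c, s, lo, chunk_end(lo, bsize, n));
    }
}

/* Chunk-major order: the full chain of one chunk, over all iterations,
   runs before the next chunk is touched. Chunks never share elements,
   so the data dependences are still respected. */
void run_chunk_chains(double *a, double *b, double *c, int n, int bsize,
                      double s, int ntimes)
{
    assert(bsize > 0);
    for (int lo = 0; lo < n; lo += bsize) {
        int hi = chunk_end(lo, bsize, n);
        for (int k = 0; k < ntimes; k++) {
            copy_chunk(a, c, lo, hi);
            scale_chunk(b, c, s, lo, hi);
            add_chunk(a, b, c, lo, hi);
            triad_chunk(a, b, c, s, lo, hi);
        }
    }
}
```

Both orders apply the same operations to each element in the same sequence, so they produce identical arrays; the chunk-major order simply reuses a chunk while it is likely to still be in the cache of the core that last wrote it.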
Additionally, to mimic the parallel initialization of data
that the OpenMP version is able to do, we have encapsulated
the initialization of the data arrays in a task (each task
initializes a chunk of BSIZE elements of the arrays).
Figure 8: Execution time of the STREAM benchmark with
sequential initialization (8 processors)
4.3. Execution environment
The results presented in this section have been obtained
on an SGI Altix 4700 at BSC with 128 cores (32 nodes, each
with 2 dual-core processors) and a total of 1TB of memory.
We have run the SMPSs versions of the experiments with
SMPSs version 2.0 and extracted Paraver [13] tracefiles
with hardware counters (obtained with the support of the PAPI
library, version 3.5.0). The OpenMP versions have been
compiled with the native ICC compiler, version 10.0. This
compiler was also used as the back-end compiler for the SMPSs
versions.
4.4. Versions’ Comparison
The first experiments consisted of measuring the
execution time of all the aforementioned
versions. We ran the different examples changing the array
size, from 2 million elements to 321.93 million elements
(this is not an accidental number, since with this size the
total amount of memory used is 7368.4MB, which is roughly
90% of the memory available per core on this machine).
With the objective of analyzing the impact of enlarging the
size of the chunks of data assigned to each thread (or task),
the chunk size used is 1% of the total size of the array.
We have also changed the number of processors, running the
experiments with 8, 16 and 32 processors.
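The figure of 7368.4MB can be reproduced with a short calculation. The sketch below assumes three arrays (a, b and c) of 8-byte doubles, that MB is counted as 2^20 bytes, and that the 1TB of memory is split evenly among the 128 cores; these assumptions and the function names are ours, chosen because they match the numbers quoted above.

```c
#include <assert.h>

/* Memory used by the STREAM arrays, in units of 2^20 bytes. */
double stream_footprint_mb(double elements)
{
    assert(elements > 0.0);
    const double arrays = 3.0;              /* a, b and c */
    const double bytes_per_element = 8.0;   /* assumed size of a double */
    return elements * arrays * bytes_per_element / (1024.0 * 1024.0);
}

/* Memory available per core, in units of 2^20 bytes, assuming the
   total (given in units of 2^40 bytes) is split evenly among cores. */
double memory_per_core_mb(double total_tb, int cores)
{
    assert(cores > 0);
    return total_tb * 1024.0 * 1024.0 / cores;
}
```

With 321.93 million elements the footprint is about 7368.4 of these units, and 1TB over 128 cores gives 8192 per core, so the largest array size uses close to 90% of that share.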
Figures 8, 9, and 10 show the comparison of the two
OpenMP versions (static and dynamic scheduling) against
the two SMPSs versions (with and without barriers) when
the data arrays are sequentially initialized.
When using 8 processors, the OpenMP version with static
scheduling has the worst behavior, while the OpenMP version
with dynamic scheduling and SMPSs with barriers perform
equally, and SMPSs without barriers slightly outperforms
both. With 16 processors, the results are very similar,
with the difference that now the OpenMP version with
dynamic scheduling is a bit better than the SMPSs version with
barriers. With 32 processors, although it is not clearly
observable in the chart, apart from the OpenMP version with
static scheduling, which gets the worst results, the rest of
the versions have the same behavior. Another important
observation from these charts is that no improvement in the
total execution time is observed when using more processors
(none of the cases scales with the number of processors).
Figure 9: Execution time of the STREAM benchmark with
sequential initialization (16 processors)
Figure 10: Execution time of the STREAM benchmark with
sequential initialization (32 processors)
Figures 11, 12, and 13 show the comparison of the two
OpenMP versions (static and dynamic scheduling) against
the two SMPSs versions (with and without barriers) when
the data arrays are initialized in parallel.
When the arrays are initialized in parallel the situation
changes, since the physical memory in the SGI Altix is by
default allocated on a first-touch basis. With 8 processors,
the chart looks similar to the one obtained with sequential
initialization. However, the situation with 16 and 32
processors looks very different: the OpenMP version with static
scheduling improves a lot, and the OpenMP version with dynamic
scheduling behaves worse as the number of processors
increases. We would like to stress here that the SMPSs
version without barriers shows a more stable behavior,
being the best or almost the best in all cases.
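A minimal sketch of first-touch-aware initialization in OpenMP follows, assuming a first-touch page placement policy like the one described above. The function names and the initial values are hypothetical. The point is only that the initialization loop uses the same static schedule and chunk size as the compute loop, so a page tends to be placed near the thread that later works on it.

```c
#include <assert.h>

/* Initialize the arrays with the same static distribution of chunks
   that the compute loops use, so each thread touches its chunks first. */
void init_first_touch(double *a, double *b, double *c, int n, int bsize)
{
    assert(bsize > 0);
    #pragma omp parallel for schedule(static, bsize)
    for (int j = 0; j < n; j++) {
        a[j] = 1.0;   /* illustrative initial values */
        b[j] = 2.0;
        c[j] = 0.0;
    }
}

/* Copy kernel with the same schedule: iteration j is assigned to the
   same thread number as in init_first_touch for equal n and bsize. */
void copy_static(const double *a, double *c, int n, int bsize)
{
    assert(bsize > 0);
    #pragma omp parallel for schedule(static, bsize)
    for (int j = 0; j < n; j++)
        c[j] = a[j];
}
```

With dynamic scheduling there is no such correspondence between the thread that initializes a chunk and the thread that later processes it, which is consistent with the degradation observed for the dynamic version at higher processor counts.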
Figure 11: Execution time of the STREAM benchmark with
parallel initialization (8 processors)
Figure 12: Execution time of the STREAM benchmark with
parallel initialization (16 processors)
Figure 13: Execution time of the STREAM benchmark with
parallel initialization (32 processors)
Figure 14: Tasks’ scheduling in the SMPSs versions
4.5. Analysis of the results
This section reports a more in-depth analysis of the
tracefiles. The first analysis compares the four
cases when the data arrays are sequentially initialized. The
objective of this first analysis is to understand why none
of these versions actually scales with the number of
processors. The analysis is done by comparing traces with 8,
16 and 32 processors.
Regarding the scheduling of the SMPSs versions, the traces
show that in the version with barriers all tasks of one type
are executed before those of the next type, as the barriers
require, whereas in the version without barriers
tasks of different types are interleaved, preserving the data
dependences and exploiting the locality of the chunks of
data. This is shown in figure 14: the x-axis represents the
timeline and each line corresponds to one thread. Different
colors correspond to different tasks (the small green flags
indicate the beginning of a new task).
We first looked at the average time to execute each of
the tasks in the different versions. This time is different for
each task type, as can be observed in figure 15. For
this case, with a relatively small data array size, the version
with barriers also shows a deviation in time between the
first four threads and the other four. This is due to the fact
that the arrays are allocated and initialized in the sequential
section of the benchmark by the first thread; therefore, we
can assume that threads 1–4 are located in the first node
and threads 5–8 in another node. This deviation is not
observed in the version without barriers, since the locality is
overall better exploited.
However, when analyzing the traces with a larger data
array size, this difference between the versions with and
without barriers is less evident. This can be observed in
figure 16 for the SMPSs versions, when an array size of 32
million elements is used.
To understand this difference in task time we looked
at the L3 and TLB miss ratios of the different cases.
Although these ratios did not show significant differences,
the bandwidth obtained by each of the threads in the different
cases reflects the same difference, as can be seen in figure 17.
This figure shows, for the SMPSs version without barriers,
how the memory bandwidth varies when we run the
benchmark with 8, 16 and 32 processors. The figure shows
not only the differences between the bandwidth obtained
in the first 4 processors (located in the first node) and the
others, but also how this difference grows with the number
of processors. In the case of 32 processors, the average
bandwidth in processors 4–31 is only 58.8 MB/s. For the
rest of the examples (SMPSs with barriers and OpenMP)
the behaviour is similar, and we do not show the results for
reasons of space and redundancy. Clearly, the fact that the
data arrays are initialized in one node creates a bottleneck
in this node, and therefore in these systems it is a good idea
to distribute the data initialization.
Figure 15: Average time per task and thread, with (top) and
without barriers (bottom), using 8 processors and data array
size of 2 million elements
Another interesting fact that we wanted to understand
from figures 11 – 13 is the behavior of the OpenMP
static version. It is surprising that OpenMP with static
scheduling obtained the worst results with 8 processors and
almost the best results with 32 processors. Given that the
static scheduling distributes the chunks of iterations linearly,
this version should be able to exploit the data locality.
Also, one would expect a uniform access time for all
processors. Looking at the tracefiles, no important imbalance
was found in the static version, although some processors
executed one more chunk of iterations than others. Therefore,
it was very surprising that, although the L3 and TLB miss
ratios were perfectly balanced across processors and, in the
static version, were less than 50% of those in the dynamic
version, the dynamic version was beating the static one with
8 processors.
Figure 16: Average execution time of tasks per thread with SMPSs
Figure 17: Bandwidth with memory for the SMPSs version
without barriers
We analysed the memory bandwidth obtained by the
static and dynamic scheduling versions, which is shown in
figures 18 and 19. While the static scheduling is able to
achieve more or less the same bandwidth in the three cases
(with 8, 16 and 32 processors), with the dynamic scheduling
the bandwidth obtained decreases as processors are added.
This explains the improvement of the static version with
regard to the dynamic version when a higher number of
processors is used. However, it is still not clear why the
dynamic case is better than the static one with 8 processors.
Figure 18: Bandwidth with memory for the OpenMP version
when using static scheduling
Since the arrays were split in chunks of data whose size is
not a multiple of the memory page size, a possible explanation
is that the first-touch instantiation of a page does
not necessarily guarantee that a chunk is local to the
memory module where the thread is running. We repeated the
example with new sizes (with chunk sizes that are multiples
of the memory page size) and also with a balanced load for all
the processors. In this case, the behavior of the example
improves significantly, as does the bandwidth achieved
when using 8 processors, as can be seen in figure 20. The
peak of the curve is now centered on 1220 MB/s. Since the
tracefiles with only one processor per node show that the
peak bandwidth achievable with this example is 5000 MB/s,
we cannot expect much more with 4 processors per node
(an even share would be 5000/4 = 1250 MB/s per processor).
Additionally, if all threads in the same node access memory
at the same time they create conflicts between them.
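The two adjustments (chunk sizes that are multiples of the page size and a balanced number of chunks) can be sketched with two small helpers. The function names are hypothetical, and the page size is a parameter because the value used on the test machine is not stated in the text. For a chunk to map onto whole pages the array itself would also have to start on a page boundary, which this sketch does not handle.

```c
#include <assert.h>
#include <stddef.h>

/* Round a chunk size, given in elements, up to a whole number of pages.
   Assumes the page size is a multiple of the element size. */
size_t page_aligned_chunk(size_t elems, size_t elem_size, size_t page_size)
{
    assert(elem_size > 0 && page_size >= elem_size);
    assert(page_size % elem_size == 0);
    size_t per_page = page_size / elem_size;
    return ((elems + per_page - 1) / per_page) * per_page;
}

/* Round a number of chunks up to a multiple of the number of
   processors, so that every processor gets the same number of chunks. */
size_t balanced_chunk_count(size_t chunks, size_t nprocs)
{
    assert(nprocs > 0);
    return ((chunks + nprocs - 1) / nprocs) * nprocs;
}
```

For example, with 8-byte elements and an illustrative page size of 16384 bytes, a requested chunk of 20000 elements becomes 20480 elements (10 pages), and 100 chunks on 8 processors become 104 chunks.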
Figure 19: Bandwidth with memory for the OpenMP version
when using dynamic scheduling
Figure 20: Bandwidth with memory for the OpenMP version
when using static scheduling (impact of using data size
multiple of memory page)
5. Further exploiting the memory locality
As seen in the previous section, exploiting the memory
locality in ccNUMA based systems has a large impact on the
performance results, and a dynamic scheduling that is aware
of the locality should be able to obtain good performance.
With the objective of further improving how SMPSs exploits
the data locality, we have modified the
way tasks are assigned to the threads.
The existing mechanism in SMPSs exploits the locality
by executing sequences of data dependent tasks in the same
thread. However, the creation of these sequences of tasks can
be interrupted either by the SMPSs runtime graph-creation
mechanism (which has a threshold on the maximum number
of tasks in the graph for memory allocation reasons) or by
the program synchronization points (i.e. barriers).
Figure 21: Execution time of the STREAM benchmark
when changing array size (8 processors)
Figure 22: Execution time of the STREAM benchmark
when changing array size (16 processors)
The newly implemented mechanism is able to remember
which thread did the first touch of a given block of data.
Whenever a new task is added by the SMPSs runtime, if
this task is ready, instead of being inserted in the main ready
list it is directly inserted (in FIFO order) in the ready list of
the thread that first touched the data accessed by the task.
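The bookkeeping behind such a mechanism might look like the following C sketch. It is a hypothetical illustration, not the SMPSs implementation: the touch_table type, the fixed table size, the integer block identifiers and the fallback to the main ready list for blocks that nobody has touched yet are all our assumptions.

```c
#include <assert.h>

#define MAX_BLOCKS 1024   /* hypothetical table size */
#define NO_OWNER   (-1)

/* owner[b] is the thread that first touched block b, or NO_OWNER. */
typedef struct {
    int owner[MAX_BLOCKS];
} touch_table;

/* Mark every block as not yet touched. */
void touch_table_init(touch_table *t)
{
    for (int b = 0; b < MAX_BLOCKS; b++)
        t->owner[b] = NO_OWNER;
}

/* Record that `thread` accessed `block`. Only the first touch is kept;
   later touches by other threads do not change the owner. */
void record_touch(touch_table *t, int block, int thread)
{
    assert(block >= 0 && block < MAX_BLOCKS);
    assert(thread >= 0);
    if (t->owner[block] == NO_OWNER)
        t->owner[block] = thread;
}

/* Ready list that should receive a ready task accessing `block`:
   the list of the first-touch thread, or main_list_id (assumed
   fallback) when the block has not been touched yet. */
int target_ready_list(const touch_table *t, int block, int main_list_id)
{
    assert(block >= 0 && block < MAX_BLOCKS);
    int owner = t->owner[block];
    return (owner == NO_OWNER) ? main_list_id : owner;
}
```

Because the owner of a block never changes after the first touch, tasks on that block keep returning to the thread whose node holds the pages, even when a barrier or the graph-size threshold has broken the chain of dependent tasks.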
Figures 21, 22, and 23 compare a new set of versions, both
for OpenMP and for SMPSs. First, the size of the array
chunks is now a multiple of the memory page size in all cases.
Also, the number of chunks into which we divide the array is
a multiple of the number of processors. Additionally, for
the static scheduling version, the "nowait" clause is added
in the pragmas. However, this clause cannot be inserted
when dynamic scheduling is used, since OpenMP does not
preserve the data dependences and the benchmark then fails
validation. For the SMPSs versions, we changed the
sizes of the arrays and chunks of data as well. Two versions
are tested, both of which use the mechanism described above
to map tasks to threads: one that performs the initialization
of the data (in parallel) and inserts a barrier after this
initialization, and a second one that does not insert any
barrier.
Figure 23: Execution time of the STREAM benchmark
when changing array size (32 processors)
A significant difference here is that the OpenMP version
with static scheduling and the nowait clause outperforms
the dynamic scheduling in all cases. Also, the SMPSs versions
with the new memory affinity mechanism show better
performance than the OpenMP cases for all processor counts.
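The use of nowait with static scheduling can be sketched as follows for two consecutive kernels. The function name and signature are hypothetical. The sketch relies on the OpenMP 3.0 rule that two loops in the same parallel region with the same iteration count and the same static chunk size assign the same iterations to the same threads, so each thread reads in the second loop only the elements it wrote in the first one.

```c
#include <assert.h>

/* Copy followed by Scale inside one parallel region. */
void copy_scale_nowait(const double *a, double *b, double *c,
                       int n, int bsize, double scalar)
{
    assert(bsize > 0);
    #pragma omp parallel
    {
        /* No barrier after this loop: a thread moves on to Scale as
           soon as it has finished its own chunks of Copy. */
        #pragma omp for schedule(static, bsize) nowait
        for (int j = 0; j < n; j++)
            c[j] = a[j];

        /* Same iteration count and chunk size, so iteration j runs on
           the thread that wrote c[j] above. */
        #pragma omp for schedule(static, bsize)
        for (int j = 0; j < n; j++)
            b[j] = scalar * c[j];
    }
}
```

With schedule(dynamic, bsize) the mapping of iterations to threads changes from loop to loop, so removing the barrier would allow Scale to read elements of c that Copy has not written yet; this is the reason the clause is restricted to the static version.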
6. Conclusions
The paper presents a comparison of how the STREAM
benchmark can be implemented with OpenMP and with
SMPSs, and how the different scheduling mechanisms impact
the achievable memory bandwidth. With OpenMP, an
evaluation of the impact of static and dynamic scheduling
on the performance of a memory intensive benchmark
is presented. With SMPSs, results obtained with different
versions, with and without barriers, are
presented. For both OpenMP and SMPSs, the impact of
initializing the data sequentially is analyzed; clearly, for
ccNUMA memory based systems this is not an appropriate
option, since it allocates all the data in the same
memory module, creating a bottleneck in the memory
accesses.
Besides, both OpenMP and SMPSs require scheduling
schemes that, although aware of the locality (which
is a rather static feature), are dynamic enough to
adapt to other sources of imbalance in the systems. More
specifically for SMPSs, increasing the threshold on the number
of tasks in the graph would make it possible to better exploit
the temporal locality that appears between computations that
are far apart in the original source code.
Acknowledgments
The authors acknowledge the financial support of the
Comisión Interministerial de Ciencia y Tecnología (CICYT,
Contract TIN2007-60625) and the BSC-IBM MareIncognito
research agreement.
References
[1] cOMPunity. The community of OpenMP users,
researchers, tool developers and provider website.
http://www.compunity.org/, 2006.
[2] E. Ayguadé, N. Copty, A. Duran, J. Hoeflinger, Y. Lin, and
G. Zhang. A proposal for task parallelism in OpenMP. In
Proceedings of the 3rd International Workshop on OpenMP,
June 2006.
[3] J.M. Perez, R.M. Badia, and J. Labarta. A dependency-aware
task-based programming environment for multi-core
architectures. In Proceedings of IEEE Cluster Computing
2008, 2008.
[4] E. Ayguadé, A. Duran, J. Hoeflinger, F. Massaioli, and
X. Teruel. An experimental evaluation of the new OpenMP
tasking model. In Proceedings of the 20th International
Workshop on Languages and Compilers for Parallel Computing,
2007.
[5] J.M. Perez, P. Bellens, R.M. Badia, and J. Labarta. CellSs:
Programming the Cell/B.E. made easier. IBM Journal of
Research and Development, 51(5), Aug 2007.
[6] M. Frigo, C.E. Leiserson, and K.H. Randall. The
implementation of the Cilk-5 multithreaded language. SIGPLAN
Notices, 33(5):212–223, 1998.
[7] M. González, E. Ayguadé, X. Martorell, and J. Labarta.
Exploiting pipelined executions in OpenMP. In Proceedings
of the 32nd Annual International Conference on Parallel
Processing, pages 153–160, Oct 2003.
[8] A. Duran, J.M. Perez, E. Ayguadé, R.M. Badia, and
J. Labarta. Extending the OpenMP tasking model to allow
dependent tasks. In Proceedings of the 4th International
Workshop on OpenMP, 2008.
[9] G. Juckeland, M.S. Muller, W.E. Nagel, and S Pflu. Accessing
data on SGI Altix: An experience with reality. In Proceedings
of WMPI 2006, 2006.
[10] A. Kayi, E. Kornkven, T. El-Ghazawi, and G. Newby.
Application performance tuning for clusters with ccNUMA nodes.
In Computational Science and Engineering, 2008. CSE '08.
11th IEEE International Conference on, pages 245–252, July
2008.
[11] R. Ferrer, M. González, F. Silla, X. Martorell, and
E. Ayguadé. Evaluation of memory performance on the Cell BE
with the SARC programming model. In Proceedings of MEDEA
workshop (PACT), 2008.
[12] HPCS. The HPC Challenge benchmark.
http://icl.cs.utk.edu/hpcc/index.html.
[13] Jesús Labarta, Sergi Girona, Vincent Pillet, Toni Cortes,
and Luis Gregoris. DiP: A parallel program development
environment. In Proceedings of the 2nd International EuroPar
Conference (EuroPar 96), 1996.
