O algoritmo batched DOACROSS by Lucas, Divino César Soares, 1985-
Universidade Estadual de Campinas
Instituto de Computação
Divino César Soares Lucas
The Batched DOACROSS Algorithm
O Algoritmo Batched DOACROSS
CAMPINAS
2017
Dissertation presented to the Institute of
Computing of the University of Campinas in
partial fulfillment of the requirements for the
degree of Doctor in Computer Science.
Supervisor: Prof. Dr. Guido Costa Souza Araújo
This copy corresponds to the final version of the thesis defended by Divino César Soares Lucas and supervised by Prof. Dr. Guido Costa Souza Araújo.
Funding agency(ies) and grant number(s): Not applicable.
Cataloging record
Universidade Estadual de Campinas
Biblioteca do Instituto de Matemática, Estatística e Computação Científica
Ana Regina Machado - CRB 8/5467

Lucas, Divino César Soares, 1985-
L962b  The batched DOACROSS algorithm / Divino César Soares Lucas. – Campinas, SP : [s.n.], 2017.

Supervisor: Guido Costa Souza Araújo.
Thesis (doctorate) – Universidade Estadual de Campinas, Instituto de Computação.

1. Parallel processing (Electronic computers). 2. Parallel programming (Computer science). 3. Multicore processors. 4. Parallel algorithms. 5. Compiling (Electronic computers). I. Araújo, Guido Costa Souza, 1962-. II. Universidade Estadual de Campinas. Instituto de Computação. III. Title.
Information for the Digital Library
Title in another language: O algoritmo batched DOACROSS
Keywords in English:
Parallel processing (Electronic computers)
Parallel programming (Computer science)
Multicore processors
Parallel algorithms
Compiling (Electronic computers)
Area of concentration: Computer Science
Degree: Doctor in Computer Science
Examination committee:
Guido Costa Souza Araújo [Supervisor]
Alexandro José Baldassin
Fernando Magno Quintão Pereira
Sandro Rigo
Márcio Machado Pereira
Defense date: 06-10-2017
Graduate program: Computer Science
Examination Committee:
• Prof. Dr. Guido Costa Souza Araújo - Chair
Universidade Estadual de Campinas
• Prof. Dr. Sandro Rigo
Universidade Estadual de Campinas
• Prof. Dr. Márcio Machado Pereira
Universidade Estadual de Campinas
• Prof. Dr. Alexandro José Baldassin
Universidade Estadual Paulista
• Prof. Dr. Fernando Magno Quintão Pereira
Universidade Federal de Minas Gerais
The defense minutes, with the signatures of the committee members, are kept in the student's academic record.
Campinas, October 6, 2017
Resumo
Parallelizing program loops that contain loop-carried dependences is considered a very difficult task, mainly because of the overhead imposed by communicating data dependences between iterations of the parallelized loop. Despite the great effort made in recent decades to create effective algorithms for parallelizing this kind of loop, the problem is still considered far from solved. For many loops, old algorithms, such as DOACROSS, as well as new algorithms, such as Decoupled Software Pipeline (DSWP), have not been able to offer an efficient solution to the problem.
This thesis discusses in detail two of the most prominent algorithms (DOACROSS and DSWP) for parallelizing such loops and also presents an analysis of the performance of programs parallelized with these algorithms on different processor architectures. This analysis led to the design of a new algorithm, called Batched DOACROSS (BDX), for parallelizing this kind of loop. The algorithm was designed to use the best characteristics of these earlier algorithms while avoiding properties that proved inefficient in the past. The Batched DOACROSS algorithm does not require hardware support (as DSWP does) and uses thread-local structures to reduce the synchronization overhead between threads. An extension to the BDX algorithm, called Parallel Stage Batched DOACROSS (PS-BDX), is also proposed, and the results indicate that in some cases this extension can produce significant performance gains.
BDX and PS-BDX are algorithms that transform the serial loop to execute in parallel following a pipeline format; in addition, these algorithms execute batches of iterations to reduce the cost of communication between cores. Sensitivity analysis results show that these new algorithms can produce good results even for small loops (containing around 40 instructions) when configured with batches of at least 100 iterations.
The analysis of the PS-BDX algorithm using seven programs showed an average performance improvement of 1.85x when the programs were parallelized using 2 threads, 2.95x with 4 threads, and 3.11x with 8 threads. In all of these cases the version parallelized with PS-BDX performed better than the second-best algorithm tested (PS-DSWP).
A quantitative and qualitative analysis of the synchronization costs in programs parallelized using the algorithms above was also carried out. The results indicate that, on average, 30% of the execution time of the parallel programs is spent synchronizing access to critical regions and shared data. Results are also presented indicating that the Intel Ivy Bridge and ARM A9 MPCore computer architectures lie at opposite extremes with respect to the requirements for loop parallelization. As a consequence, all of the analyzed algorithms face difficulties in improving the performance of the serial programs.
Abstract
Parallelizing loops containing loop-carried dependences has been considered a very difficult task, mainly due to the overhead imposed by communicating dependences between iterations. Despite the huge efforts in the past few decades to devise effective parallelization algorithms for such loops, the problem is still far from solved. For many loops, neither the old DOACROSS algorithm nor the newer Decoupled Software Pipeline (DSWP) algorithm has been able to offer a solution to this problem.
This thesis discusses in detail two of the most prominent algorithms for parallelizing such loops and also presents an analysis of the performance of the parallelized programs across different multicore architectures. Based on insights from this analysis, a new algorithm, called Batched DOACROSS, is proposed for parallelizing these loops. Batched DOACROSS (BDX) capitalizes on the advantages of DSWP and DOACROSS while minimizing their deficiencies. BDX does not require new hardware mechanisms (as DSWP does) and makes use of thread-local buffers to reduce DOACROSS synchronization overheads. An extension to the baseline algorithm, named Parallel-Stage Batched DOACROSS (PS-BDX), is also proposed, and we show that in some cases it can considerably improve the performance of the parallel loop.
BDX and PS-BDX are pipeline multithreading algorithms that employ batching to amortize communication overheads. We provide results for a sensitivity analysis and show that for small balanced loops (about 40 instructions), a batch size of only 100 iterations is sufficient to provide good speedups. Our analysis of PS-BDX on seven benchmarks showed an average speedup of 1.85x for 2 threads, 2.95x for 4 threads and 3.11x for 8 threads, which was larger than that of the next-best algorithm we compared against.
A qualitative and quantitative analysis of the synchronization costs of the three aforementioned loop parallelization algorithms (BDX, DOACROSS and DSWP) is performed for two modern computer architectures (ARM A9 MPCore and Intel Ivy Bridge). Our results show that at least 30% of the execution time of the programs we parallelized is spent on synchronization and data communication.
We also show that, besides the problem being hard, Intel Ivy Bridge and ARM A9 MPCore are at opposite endpoints along the axis of commonly accepted requisites for efficient loop parallelization. As a consequence, all three algorithms struggle to effectively speed up several programs.
List of Figures
2.1 Examples of DOACROSS and DOALL Loops
2.2 Examples of code using DOACROSS and DSWP
2.3 Inter-thread communication in BDX, DOACROSS and DSWP
2.4 Source code for threads composing BDX parallelized example loop.
3.1 BDX Parallelized
3.2 BDX execution flow example.
4.1 Performance of BDX in function of batch size and stage size
4.2 Performance of BDX in function of loop-independent dependences
4.3 Performance of BDX in function of loop-dependent dependences
4.4 Communication and computation execution time partitioning.
4.5 ARM & Intel speedups for BDX and DOAX-U loops.
4.6 Number of core-to-core cache lines forwarded
4.7 Speedup of BDX, DOAX, DOAX-U and DSWP for all programs
4.8 Application speedup of programs parallelized with PS-BDX and PS-DSWP.
List of Tables
2.1 Loop parallelization synchronization overhead
4.1 Target architectures detail.
4.2 Characteristics of parallelized programs
4.3 Runtime information about the loops parallelized
4.4 Which algorithm is best for each type of loop
A.1 BDX speedup per batch and stage size.
B.1 BDX speedup per number of variables and uses.
Glossary
BDX Batched DOACROSS.
CFG Control Flow Graph.
CPI Cycles per Instruction.
CPU Central Processing Unit.
DAG Directed Acyclic Graph.
DDG Data Dependence Graph.
DOAX DOACROSS Parallelization Algorithm.
DSWP Decoupled Software Pipeline.
ILP Instruction Level Parallelism.
ISA Instruction Set Architecture.
PDG Program Dependence Graph.
PMT Pipeline Multithread.
PS-BDX Parallel-Stage Batched DOACROSS.
PS-DSWP Parallel-Stage Decoupled Software Pipeline.
SCC Strongly Connected Component.
SIMD Single Instruction Multiple Data.
SMT Simultaneous Multithreading.
SPSC Single Producer Single Consumer.
TLS Thread Level Speculation.
Contents
1 Introduction
2 Background and Related Work
2.1 Background and Definitions
2.2 DOACROSS Parallelization Algorithm
2.3 Decoupled Software Pipeline Parallelization
2.4 BDX Parallelization Idea
2.5 Related Works
2.5.1 Other Loop Parallelization Algorithms
2.5.2 Communication Overhead and its Mitigation
3 Batched DOACROSS
3.1 The BDX Execution
3.1.1 BDX Execution Flow
3.2 BDX Comparative Analysis
3.3 BDX Code Generation
3.3.1 BDX in CLang + OpenMP
4 Experimental Results
4.1 Methodology
4.2 Sensibility Analysis
4.2.1 Batch Size and Stage Size
4.2.2 Effect of the number of Loop-Independent dependences
4.2.3 Effect of the number of Loop-carried dependences
4.3 Communication Overhead - Quantitative Analysis
4.4 Fast Computation vs Fast Communication
4.5 Reducing Communication Overhead
4.6 Application Speedup and Overall Results
5 Conclusions
Bibliography
A Speedup by Batch and Stage Size
B Speedup by variable and uses
C Serial Version of the Example Loop
D BDX Version of the Example Loop
E Sources Speedup by Batch and Stage Size
F Sources Effect of Loop-independent dependences
G Sources Effect of Loop-Dependent Dependences
Chapter 1
Introduction
There are many forms and levels (granularities) of parallelism in programs, and researchers and end-users have been investigating and exploring them for many decades [5, 19–21, 27, 32, 35, 36, 47, 50, 52, 63, 65]. Research in multiprocessor architectures dates back to the 1950s, and the use of Single Instruction Multiple Data (SIMD) parallelism started in the early 1970s [10, 61, 72, 84]. Since then, researchers have frequently used clusters of commodity PCs to explore large-scale parallelism [8, 25, 53, 54, 88]. Nowadays, commodity processors are manufactured with multiple cores attached to a coherent bus, all on the same chip [12, 39, 69, 74]. Moreover, each of these cores is capable of handling several forms of low-level parallelism, such as bit-level and Instruction Level Parallelism (ILP) [1, 3, 48, 69]. The advent of multicore and Simultaneous Multithreading (SMT) technology in commodity processors has put the possibility of exploiting parallelism in the hands of even novice programmers.
With the growth of the number of cores in multicore architectures together with the
decrease in clock frequency of individual processors, extracting parallelism from programs
has become a central task for improving program performance [14,28–30,37,38,86,93].
Given that program parallelization is an error-prone task and that loops account for most of a program’s execution time [1, 6, 24, 26, 48], the development of efficient loop parallelization algorithms that compilers can use to automatically parallelize loops for execution on modern commodity processors has received renewed attention in recent years [2, 15, 16, 66, 75, 78, 92, 94].
In the context of loop parallelization, loops are usually classified based on whether
their iterations are dependent on each other or not [1, 16, 22, 44, 48, 55]. We say that one
iteration j depends on another iteration i if data produced by one instruction executing
in iteration i is later consumed by an instruction executing in iteration j.
If there are no dependences among instructions executing in different iterations, the
loop is named DOALL [55]. DOALL loops are trivial to parallelize and frequently pro-
duce considerable speedups. Nevertheless, such loops are found mostly in regular (e.g.,
scientific) applications [9,22,23,44,48,51,55,68,85]. This is also the kind of loop explored
in vectorization [1, 48].
Loops containing dependences across iterations (i.e., loop-carried dependences) are
called DOACROSS loops [23, 48] and are the most common type of loops [9, 22, 23, 44,
48, 51, 55, 68, 85]. Efficient parallelization of these loops usually requires assigning loop
iterations to distinct execution threads and designing an orchestration mechanism to
manage the communication between threads.
The large number of different approaches proposed to parallelize DOACROSS loops indicates that there is no silver bullet for this problem [22, 44, 66, 75–77, 89, 90, 92]. In practice, either the speedups achieved by these algorithms are small or their requirements are considerable, ranging from expensive synchronization code [15, 48] to changes in the processor micro-architecture [66]. For instance, DSWP [66] requires a hardware-based inter-core queue to provide good speedups [81], and Helix [16] suggests the use of helper threads for prefetching synchronization and inter-thread communication data.
This work investigates the static and dynamic sources of inter-thread communica-
tion overhead in modern multicore architectures and state-of-the-art algorithms for loop
parallelization. It proposes a new algorithm for efficient automatic parallelization of
DOACROSS loops for multicore architectures. Specifically, the major contributions of
this thesis are:
• We propose a novel loop parallelization algorithm, called Batched DOACROSS
(BDX), which combines the best features of two modern loop parallelization al-
gorithms and can effectively parallelize loops of coarse and fine granularity across
different modern multicore architectures. Furthermore, we also propose an extension
to BDX, called Parallel-Stage Batched DOACROSS (PS-BDX), which can perform
considerably better under certain scenarios (when there are large parallel stages).
• We show that, considering a trade-off between instruction execution capability (i.e., Cycles per Instruction (CPI)) and inter-core communication speed (i.e., cache coherency latency), Intel Ivy Bridge and ARM A9 MPCore are at opposite endpoints of the design space and, thus, the performance of current algorithms is highly dependent not only on loop granularity but also on specific characteristics of the target architecture.
• We not only confirm that communication management is a key factor in state-of-
the-art algorithms and architectures, but also that loops parallelized using such
algorithms spend around 30% of the execution time handling inter-thread commu-
nication.
The work in this thesis resulted in the scientific publications below and in a granted
and licensed patent.
• D. C. S. Lucas and G. Araujo, “The Batched DOACROSS loop parallelization al-
gorithm,” 2015 International Conference on High Performance Computing & Simu-
lation (HPCS), Amsterdam, 2015, pp. 476-483.
• D. C. S. Lucas and G. Araujo, “The Batched DOACROSS loop parallelization algo-
rithm,” International Journal of Parallel Programming (IJPP), 2017. (submitted).
• D. C. S. Lucas and G. Araujo, “A Method to Parallelize Program Loops with Loop-carried dependences,” B.R. Patent: 10 2014 023779 8, issued May 17, 2016. Licensed by Samsung C&T Corporation on September 04, 2012.
This thesis is organized as follows. In Chapter 2, we present a bibliographic review
and background on loop parallelization algorithms. We also describe in detail two algo-
rithms frequently used to parallelize DOACROSS loops. One of the main contributions
of this thesis, the Batched DOACROSS algorithm, is presented in detail in Chapter 3.
In Chapter 4 we present experimental results for the BDX algorithm and compare these
results with those obtained with other loop parallelization algorithms. The text ends with
our conclusions and ideas for future research topics in Chapter 5.
Chapter 2
Background and Related Work
In this chapter, we present a literature review of research on parallelizing loops with loop-carried dependences. The chapter is organized as follows. In Section 2.1, we review the background definitions and concepts associated with loop parallelism. This is followed by two sections that describe in detail the most relevant loop parallelization algorithms in use today: DOACROSS (Section 2.2) and DSWP (Section 2.3). The descriptions emphasize the communication overhead associated with each of the algorithms, which we later use as motivation for the algorithm proposed in Chapter 3. Finally, in Section 2.5 we review other algorithms that are related to the loop parallelization problem studied in this thesis.
2.1 Background and Definitions
To support the discussion that follows, please consider a machine with a shared memory
and a dual-core processor (cores 0 and 1), each capable of running only a single thread.
Figure 2.1a shows an example of a sequential loop which runs through the pointers of
a linked list incrementing the value of the val field of each node. A complete version of
this code is included in Appendix C. This is an example of the kind of hot (i.e., frequently
executed) loop usually found in general purpose applications written in C/C++ [1, 48].
This loop’s Data Dependence Graph (DDG) is also shown in Figure 2.1a. This is an
example of a DOACROSS loop.
Definition 1. A loop-carried dependence is a data or control dependence between instruc-
tions executing on different iterations of the loop.
Definition 2. A DOACROSS loop is a kind of loop where there is at least one loop-carried
dependence between its iterations.
Loops in this class are characterized by having at least one non-scalar loop-carried dependence. In this example, the loop-carried dependence is on variable ptr, which is assigned the value of ptr->next in iteration i and is later used in iteration i+1 to access the field next. This dependence appears as a self-loop in the DDG shown in Figure 2.1a. It implies that the source and target of the dependence can never be executed in parallel, as they depend on each other. Also notice that, in practice, this data dependence cycle will potentially be composed of several instructions, in some cases even of all instructions of the loop; in that case it would not be possible to execute the loop iterations in parallel.

i1:    while (ptr = ptr->next) {
i2:         ptr->val = ptr->val + 1;
        }
[DDG: node i1 (component A) with a self-loop labeled ptr, and an edge labeled ptr to node i2 (component B)]
(a) A DOACROSS Loop and its DDG.

i1:    while (i < N) {
i2:         c[i] = a[i] + b[i];
i3:         i = i + 1;
        }
[DDG: nodes i1, i2 and i3, with dependence edges labeled i]
(b) A DOALL Loop and its DDG.
Figure 2.1: Examples of DOACROSS and DOALL Loops
For the sake of completeness Figure 2.1b shows an example of a sequential loop which
sums the entries of two integer arrays and stores the output in a third array. This is
an illustration of the kind of loop frequently found in scientific, mathematical and/or
multimedia applications [1, 48]. The DDG of this loop is shown in Figure 2.1b. This is
an example of a DOALL loop.
Definition 3. A DOALL loop is a kind of loop whose iterations are data and control
independent. There are no non-scalar loop-carried dependences in the loop DDG.
There are several automatic [78, 92] and semi-automatic [77] approaches to parallelizing DOALL loops, and they usually produce almost linear speedups since the iterations can execute independently of one another. DOALL loops are usually called embarrassingly parallel [48, 67].
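As an illustration only, the array-sum loop of Figure 2.1b could be parallelized by hand with static chunking, as sketched below. This is not the output of any of the cited approaches; the names doall_add, add_chunk, chunk_t and MAX_THREADS are hypothetical, and POSIX threads are assumed to be available.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 16  /* hypothetical upper bound for this sketch */

/* Work descriptor for one thread: a contiguous chunk [begin, end). */
typedef struct {
    const int *a, *b;
    int *c;
    size_t begin, end;
} chunk_t;

/* Each thread runs the original loop body over its own chunk. */
static void *add_chunk(void *arg) {
    chunk_t *ch = arg;
    for (size_t i = ch->begin; i < ch->end; i++)
        ch->c[i] = ch->a[i] + ch->b[i];
    return NULL;
}

/* Splits the n iterations into nthreads nearly equal chunks.
 * Returns 0 on success, -1 on bad arguments or thread-creation failure. */
static int doall_add(const int *a, const int *b, int *c, size_t n, int nthreads) {
    pthread_t tid[MAX_THREADS];
    chunk_t chunk[MAX_THREADS];
    int started = 0, status = 0;

    if (nthreads < 1 || nthreads > MAX_THREADS)
        return -1;
    for (int t = 0; t < nthreads; t++) {
        chunk[t] = (chunk_t){ a, b, c, n * t / nthreads, n * (t + 1) / nthreads };
        if (pthread_create(&tid[t], NULL, add_chunk, &chunk[t]) != 0) {
            status = -1;
            break;
        }
        started++;
    }
    /* No iteration depends on another, so the only synchronization is the join. */
    for (int t = 0; t < started; t++)
        pthread_join(tid[t], NULL);
    return status;
}
```

Because the iterations are independent, no data is communicated between threads while the loop runs, which is why such loops tend to scale well.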
From now on we will focus the discussion only on DOACROSS loops. Consider again the DDG of Figure 2.1a. If we compute the Strongly Connected Components (SCCs) [95] of this graph, we end up with two components: one containing instruction i1 and the other containing instruction i2. It is not possible to execute different iterations of the same component in parallel [48]; therefore, many algorithms rely on executing iterations of different components in parallel. The algorithms that we study and propose in this thesis exploit that insight. It is desirable to increase the size of the components to amortize communication overheads. For this reason, several components are frequently grouped together, forming a stage.
Definition 4. A loop’s stage is composed of one or more SCCs of the loop’s DDG grouped together.
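To make the SCC computation concrete, the sketch below runs Tarjan's algorithm on a small adjacency-matrix DDG. It is only an illustration: the types ddg_t and tarjan_t, the function ddg_scc and the limit MAX_N are hypothetical, and the edges used for the example loop (a self-loop on i1 and an edge from i1 to i2) are our reading of Figure 2.1a.

```c
#include <string.h>

#define MAX_N 16   /* hypothetical limit on DDG nodes for this sketch */

typedef struct {
    int n;                      /* number of instructions (nodes) */
    int adj[MAX_N][MAX_N];      /* adj[u][v] != 0: v depends on u */
} ddg_t;

typedef struct {
    const ddg_t *g;
    int index[MAX_N], low[MAX_N], on_stack[MAX_N];
    int stack[MAX_N], sp, next_index, ncomp;
    int *comp;                  /* comp[v] = SCC id of node v */
} tarjan_t;

static void strongconnect(tarjan_t *t, int v) {
    t->index[v] = t->low[v] = t->next_index++;
    t->stack[t->sp++] = v;
    t->on_stack[v] = 1;
    for (int w = 0; w < t->g->n; w++) {
        if (!t->g->adj[v][w])
            continue;
        if (t->index[w] < 0) {              /* tree edge: recurse */
            strongconnect(t, w);
            if (t->low[w] < t->low[v])
                t->low[v] = t->low[w];
        } else if (t->on_stack[w] && t->index[w] < t->low[v]) {
            t->low[v] = t->index[w];        /* back edge into the current SCC */
        }
    }
    if (t->low[v] == t->index[v]) {         /* v is the root of an SCC */
        int w;
        do {
            w = t->stack[--t->sp];
            t->on_stack[w] = 0;
            t->comp[w] = t->ncomp;
        } while (w != v);
        t->ncomp++;
    }
}

/* Fills comp[0..n-1] with SCC ids and returns the number of SCCs. */
static int ddg_scc(const ddg_t *g, int *comp) {
    tarjan_t t;
    memset(&t, 0, sizeof t);
    t.g = g;
    t.comp = comp;
    for (int v = 0; v < g->n; v++)
        t.index[v] = -1;
    for (int v = 0; v < g->n; v++)
        if (t.index[v] < 0)
            strongconnect(&t, v);
    return t.ncomp;
}
```

With node 0 standing for i1 and node 1 for i2, setting adj[0][0] and adj[0][1] yields two components, matching the description above. A stage would then be formed by assigning several component ids to the same group.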
In the graph of Figure 2.1a we have two SCCs, namely A and B, containing respectively
instruction i1 and i2. It is also important to keep the size of the stages balanced to
maximize the effective parallelism of the loop [66].
It is difficult to speed up the example loop through thread-level parallelization. There are several reasons for this, among them:
• The small loop body leaves little room to compensate for the instructions executed for communication, synchronization, thread creation, etc.
• In the serial version of the loop, the induction variable might be kept in a register or in the core’s private cache. In the parallel version, usually neither of these happens.
• It might be trivial to use data prefetching in the serial version, while in the parallel version the data accesses are spread among the threads.
• The loop’s trip count is unknown, which makes static distribution of iteration chunks among threads difficult and thus might hinder static load balancing [44].
• Consecutive iterations of the loop tend to access consecutive data in memory, which will lead to false sharing in the parallelized version if proper care is not taken [34, 57, 73].
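The last item can be illustrated with a common mitigation, sketched here under the assumption of a 64-byte cache line (the actual line size depends on the target processor). The type names and the per_thread array are hypothetical and are not taken from the thesis code.

```c
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE 64  /* assumed line size; check the target processor */

/* Unpadded: counters of different threads may share one cache line,
 * so a write by one thread invalidates the line in the other cores. */
typedef struct { long count; } counter_t;

/* Padded: the alignment requirement rounds sizeof up to CACHE_LINE,
 * so each array element occupies a cache line of its own. */
typedef struct { alignas(CACHE_LINE) long count; } padded_counter_t;

/* One slot per thread; thread t would only touch per_thread[t]. */
static padded_counter_t per_thread[4];
```

The trade-off is memory: each padded slot uses a full line even though only a few bytes carry data.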
To give concrete evidence of these overheads in parallel programs, we parallelized the example loop using three algorithms for DOACROSS loops (BDX, DOAX and DSWP). Figure 2.2 and Figure 2.4 show the source code of each parallelized version of the loop, and Figure 2.3 illustrates the loop during execution. For the discussion in the next sections we use two metrics to quantitatively evaluate the inter-core communication overhead of these algorithms:
Definition 5. Comm-instr: the number of instructions executed by a thread to handle
communication at each loop iteration.
Definition 6. Mem-instr: the number of shared-memory load/store instructions executed,
at each iteration, to communicate dependences.
For a given parallelized loop, comm-instr reports the number of assembly instructions added to the original loop code to allow data communication or to synchronize access to shared variables. The value accounts for all kinds of instructions: control (e.g., jumps, cmps), arithmetic (e.g., add, sub) and memory access (e.g., load, store). This metric measures how many instructions the parallelizing algorithm requires to enable communication; therefore, it gives insight into how intrusive the algorithm is with respect to the original loop behavior.
The mem-instr metric measures not only instructions executed to forward dependences between iterations, but also instructions executed to access shared variables that control synchronization between threads (i.e., synchronization flags). Note that every time a consumer thread checks whether its dependence has already been computed, it needs to check whether the producer thread has already updated a shared flag variable signaling that the data is available. Also note that mem-instr measures only the cost of load and store instructions from and to shared memory, given that they are the most relevant components of the communication overhead and that intra-thread loads and stores are served by private caches or accessed directly through registers.

Thread 1
i1:    while (ptr = consume(1)) {
i2:         produce(2, ptr->next);
i3:         ptr->val = ptr->val + 1;
        }
Thread 2
i1:    while (ptr = consume(2)) {
i2:         produce(1, ptr->next);
i3:         ptr->val = ptr->val + 1;
        }
(a) DOACROSS Parallelized.

Pipe Stage 1
i1:    while (ptr = ptr->next) {
i2:         queue12.push(ptr);
        }
Pipe Stage 2
i1:    while (ptr = queue12.pop()) {
i2:         ptr->val = ptr->val + 1;
        }
(b) DSWP Parallelized.
Figure 2.2: Parallelization of the loop in (2.1a) using DOACROSS and DSWP.

[Execution timelines of stages A and B: (a) Sequential, on Core 0; (b) DOACROSS, (c) DSWP and (d) BDX, each on Core 0 and Core 1]
Figure 2.3: Thread communication when using DOACROSS (2.3b), DSWP (2.3c) and BDX (2.3d) for the stages (2.3a) of program 2.1a.
A careful analysis of the optimized x86-64 assembly code generated for DOAX, DSWP and BDX was performed to count the instructions for these two metrics. Table 2.1 lists these values for each algorithm, and the following sections analyze them in detail. The “/BS” in the BDX values indicates that the number is further amortized over the number of iterations in a BDX batch. Although the instruction count was done statically, it gives a lower-bound estimate of the communication overhead.
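The effect of the “/BS” amortization can be worked out directly from the values listed in Table 2.1 (15 and 4 for DOACROSS, 125 and 6 for DSWP, 16/BS and 6/BS for BDX). The helper names below are hypothetical and only perform the arithmetic; they say nothing about measured execution time.

```c
/* Static per-iteration costs from Table 2.1. */
enum { DOAX_COMM = 15, DSWP_COMM = 125, BDX_COMM_PER_BATCH = 16 };
enum { DOAX_MEM = 4, DSWP_MEM = 6, BDX_MEM_PER_BATCH = 6 };

/* BDX pays its communication cost once per batch of bs iterations,
 * so the per-iteration share is the batch cost divided by bs. */
static double bdx_comm_per_iter(int bs) {
    return (double)BDX_COMM_PER_BATCH / bs;
}

static double bdx_mem_per_iter(int bs) {
    return (double)BDX_MEM_PER_BATCH / bs;
}
```

For example, with a batch of 100 iterations the amortized values are 16/100 = 0.16 comm-instr and 6/100 = 0.06 mem-instr per iteration, while with a batch of a single iteration BDX would pay the full 16 and 6.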
        while (OriginalLoopCondition == True) {
             int IdxA = 0, IdxB = 0;

             acquire(&StageA_Lock);
i0:          ptr = Sptr;
i1:          while (IdxA < BS && (ptr = ptr->next))
i2:               BdxBufferPtrA[IdxA++] = ptr;
i3:          OriginalLoopCondition = (ptr != Null);
i4:          Sptr = ptr;
             release(&StageA_Lock);

             acquire(&StageB_Lock);
i1:          while (IdxB < IdxA) {
i2:               ptr = BdxBufferPtrA[IdxB++];
i3:               ptr->val = ptr->val + 1;
             }
             release(&StageB_Lock);
        }
(a) BDX parallelized - Thread 1
(b) BDX parallelized - Thread 2 (same code as Thread 1)
Figure 2.4: Source code for threads composing BDX parallelized example loop.
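The structure of Figure 2.4 can be approximated with POSIX threads as in the sketch below. This is a simplified illustration under our own assumptions, not the code of Appendix D: plain mutexes stand in for acquire/release, the names bdx_shared_t, bdx_worker and bdx_run are hypothetical, the calling thread acts as one of the two workers, and loop termination is detected from the batch being empty rather than from a shared condition variable. Plain mutexes do not enforce any order between batches in stage B, which is acceptable here only because stage B of this loop has no loop-carried dependence.

```c
#include <pthread.h>
#include <stddef.h>

#define BS 4  /* batch size, kept tiny here for illustration */

typedef struct node { int val; struct node *next; } node_t;

/* State shared by the worker threads (names are hypothetical). */
typedef struct {
    pthread_mutex_t stage_a_lock, stage_b_lock;
    node_t *sptr;   /* loop-carried pointer, handed from batch to batch */
} bdx_shared_t;

static void *bdx_worker(void *arg) {
    bdx_shared_t *sh = arg;
    node_t *batch[BS];              /* thread-local buffer */
    for (;;) {
        int n = 0;
        /* Stage A: run the pointer-chasing component for up to BS iterations. */
        pthread_mutex_lock(&sh->stage_a_lock);
        node_t *ptr = sh->sptr;
        while (n < BS && ptr != NULL && (ptr = ptr->next) != NULL)
            batch[n++] = ptr;
        sh->sptr = ptr;             /* forwarded once per batch, not per iteration */
        pthread_mutex_unlock(&sh->stage_a_lock);
        if (n == 0)
            break;                  /* list exhausted */
        /* Stage B: consume the local batch. */
        pthread_mutex_lock(&sh->stage_b_lock);
        for (int i = 0; i < n; i++)
            batch[i]->val = batch[i]->val + 1;
        pthread_mutex_unlock(&sh->stage_b_lock);
    }
    return NULL;
}

/* Runs "while (ptr = ptr->next) ptr->val = ptr->val + 1" starting at head.
 * Returns the number of threads that took part (1 if no helper was created). */
static int bdx_run(node_t *head) {
    bdx_shared_t sh;
    pthread_t helper;
    sh.sptr = head;
    pthread_mutex_init(&sh.stage_a_lock, NULL);
    pthread_mutex_init(&sh.stage_b_lock, NULL);
    int have_helper = (pthread_create(&helper, NULL, bdx_worker, &sh) == 0);
    bdx_worker(&sh);                /* the caller acts as the other thread */
    if (have_helper)
        pthread_join(helper, NULL);
    pthread_mutex_destroy(&sh.stage_a_lock);
    pthread_mutex_destroy(&sh.stage_b_lock);
    return have_helper ? 2 : 1;
}
```

While one thread holds the stage B lock and processes its batch, the other may already be filling its own buffer in stage A, which gives the pipeline-like overlap suggested by Figure 2.3d.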
2.2 DOACROSS Parallelization Algorithm
DOAX¹, as proposed by Ron Cytron [22], is the algorithm that eventually gave its name to loops with loop-carried dependences, since it was the first known approach to parallelizing such loops. The central idea is to parallelize the loop by assigning each iteration to a different thread, in round-robin order, and executing parts of different iterations in parallel. However, as DOACROSS loops have loop-carried dependences, inter-thread communication is required to manage them.
Figure 2.2a shows the loop of Figure 2.1a parallelized using DOAX. Note that all
threads execute the same slightly modified version of the original sequential loop body; the
main difference is the addition of functions to communicate the loop-carried dependences.
In the parallelized version, each stage of the loop is surrounded by function calls to send
and receive the loop-carried variables of that stage.
The main sources of overhead in DOAX parallelized loops are the communication direc-
tives [44,89]. The consume(X) directive is usually implemented as a function that makes
thread X wait for a signal from another thread (executing a previous iteration), which
indicates that the data which iteration X depends upon is already available. Function
produce(Y, Z) is responsible for signaling thread Y that the data Z is already computed.
Usually each thread executes in a different core and, since each thread (in turn) produces
data that will be subsequently consumed by another thread, the cache line that stores the
produced data will keep bouncing between each core’s private cache.
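A minimal sketch of how such directives might be implemented with a flag and busy-waiting is shown below, using C11 atomics. It is an assumption-laden illustration, not the implementation measured in this thesis: the mailbox_t type and try_consume are hypothetical, and the functions take a mailbox pointer rather than the thread identifiers used in Figure 2.2a.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* One-slot mailbox owned by the receiving thread (hypothetical type). */
typedef struct {
    atomic_bool full;   /* synchronization flag */
    void *data;         /* forwarded loop-carried value */
} mailbox_t;

/* produce: wait for the slot to be free, publish the value, raise the flag. */
static void produce(mailbox_t *to, void *value) {
    while (atomic_load_explicit(&to->full, memory_order_acquire))
        ;  /* previous value not consumed yet */
    to->data = value;
    atomic_store_explicit(&to->full, true, memory_order_release);
}

/* Non-blocking variant: returns false if no value is available yet. */
static bool try_consume(mailbox_t *from, void **out) {
    if (!atomic_load_explicit(&from->full, memory_order_acquire))
        return false;
    *out = from->data;
    atomic_store_explicit(&from->full, false, memory_order_release);
    return true;
}

/* consume: busy-wait on the shared flag until the producer signals. */
static void *consume(mailbox_t *from) {
    void *value;
    while (!try_consume(from, &value))
        ;  /* spin */
    return value;
}
```

Both the flag and the data slot are written by one core and read by another, so with this scheme the cache line holding them moves between the cores' private caches at every hand-off.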
Figure 2.3b shows an illustration of the execution of the DOAX parallel version of
the example loop. Note that, during each iteration, DOAX communicate a loop-carried
dependence from A1 to A2 and vice-versa. At each such operation the loop reads, tests
and writes the synchronization flag variable and sends the loop-carried dependence data.
1We use DOAX to refer to the algorithm and DOACROSS to refer to the loop type.
CHAPTER 2. BACKGROUND AND RELATED WORK 20
Algorithm                             Comm-instr   Mem-instr
DOACROSS                              15           4
Decoupled Software Pipeline (DSWP)    125          6
BDX                                   16/BS        6/BS
Table 2.1: Loop parallelization synchronization overhead. BS is the size of the batch
buffer of BDX.
This corresponds to 2 load/stores to shared memory, in each direction, resulting in 4
memory instructions (mem-instr) per iteration (Table 2.1). DOACROSS communication
control is done through a fast busy-waiting loop on a shared-variable, which requires only
15 instructions to execute (comm-instr), as shown in Table 2.1.
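The flag-based handshake described above can be sketched as follows. This is a minimal illustration using C11 atomics, not the code evaluated in this thesis: the type doax_channel and the functions doax_init, doax_produce and doax_consume are hypothetical names, and the one-slot channel per dependence is an assumption.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical one-slot channel between two consecutive iterations. */
typedef struct {
    atomic_int ready;  /* synchronization flag: 1 while data holds an unread value */
    void      *data;   /* loop-carried value being forwarded (e.g., ptr) */
} doax_channel;

static void doax_init(doax_channel *ch) {
    assert(ch != NULL);
    atomic_init(&ch->ready, 0);
    ch->data = NULL;
}

/* Producer side: wait until the slot is free, write the value, raise the flag. */
static void doax_produce(doax_channel *ch, void *value) {
    while (atomic_load_explicit(&ch->ready, memory_order_acquire) != 0)
        ;  /* busy-wait: previous value not yet consumed */
    ch->data = value;
    atomic_store_explicit(&ch->ready, 1, memory_order_release);
}

/* Consumer side: spin on the flag, read the value, lower the flag. */
static void *doax_consume(doax_channel *ch) {
    while (atomic_load_explicit(&ch->ready, memory_order_acquire) == 0)
        ;  /* busy-wait: value not yet produced */
    void *value = ch->data;
    atomic_store_explicit(&ch->ready, 0, memory_order_release);
    return value;
}
```

Each forwarded value costs a read, a test and a write of the flag on both sides, plus the data transfer itself. Because the flag and the data are written by one core and read by another, their cache line moves between the private caches at every iteration.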
Since the original work by Cytron [22], much research has been done to improve
DOACROSS [22, 44, 66, 75–77, 89–92]. Recently, researchers have focused on innovative
ways to reduce DOACROSS overhead through the elimination of redundant synchro-
nization operations [89], or by removing synchronization barriers from nested DOALL
loops [42]. Some other algorithms use data prefetch to improve communication la-
tency [15–17].
Unlike DOAX and its variations, the BDX algorithm proposed in this thesis
(see Chapter 3) maximizes the storage of inter-iteration data in thread-local buffers
(Figure 2.3d). As detailed later in this thesis, BDX is a portable and effective approach to
the problem of loop parallelization.
2.3 Decoupled Software Pipeline Parallelization
DSWP [66, 82] is a Pipeline Multithread (PMT) algorithm for parallelizing loops with
loop-carried dependences. It transforms the loop’s body in a way that it becomes a
computational pipeline where each thread executes a different stage of the loop and data
dependences flow only in one direction among the stages. Two quite important features
of DSWP are: (a) it enables decoupled execution of parts of the loop body; and (b) unlike
DOAX, it does not severely suffer from inter-thread communication latency.
An overall description of the DSWP algorithm is as follows. The input to the algorithm
is a loop with loop-carried dependences, like the one shown in Figure 2.1a. In the first
step, DSWP constructs the loop’s Program Dependence Graph (PDG) [31], which is the
union of the Data and Control Dependence Graphs [1]. In the second and third steps,
DSWP finds the SCCs of the PDG and orders them as a sequence of stages. These stages
will be computational units of the pipeline and each one will be executed by a different
thread.
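The second and third steps can be sketched with Tarjan's algorithm, which finds the SCCs and, as a by-product, emits them in reverse topological order. The code below is only an illustration under stated assumptions: the adjacency-matrix type dep_graph, the limit MAX_NODES and the function pdg_to_stages are hypothetical, and the sketch assigns one stage per SCC instead of grouping several SCCs into a stage.

```c
#include <assert.h>

#define MAX_NODES 8

typedef struct {
    int n;                            /* number of PDG nodes in use */
    int edge[MAX_NODES][MAX_NODES];   /* edge[u][v] = 1 means v depends on u */
} dep_graph;

typedef struct {
    int index[MAX_NODES], low[MAX_NODES], on_stack[MAX_NODES];
    int stack[MAX_NODES], sp, next_index;
    int scc_of[MAX_NODES], scc_count;
} tarjan_state;

static void strongconnect(const dep_graph *g, tarjan_state *s, int u) {
    s->index[u] = s->low[u] = s->next_index++;
    s->stack[s->sp++] = u;
    s->on_stack[u] = 1;
    for (int v = 0; v < g->n; v++) {
        if (!g->edge[u][v])
            continue;
        if (s->index[v] < 0) {                 /* tree edge: recurse */
            strongconnect(g, s, v);
            if (s->low[v] < s->low[u])
                s->low[u] = s->low[v];
        } else if (s->on_stack[v] && s->index[v] < s->low[u]) {
            s->low[u] = s->index[v];           /* back edge into the current SCC */
        }
    }
    if (s->low[u] == s->index[u]) {            /* u is the root of an SCC */
        int v;
        do {
            v = s->stack[--s->sp];
            s->on_stack[v] = 0;
            s->scc_of[v] = s->scc_count;
        } while (v != u);
        s->scc_count++;
    }
}

/* Fills stage_of[] with the stage of each node and returns the stage count. */
static int pdg_to_stages(const dep_graph *g, int stage_of[MAX_NODES]) {
    tarjan_state s = { .sp = 0 };
    assert(g->n > 0 && g->n <= MAX_NODES);
    for (int u = 0; u < g->n; u++)
        s.index[u] = -1;                       /* -1 marks an unvisited node */
    for (int u = 0; u < g->n; u++)
        if (s.index[u] < 0)
            strongconnect(g, &s, u);
    /* Tarjan numbers sink SCCs first; flip so that stage 0 has no producers. */
    for (int u = 0; u < g->n; u++)
        stage_of[u] = s.scc_count - 1 - s.scc_of[u];
    return s.scc_count;
}
```

For a list-traversal loop modeled with three nodes (pointer update, loop test, value update), where the first two depend on each other and the third depends on both, the sketch yields two stages: the pointer update and the loop test share stage 0, and the value update forms stage 1.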
Because DSWP does not break stages among threads, stages can be organized in a way
that dependences flow in a single direction among all stages, as shown in Fig. 2.3c. This is
possible given that, by construction, there are no dependence cycles among stages. Based
on this data/control dependence flow, stages are then organized in a pipeline. To enable
communication between stages, DSWP uses an inter-thread communication queue. Thus,
A runs independently of B, while filling in dependences required by B into an inter-thread
queue. Thus, contrary to DOACROSS, DSWP decouples the execution of A and B.
Figure 2.2b shows the sequential loop of Figure 2.1a parallelized by DSWP in a two-
stage pipeline. Note that, unlike DOACROSS, each thread executes a different portion
of the original loop body. Note also that communication flows only in one direction,
from stage A1 to stage B2, and that each iteration requires an enqueue or a dequeue
operation to/from the inter-thread queue. A careful evaluation of DSWP’s assembly
code reveals that, at each iteration, it performs 6 loads/stores to shared memory (mem-instr),
using, on average, 125 instructions for that (comm-instr). Therefore, DSWP requires
many more instructions to manage communication than DOAX. This overhead comes
from the fact that every enqueue or dequeue from the software queue requires several
instructions to check if the queue is empty/full, change pointers to head/tail, copy data,
and keep optimization related information updated [46,57,79].
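To make the source of this overhead concrete, the following is a minimal sketch of a single-producer single-consumer software queue in the style of Lamport's ring buffer. It is not the queue used in the experiments of this thesis; the names spsc_queue, spsc_enqueue, spsc_dequeue and QUEUE_SLOTS are hypothetical, and the memory orderings are one reasonable choice under the C11 memory model.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_SLOTS 8   /* one slot always stays empty to tell full from empty */

typedef struct {
    atomic_size_t head;           /* next slot to read; written by the consumer */
    atomic_size_t tail;           /* next slot to write; written by the producer */
    void *slot[QUEUE_SLOTS];
} spsc_queue;

/* Producer: check for a full queue, copy the item, advance the tail pointer. */
static bool spsc_enqueue(spsc_queue *q, void *item) {
    assert(q != NULL);
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t next = (tail + 1) % QUEUE_SLOTS;
    if (next == atomic_load_explicit(&q->head, memory_order_acquire))
        return false;                                   /* queue is full */
    q->slot[tail] = item;
    atomic_store_explicit(&q->tail, next, memory_order_release);
    return true;
}

/* Consumer: check for an empty queue, copy the item, advance the head pointer. */
static bool spsc_dequeue(spsc_queue *q, void **item) {
    assert(q != NULL && item != NULL);
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    if (head == atomic_load_explicit(&q->tail, memory_order_acquire))
        return false;                                   /* queue is empty */
    *item = q->slot[head];
    atomic_store_explicit(&q->head, (head + 1) % QUEUE_SLOTS, memory_order_release);
    return true;
}
```

Even in this stripped-down form, every operation reads both control variables, performs a modular increment, copies the data and publishes a new index. Production queues add batching, padding and prefetching on top of this, which is where the larger instruction counts come from.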
The use of a queue to enable communication between stages has two important impli-
cations. First, each stage can execute almost independently of the others, until the queue
fills/empties. As a result, anything that affects one stage/thread (context switches, page
faults, cache misses, etc.) does not immediately affect other threads in the pipeline - as it
happens with DOACROSS. Second, DSWP overlaps communication with computation.
As DSWP is a PMT algorithm, the communication latency only affects the first message
in the pipeline - that is, the time before the pipeline fills. Therefore, while messages are
in transit from one thread to another, if threads themselves are not suspended, stages can
produce messages at an almost constant rate. If the queues do not become empty, and the
communication bandwidth between threads can keep up with the data being transferred,
communication latency can be almost entirely hidden.
Nonetheless, in practice, it is very difficult to obtain performance speedups using
DSWP [79, 81, 96]. The reason is frequently the lack of an efficient inter-core queue.
When DSWP was originally presented [66] it was proposed to use a hardware-based queue.
However, to date no commodity hardware provides such a mechanism. Thus, all
experiments using DSWP rely on software-based queues or hardware simulators [66,78,90].
As shown in Table 2.1, DSWP executes 125 instructions to control queue communi-
cation (comm-instr), and requires 6 memory instructions to forward data (mem-instr).
The problem with a software queue is the high overhead to execute instructions required
to manage it. This problem makes it difficult to use DSWP as a parallelizing algorithm
for general purpose applications, without the support of new hardware mechanisms. In
such programs, candidate loops for parallelization usually have only a dozen instructions
and the time required to execute the instructions to manage the queue is frequently larger
than the time to execute the loop body itself.
The above analysis suggests that any new loop parallelization algorithm should have
two primary features: (a) a small number of instructions required to manage synchroniza-
tion; and (b) communication operations with reduced frequency and costs.
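The figures of Table 2.1 can be turned into a small per-iteration cost calculation. The sketch below only restates the table; the enum and function names are hypothetical, and reading the BDX entries 16/BS and 6/BS as costs amortized over a batch of BS iterations is our interpretation of the table.

```c
#include <assert.h>

typedef enum { ALG_DOAX, ALG_DSWP, ALG_BDX } algorithm;

/* Communication-control instructions per iteration (comm-instr, Table 2.1). */
static double comm_instr_per_iter(algorithm alg, unsigned batch_size) {
    assert(batch_size > 0);
    switch (alg) {
    case ALG_DOAX: return 15.0;
    case ALG_DSWP: return 125.0;
    case ALG_BDX:  return 16.0 / batch_size;   /* amortized over one batch */
    }
    return 0.0;
}

/* Shared-memory loads/stores per iteration (mem-instr, Table 2.1). */
static double mem_instr_per_iter(algorithm alg, unsigned batch_size) {
    assert(batch_size > 0);
    switch (alg) {
    case ALG_DOAX: return 4.0;
    case ALG_DSWP: return 6.0;
    case ALG_BDX:  return 6.0 / batch_size;    /* amortized over one batch */
    }
    return 0.0;
}
```

For example, with a batch of 16 iterations the BDX entry corresponds to one communication instruction per iteration, against 15 for DOAX and 125 for DSWP.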
DSWP [82] was originally proposed as an approach to decouple the execution
of parts of a loop’s body with the goal of increasing ILP [40]. The algorithm was then
extended to enable automatic parallelization of loops [66]. Additional extensions enabled
the combination of DSWP with a DOALL form of parallelization [78], speculative ex-
ecution [90], and the combination of all these variations together [43]. Unfortunately,
although DSWP can be effective at hiding synchronization latency, it still suffers from a
large overhead due to the cost of the instructions required to manage the inter-thread
communication queue [81]. Solutions have been proposed to minimize the overhead of
communication [79], but its cost is still high, particularly for complex loops which have
many stages [80].
Contrary to DSWP, BDX does not require a queue to work, using a simple and
lightweight synchronization mechanism (Fig. 2.3d), which only communicates loop-carried
dependences after several iterations of the loop have been executed (see Chapter 3).
2.4 BDX Parallelization Idea
Figure 2.4 shows an illustration of the source code of the example loop parallelized using
BDX. This source code is presented here only to help the reader contrast the parallelization
scheme of the three algorithms (BDX, DOAX and DSWP). A full description of BDX is
presented in Chapter 3. A few points to compare the algorithms:
• To communicate loop-carried variables (e.g., ptr), BDX and DOAX both use shared
variables, whereas DSWP uses local variables.
• To communicate loop-independent variables (e.g., ptr from the first stage to the
second), BDX uses thread-local buffers, DOAX uses local variables and DSWP uses
inter-thread queues.
• In DSWP each thread will execute a different part (stage) of the loop, forming a
pipeline. In a BDX or DOAX parallelized loop threads will execute the whole body
of the loop, but for different iterations.
• Note, nevertheless, that all three algorithms assume that stages have similar size.
2.5 Related Works
This section analyzes related work on loop parallelism and is organized in two subsections.
First, Section 2.5.1 details other loop parallelization algorithms. Second,
works that analyze the overheads involved in loop parallelism and discuss ideas to mitigate
them are presented in Section 2.5.2. Of course, some publications fit in more than one of these
categories; in such cases the publication is discussed in the category closest to the work
done.
2.5.1 Other Loop Parallelization Algorithms
Raman et al. observed that after applying the DSWP algorithm to a loop, many stages
become loops that do not have loop-carried dependences and thus can be further paral-
lelized with DOALL. Motivated by this observation the authors proposed an algorithm
called Parallel-Stage Decoupled Software Pipeline (PS-DSWP) which aims to parallelize
DOALL stages created by DSWP-parallelized loops [78]. In general, PS-DSWP results in
speedups that are greater than those of DSWP. PS-DSWP scales well for almost all benchmarks
discussed in [78]; however, for some of them the speedup stabilizes at around 4 threads. It is worthwhile
to mention that this algorithm will produce better results than DSWP only when the
parallel stage(s) is (are) larger than the sequential one(s).
Huang et al. also propose applying parallelizing algorithms to the pipeline stages
created by DSWP [43]. The authors’ insight is that, after applying DSWP, the
created stages have a more amenable dependence graph and thus more parallelism can
be extracted (even from “serial” stages). The proposed algorithm, named DSWP+,
is essentially an extension of PS-DSWP: whereas PS-DSWP only applies DOALL
parallelization to the stages, DSWP+ also applies other kinds of parallelization,
namely DOALL, Spec-DOALL and LOCALWRITE [91]. Unfortunately, they did not show an
algorithm to automatically identify what kind of parallelism can be applied to a loop.
Their results show that for some benchmarks the algorithm achieves large speedups
(maximum of 8x), but for others it does not seem to bring new gains, since applying
traditional optimizations alone yields very similar speedups.
Campanoni et al. propose a new algorithm, called “Helix”, for parallelizing DOACROSS
loops [17]. The approach basically consists of reducing the communication overhead
present in traditional DOACROSS parallelization algorithms using data prefetching. To
implement that, they rely on helper threads: each core in the system is coupled
with a helper thread that employs heuristics to prefetch inter-core communication
data. They evaluate the approach on a 6-core system with 13 benchmarks from the
SPEC 2006 suite. The average speedup is 2.25x, reaching a maximum of 4.12x. Compared
to BDX, their approach has the drawbacks of overloading the system with additional
threads (implicitly assuming that Hyper-Threading is available) and of incurring the
overhead of forwarding the prefetched data through consecutive cores when the producer
and consumer cores/threads are far from each other.
Kim et al. present a system to automatically parallelize loops for execution in large
cluster systems [49]. The system targets DOALL loops and DOACROSS loops
that have dependence edges that can be speculated (i.e., dependences that, during program
execution, manifest less often than a threshold). They also optimize the parallelized loops
to reduce communication. They evaluated the system using 13 programs from the PARSEC [11]
and PolyBench suites. The system sped up the execution of 8 programs,
with an average speedup of 43x on a 120-core system. The other 5 programs suffered
slowdowns due to high communication, frequent mispredictions or the small execution time
of the parallelized loop.
The parallelization of DOALL loops is investigated by Zhong et al. [97]. The authors argue
that the code synthesized by traditional compilers (after optimizations) is structured
in a way that hinders parallelism. They also noticed that many loops were not parallelized
due to control and register dependences and not because of memory loop-carried
dependences. To attack this problem, they propose several transformations (and also use
speculation) that remove data and control dependences from loops so that they become
DOALL parallelizable. They used more
than 40 programs and a simulator to measure the impact of the proposed transformations.
Their results show that program execution coverage by DOALL loops was more than 2x
greater than the original detection with support of Thread Level Speculation (TLS) and
more than 8x the original detection without TLS. These results support what we observed
in our experimental results: statically parallelizing programs usually leads to considerable
communication overhead due to imprecise (conservative) static dependence analysis.
The work of Vandierendonck et al. [92] agrees with Zhong et al. [97] that many
edges present in a statically created dependence graph are due to semantic information
missing from compiler analysis. They try to overcome this problem by proposing a
set of annotations that let the programmer convey semantic information to their parallelizing
compiler back end, called Paralax. This information is afterwards used to produce a
more accurate program dependence graph, which may lead to more freedom when doing
automatic parallelization. They propose an algorithm to suggest places in the program
to insert the annotations and to dynamically check the correctness of the annotations.
Results are reported for four programs (bzip2, mcf06, hmmer and clustalw). The overall
speedup reached 1.79x, 2.06x, 7.0x and 2.33x, respectively. We consider the approach
interesting and promising; however, the low speedup obtained when compared, for instance,
with Spec-DSWP [91] or PS-DSWP [78] makes it difficult to justify maintaining
code annotations during the whole lifetime of a project.
Another promising approach to parallelize general purpose applications, proposed by
Prabhu et al., is called Commutativity Analysis [71]. The idea behind the algorithm is to
find coarse code regions that can be executed out of order; that is, to find code regions A
and B that are commutative. The point of finding commutative regions is to have more
freedom to schedule parallel regions of code. In this technical report, the authors present
a thorough review of the algorithm and present results showing that the algorithm can
considerably speed up programs compiled with GCC at the maximum optimization level.
Results show an impressive average speedup of 10x over serial programs (maximum of 30x
with 32 threads). Prabhu et al. note that the challenge to make the algorithm common
practice is to improve the efficiency and precision of static data dependence analysis [71].
2.5.2 Communication Overhead and its Mitigation
Navarro et al. [62] present a mathematical analysis of programs parallelized using the
pipeline model. The core of the model is the throughput that each node on the pipeline
is able to sustain during execution of the loop. The work compares the execution time
predicted by their model with the measured execution time of the programs. Their model
is able to accurately predict the execution time of programs. They show that a pipeline
with many non-consecutive “parallel” stages is slower than a pipeline with few “parallel”
stages. The reason is that the discrepancies in throughput between the parallel stages
lead to a lower final throughput in the pipeline when compared to the case when fewer
stages are used. The intuition here is that larger parallel stages are better than many
small ones due to the need to communicate data between these stages. Unfortunately,
their experiments are centered on only two benchmarks (from the PARSEC suite): Ferret
and Dedup.
Lee et al. [57, 58] propose MCRingBuffer, a shared ring buffer that embeds a lock-
free, cache-efficient synchronization mechanism and hence can speed up the access to
shared data in multicore architectures. The proposed algorithm is a Single Producer -
Single Consumer lock-free queue that uses batch access to control variables to mitigate
false sharing and cache contention. To achieve that they divide the control structure of
the queue in four parts: (i) shared control variables; (ii) producer control variables; (iii)
consumer control variables; and (iv) constant variables. Padding bytes are inserted among
these groups to keep them in different cache lines, thus preventing false sharing
due to changes in the consumer/producer control variables. Nevertheless, they argue that
padding alone is not sufficient and that producer and consumer should not frequently read
each other’s private control variables. To avoid this, they propose that each one should have
a copy of the other’s private control variables, which are updated in batch mode. They
evaluate the approach against Lamport’s BasicRingBuffer [56] and FastForward [34] and
show that MCRingBuffer has a throughput up to 5x larger than the other approaches.
We used MCRingBuffer for all experiments reported in this text.
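The combination of padding and batched updates of the control variables can be illustrated with the simplified sketch below. It follows the idea described above but is not the published MCRingBuffer implementation: the struct layout, the names batched_ring, ring_insert and ring_extract, the 64-byte cache line and the batch size of 4 are all assumptions made for illustration, and the flush of a partially filled batch is omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_SLOTS 16
#define BATCH      4     /* assumed batch size for publishing control variables */
#define CACHE_LINE 64    /* assumed cache line size in bytes */

typedef struct {
    /* (i) shared control variables, updated only once per batch */
    _Alignas(CACHE_LINE) atomic_uint read;    /* published by the consumer */
    atomic_uint write;                        /* published by the producer */
    /* (ii) producer control variables, including a local copy of read */
    _Alignas(CACHE_LINE) unsigned local_read, next_write, w_batch;
    /* (iii) consumer control variables, including a local copy of write */
    _Alignas(CACHE_LINE) unsigned local_write, next_read, r_batch;
    _Alignas(CACHE_LINE) int slot[RING_SLOTS];
} batched_ring;

static bool ring_insert(batched_ring *r, int item) {
    assert(r != NULL);
    unsigned after = (r->next_write + 1) % RING_SLOTS;
    if (after == r->local_read) {             /* looks full in the local copy */
        unsigned rd = atomic_load_explicit(&r->read, memory_order_acquire);
        if (after == rd)
            return false;                     /* really full */
        r->local_read = rd;                   /* refresh the local copy */
    }
    r->slot[r->next_write] = item;
    r->next_write = after;
    if (++r->w_batch >= BATCH) {              /* publish once per batch */
        atomic_store_explicit(&r->write, r->next_write, memory_order_release);
        r->w_batch = 0;
    }
    return true;
}

static bool ring_extract(batched_ring *r, int *item) {
    assert(r != NULL && item != NULL);
    if (r->next_read == r->local_write) {     /* looks empty in the local copy */
        unsigned wr = atomic_load_explicit(&r->write, memory_order_acquire);
        if (r->next_read == wr)
            return false;                     /* nothing published yet */
        r->local_write = wr;                  /* refresh the local copy */
    }
    *item = r->slot[r->next_read];
    r->next_read = (r->next_read + 1) % RING_SLOTS;
    if (++r->r_batch >= BATCH) {              /* publish once per batch */
        atomic_store_explicit(&r->read, r->next_read, memory_order_release);
        r->r_batch = 0;
    }
    return true;
}
```

In this sketch the consumer cannot see the first three inserted items, because the producer publishes its write index only after the fourth insertion. The shared variables are touched once per batch instead of once per element, which is the effect the batched scheme aims for.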
Jablin et al. [46] explain the mechanism behind the Liberty Queue (L-Queue), the
Single Producer Single Consumer (SPSC) software queue implementation used by the
Princeton research group that proposed DSWP [66], PS-DSWP [78], DOMORE [42],
Parcae [77] and DOALL for Clusters [49]. In this paper they describe the challenges of porting
the L-Queue implementation from x86-64 to IA-64. They also show the performance of
L-Queue (in gigabytes/s) when compared to other implementations, specifically
FastForward [34], on seven different machines. The L-Queue implementation was tuned to the
compiler and hardware used in their previous experiments. The most similar queue to
L-Queue is MCRingBuffer. Like MCRingBuffer, the L-Queue infrequently updates the variables
shared between the producer and consumer to prevent cache line thrashing;
this is implemented by updating shared variables only when a batch of elements has
been produced to or consumed from the queue. However, the Liberty Queue goes further
and uses SSE instructions to prefetch elements from the queue. They use characteristics
specific to the compiler code generation/scheduling to elide the use of barriers between
stores/loads and memory updates. The performance of the queue is about 1 GB/s.
Preudhomme et al. [73] propose a simple approach for reducing access contention and
false-sharing in software inter-core queues. Their proposed approach, called Clustered
Software Queue (CSQ), uses batch updates to the queue to further decouple producer
and consumer access to the queue. The idea is to partition the queue space in two and
make each of the actors work on a different portion of the queue. That is, while the
producer is enqueuing items in one part, the consumer would be removing items from the
other cluster. The idea does not seem quite different from what MCRingBuffer and other
approaches do, which is basically artificially creating some distance between the indexes
where the producer and consumer are working. Furthermore, it seems that this approach
would not perform very well when the stages are very imbalanced. They compare the
approach against FastForward, MCRingBuffer and inter-process communication. They
argue that their approach is better than the others due to its simple implementation and
use of fewer control variables. Chen et al. [18] describe a similar algorithm; the novelty
of the paper is that the algorithm is intended for weakly-ordered memory models.
Lee et al. [59] describe an approach to speed up queue operations by using hardware
mechanisms. The authors argue that current software implementations of Single Producer -
Single Consumer queues have a large instruction footprint and, based on that, they
propose to augment the processor’s Instruction Set Architecture (ISA) with instructions
that can manipulate RAM-backed queues. Such queues would reside in user-level memory
and not require any support from the operating system. The queue management algorithm
proposed to be implemented in hardware is the previously mentioned MCRingBuffer.
They implemented the suggested approach on a simulator and compared it with
three other approaches to manage queues: Lamport’s algorithm [56], FastForward [34] and
MCRingBuffer. The hardware-accelerated queue significantly outperformed the other
approaches.
Luff et al. [60] propose a different, albeit similar, approach to improve software inter-core
queue efficiency. They propose to augment the processor ISA with a single new instruction,
called StRemote, which would receive three parameters: (1) a word of data to be
copied; (2) a destination core ID; and (3) a destination core memory address. The semantics
is to copy the data word to the memory location indicated in the parameter. The
idea is that the producer core will insert this instruction directly into the consumer core
instruction stream which will eventually make the instruction execute and therefore bring
the data from the producer core to the consumer core cache. The goal is to act like a data
prefetcher. When the consumer core needs the data, it will be already present in its pri-
vate cache. They use an architectural simulator to study how well this approach performs
against MCRingBuffer when both are used to enable execution of programs parallelized
with DSWP. Their results show that the new approach has a latency much smaller than
MCRingBuffer, however the increase in the pipeline throughput is small. The proposed
approach outperforms MCRingBuffer only when the stages are highly imbalanced.
Rul et al. [83] present a framework for finding pipeline parallelization opportunities.
The tool consists of a two-step approach. The first step instruments the code to collect
profiling information, which is basically execution counts and materialized data dependences.
Based on this profiling information they present to the programmer a collection
of loops that can be parallelized with DOALL, DSWP or PS-DSWP with their respective
expected speedups; the programmer is then responsible for choosing which loops should
be parallelized. They identify which loop can be parallelized based on the profile informa-
tion and looking for patterns of “pipeline-able” loops on the loop PDG - i.e., how many
stages are there in the loop? Are they small? They compare the number of parallelizable
loops that their framework finds and the number of loops found by a static compiler. As
expected, their framework finds more parallelizable loops. In general, no speedup is perceived
in loops parallelized only with DSWP. The real speedup comes from using
algorithms such as PS-DSWP to parallelize loops that have some parallel stage that represents
a large fraction of the loop body. They do not clarify some relevant details of their approach,
leaving open questions such as: (a) which queue was used; (b) how many stages were used
to parallelize the loops; and (c) how many loops were parallelized in the programs.
The work done by Thies et al. [87] presents some quite interesting results. The pa-
per argues that the data communication in streaming applications is stable during the
execution of the program and across different executions. Based on this observation they
propose a tool for profiling the data dependences of an application (using small test in-
puts) and use the profiling information to construct a pipeline parallel loop that can
execute larger input sets. That is, they show that the data dependence pattern observed
in the test input is the same as in the larger input sets.
Pingali et al. [70] make a series of claims about the limits of the dependence graph for
representing data dependences of irregular structured applications. Their main claim is that
many data dependences are functions of dynamically generated data (or program input)
and thus are hard to model statically. Another drawback of the dependence graph comes
from the fact that it cannot represent commutative execution order. To address this, the
authors introduce a new data-centric way of viewing the parallelism in loops. Basically,
the programming model they propose analyzes the dependences of a given region based on
the format of the data structure the region accesses.
Chapter 3
Batched DOACROSS
This chapter presents Batched DOACROSS (BDX), an algorithm that capitalizes on the
advantages of both DOAX and DSWP while minimizing their deficiencies.
3.1 The BDX Execution
Before moving into detail, we can briefly state that BDX exploits parallelism in a manner
that combines DOAX and DSWP. In its first three steps BDX is like DSWP as
it: (a) constructs the PDG of the loop to be parallelized; (b) identifies SCCs; and (c)
groups SCCs into stages. In the remaining steps BDX resembles DOACROSS as it: (a)
executes SCCs; and (b) forwards loop-carried dependences between threads. Moreover,
BDX employs the idea of executing each stage in batch mode to reduce the amount of
communication and synchronization required to execute the loop.
To detail the workings of BDX and for the sake of simplicity please consider the same
example of linked-list traversal presented in Fig. 2.1a (Chapter 2) which divides the loop
into stage A that traverses the list and stage B that updates the value of each list record.
When parallelizing a loop with BDX the number of threads executing the loop is equal
to the number of stages in which the loop was partitioned. Moreover, as illustrated in
Fig. 3.1, both stages A and B are present in the code of every thread, and each thread
executes each stage in turn. Appendix D shows a complete version of the BDX parallelized
version of the example loop. Parallelism occurs because during steady state execution of
the parallelized loop each thread runs a different stage. In the discussed example, two
threads, each containing the code of Fig. 3.1, run in parallel, but while one of
them executes stage A the other executes stage B, and vice versa. In this example
the communication between stages A and B occurs by means of a buffer, BdxBufferPtrA,
which is indexed using two index variables: IdxA in stage A and IdxB in stage B.
The parallelization scheme of BDX takes care of minimizing stage communication,
and is explained in detail below with the help of Fig. 3.2a, a detailed version of Fig. 2.3d.
There are two types of communication that a loop parallelization algorithm (that as-
sumes a pipelined execution model) needs to handle: (a) communication of loop-carried
dependence(s) between threads; and (b) communication of regular loop-independent de-
pendences that need to be communicated from one stage to another.
CHAPTER 3. BATCHED DOACROSS 29
while (OriginalLoopCondition == True) {
    int IdxA = 0, IdxB = 0;

    /* Stage A: advance the loop-carried pointer and fill the batch buffer */
    acquire(&StageA_Lock);
    ptr = Sptr;
    while (IdxA < BS && (ptr = ptr->next))
        BdxBufferPtrA[IdxA++] = ptr;
    OriginalLoopCondition = (ptr != NULL);
    Sptr = ptr;
    release(&StageA_Lock);

    /* Stage B: consume the batch that stage A produced in this thread */
    acquire(&StageB_Lock);
    while (IdxB < IdxA) {
        ptr = BdxBufferPtrA[IdxB++];
        ptr->val = ptr->val + 1;
    }
    release(&StageB_Lock);
}
Figure 3.1: BDX Parallelized
Each stage of the loop has potentially many loop-carried dependences, with at least one
associated with the induction variable of the loop. To migrate execution of a stage from
one thread to another, it is necessary to communicate the last value of the loop-carried
dependence(s), so that the destination thread knows from which iteration it should
resume execution of the loop. Shared variables, one for each combination of stage and
loop-carried variable, are used to do that. This is illustrated by the rectangles on the top
of the dashed line between the two threads of Fig. 3.2a. In this example only one shared
variable (Sptr) is required, since there is only one loop-carried variable (the loop induction
variable) and it is used only in the first stage. The dashed line marks the expensive
inter-core communication boundary.
Loop-independent dependences also need special attention. These dependences exist
between instructions of the same iteration in the original sequential loop (e.g., depen-
dence between A1 and B1 in Fig. 3.2a). In the original serial loop these dependences
are communicated through registers or through private caches, but they are all produced
and consumed inside the same iteration of the loop. In DOAX such dependences do not
cross thread boundaries, while in DSWP they are communicated using queues. Since each
iteration of a stage may produce different values to supply these dependences, and BDX
executes several iterations of one stage before starting execution of another one, a single
variable is not enough to store these dependences. Because of that, BDX maintains a
thread-local buffer for each loop-independent dependence in each stage that produces data
that might be consumed by a later stage. In Fig. 3.1 there is only one buffer, which is
called BdxBufferPtrA, and is associated with the variable ptr produced by stage A. In
Fig. 3.2a the BdxBufferPtrA is represented by an array of rectangles beside each thread.
When the thread later starts executing a different stage and needs to consume data
to fulfill a loop-independent dependence, the stage uses its induction variable to index
the buffer that stores values for that variable in the stage that is the source of the
dependence. Notice that all stages in a thread execute the same number of iterations.
In Fig. 3.1, stage B uses its induction variable, called IdxB, to index into stage A’s
local buffer for the loop-independent dependence of ptr, that is, BdxBufferPtrA.
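The interplay between the shared loop-carried variable and the thread-local batch buffer can be sketched as runnable C. The sketch below executes both stages from a single thread and omits the locks, so it only illustrates the batching and indexing, not the parallel execution. The function names, the batch size of 4 and the convention that the list head is a sentinel node (so that the loop advances the pointer before using it) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define BS 4   /* batch size (assumed value) */

typedef struct node {
    int val;
    struct node *next;
} node;

/* Stage A: starting from the shared loop-carried pointer, visit up to BS
 * nodes and record each one in the batch buffer. Returns the number of
 * buffered entries (IdxA) and stores the new loop-carried value in *sptr. */
static int bdx_stage_a(node **sptr, node *buffer[BS]) {
    assert(sptr != NULL && *sptr != NULL);
    node *ptr = *sptr;
    int idx_a = 0;
    while (idx_a < BS && (ptr = ptr->next) != NULL)
        buffer[idx_a++] = ptr;
    *sptr = ptr;   /* NULL once the end of the list has been reached */
    return idx_a;
}

/* Stage B: consume the batch, indexing the buffer with its own counter. */
static void bdx_stage_b(node *buffer[BS], int count) {
    for (int idx_b = 0; idx_b < count; idx_b++)
        buffer[idx_b]->val = buffer[idx_b]->val + 1;
}

/* Runs stage A and then stage B for every batch; returns the batch count. */
static int bdx_run_single_thread(node *head) {
    node *sptr = head;            /* plays the role of the shared variable Sptr */
    int batches = 0;
    while (sptr != NULL) {
        node *buffer[BS];         /* plays the role of BdxBufferPtrA */
        int produced = bdx_stage_a(&sptr, buffer);
        bdx_stage_b(buffer, produced);
        batches++;
    }
    return batches;
}
```

With a sentinel head followed by five records and a batch size of 4, the sketch runs two batches (four records, then one) and increments every record exactly once. Only the final pointer value crosses from one batch to the next; the per-iteration values stay in the buffer.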
Parallel-Stage Batched DOACROSS. There is one simple optimization for the BDX
parallelization scheme presented above that can produce a considerable performance
improvement under a certain common scenario. During the process of parallelization, the
loop is split in several stages based on the SCCs of the PDG. It may happen that some
of these stages do not have any loop-carried dependence(s). If that happens, it means
that this portion of the loop may be executed in parallel without any race condition. Therefore,
for those stages, it is possible to omit the synchronization directives that restrict
the stage to being executed by only one thread at a time. This simplification itself may produce
some small improvement because it is less code to be executed and less communication
happening between the threads. However, this characteristic also enables an important
optimization: it makes it possible to execute the loop with a higher number of threads
than in the basic BDX scheme. As mentioned before, in the basic parallelization scheme
the number of threads executing the loop is equal to the number of stages in the loop -
because each thread will execute a different stage in parallel. However, if a stage may be
executed by more than one thread the whole loop may be executed by a higher number of
threads. The optimum number of threads executing the stage (and therefore the loop) is
dependent on the relative size of the parallel stage(s) and the sequential one and therefore
will be different for each loop. We call PS-BDX the parallelization scheme that exploits
this characteristic of the loops.
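As a rough illustration of how the relative stage sizes bound the useful number of threads, consider the back-of-the-envelope model below. It is an assumption introduced here, not a result of this thesis: it supposes a single sequential stage costing seq_cost time units per batch, a parallel stage costing par_cost per batch, and no communication cost. Under that model the sequential stage saturates once the thread count reaches (seq_cost + par_cost) / seq_cost, rounded up. The function name is hypothetical.

```c
#include <assert.h>

/* Upper bound on the useful thread count under the simple model above:
 * ceil((seq_cost + par_cost) / seq_cost), computed with integer arithmetic. */
static unsigned psbdx_thread_estimate(unsigned seq_cost, unsigned par_cost) {
    assert(seq_cost > 0);
    return (seq_cost + par_cost + seq_cost - 1) / seq_cost;
}
```

For instance, a parallel stage three times as large as the sequential one gives an estimate of four threads, while stages of equal size give two, which matches the basic two-stage BDX scheme. Real loops also pay communication and scheduling costs, so the best value has to be determined experimentally.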
3.1.1 BDX Execution Flow
This section describes in detail the execution of a BDX-parallelized version of the example loop shown in Fig. 3.1, starting from its very first iteration. Notice that only stage A has loop-carried dependences (variable ptr), and this variable is communicated between threads through the shared variable Sptr. To support the following explanation, please consider Figures 3.2 (a) to (f). Note that in this example only two threads are used, each represented by two rectangles labeled An and Bn, where n is the thread number. The vertical dotted line emphasizes that these threads might be executing on different cores. In the figure, dark colors represent a stage execution or buffer operation, arrows indicate memory transfers, and numbers on arrows define the sequence in which operations take place. Operations that occur in parallel have the same number.
• Initially, in Fig. 3.2b, the first thread executes as many iterations of A1 as the
size of the batch and stores the dependences of stage B1 into the batch buffer
BdxBufferPtrA.
[Figure 3.2: BDX execution flow example. Each panel shows the stages A1, B1, A2 and B2, the shared variable Sptr and one BdxBufferPtrA per thread; numbered arrows give the order of the operations. Panels: (a) BDX stages and communication. (b) A1 generates batch and loop-carried. (c) A2 (B1) generates (consumes) batch. (d) A1 (B2) generates (consumes) batch. (e) BDX enters steady-state. (f) BDX in steady-state execution.]
• When the first thread finishes executing A1, it communicates, through the shared variable Sptr, the last value of its loop-carried variable and releases the second thread to execute a batch of A2.
• Now in Fig. 3.2c, the second thread, once released to execute stage A2, reads the latest value of that stage's loop-carried variable from Sptr.
• The second thread executes a whole batch of stage A2 and, in the process, fills in its local version of BdxBufferPtrA.
• The second thread writes Sptr with the last value of the loop-carried variable it computed, so that stage A1 can later resume execution on the first thread.
• Still in Fig. 3.2c, the first thread, after releasing the second thread to execute stage A2, starts the execution of stage B1 and reads the loop-independent dependences from its local buffer BdxBufferPtrA. Note that, from this point on, both threads are already executing useful work in parallel.
• Continuing now in Fig. 3.2d, the first thread returns to execute stage A1, and before it starts it reads the latest value of the stage's loop-carried variable from Sptr.
• The first thread again starts the computation of A1 and stores into BdxBufferPtrA the values of the loop-independent dependences for later consumption in stage B1.
• The first thread finishes executing A1 and then updates the shared variable Sptr with the latest value; after that, once the second thread finishes executing stage B2, it may return to execute stage A2.
• While the first thread executes stage A1, the second thread executes stage B2, reading the loop-independent dependence information from its local buffer BdxBufferPtrA.
Finally, BDX enters steady state and, for the following iterations (Fig. 3.2e and Fig. 3.2f), it keeps alternating between executing A2 in parallel with B1 and A1 in parallel with B2. The execution of the loop finishes once enough batches have been executed to satisfy the original condition of the serial loop, which is replicated in the condition of the first stage. Notice that the case where the number of iterations is not a multiple of the batch size is not a problem, since the condition to finish the loop is that all iterations have been executed, not that a given number of batches has been executed. Moreover, once a thread detects that all loop iterations have been executed, it still releases the next thread to continue execution of the loop. This second thread will also notice that all iterations have been executed and then quit the execution of the loop. The main thread also has a portion of code that waits for all threads to notice that all iterations of the loop have been executed.
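The execution flow above can be sketched in C with POSIX threads. This is only an illustration under stated assumptions, not the code produced by the thesis' implementation: the names Sptr, BdxBufferPtrA and IdxB follow the text, while turnA, bdx_worker, bdx_run and the batch size of 4 are hypothetical. For simplicity the loop visits every node starting at the head of the list, and stage B is left unsynchronized because it has no loop-carried dependence in this example.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define BS 4        /* batch size; an arbitrary value chosen for this sketch */
#define NTHREADS 2  /* one thread per stage, as in the basic BDX scheme */

typedef struct Node { int val; struct Node *next; } Node;

static Node *Sptr;        /* shared copy of the loop-carried variable ptr */
static atomic_int turnA;  /* id of the thread currently allowed to run stage A */

static void *bdx_worker(void *arg) {
    int tid = (int)(intptr_t)arg;
    Node *BdxBufferPtrA[BS];  /* thread-local batch buffer filled by stage A */

    for (;;) {
        /* Busy-wait until this thread is released to execute stage A. */
        while (atomic_load_explicit(&turnA, memory_order_acquire) != tid)
            ;
        Node *ptr = Sptr;     /* loop-carried value left by the previous batch */
        int n = 0;
        while (n < BS && ptr != NULL) {  /* stage A: at most BS iterations */
            BdxBufferPtrA[n++] = ptr;
            ptr = ptr->next;
        }
        Sptr = ptr;           /* publish the value, then release the next thread */
        atomic_store_explicit(&turnA, (tid + 1) % NTHREADS, memory_order_release);

        /* Stage B: consume the local buffer while another thread runs stage A. */
        for (int IdxB = 0; IdxB < n; IdxB++)
            BdxBufferPtrA[IdxB]->val += 1;

        if (n < BS)           /* the list ended inside (or before) this batch */
            break;
    }
    return NULL;
}

/* Increments val in every node of the list using NTHREADS cooperating threads. */
void bdx_run(Node *head) {
    pthread_t helpers[NTHREADS - 1];
    Sptr = head;
    atomic_store(&turnA, 0);
    for (int t = 1; t < NTHREADS; t++)
        pthread_create(&helpers[t - 1], NULL, bdx_worker, (void *)(intptr_t)t);
    bdx_worker((void *)(intptr_t)0);  /* the main thread takes part as thread 0 */
    for (int t = 1; t < NTHREADS; t++)
        pthread_join(helpers[t - 1], NULL);
}
```

Calling bdx_run on a list whose length is not a multiple of BS exercises the termination rule described above: the thread that reaches the end of the list still releases the next thread, which finds an empty batch and quits as well.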
3.2 BDX Comparative Analysis
At this point, it is important to make some distinctions clear. While DOACROSS communicates each loop-carried dependence at each iteration, BDX only communicates them after a batch of iterations has been computed. This batched mode is what enables BDX to amortize the communication latency in processors with deep cache hierarchies. BDX also differs from DSWP with respect to communication. While BDX keeps loop-independent variables in a thread-local buffer, DSWP uses an inter-thread queue to communicate them. Interestingly enough, the number of loop-carried variables is typically much smaller than the number of loop-independent variables (see Table 4.2). This observation is a central idea that helps BDX reduce communication overhead.
As explained above, BDX splits the sequential loop body into stages. All stages are present in each thread and the algorithm tries to overlap their execution in time. In the case discussed in the previous section, the original loop was split into two stages, though the scheme can be generalized to more than two stages/cores. Each thread simultaneously executes two batches of iterations (one for each stage), containing BS iterations each (BS is the size of the batch buffer), before communicating the loop-carried dependences and switching execution to another stage.
BDX uses a synchronization mechanism very similar to the one used by DOACROSS, i.e., busy-waiting on a shared flag, and thus the basic comm-instr and mem-instr costs would, in principle, be very similar. Nevertheless, contrary to DOACROSS, loop-carried dependences are communicated only after a batch of iterations has been completed, and thus the comm-instr and mem-instr costs are amortized across BS iterations (see Table 2.1). Moreover, unlike DSWP, which communicates all the inter-stage dependences for all iterations, BDX only communicates the stages' loop-carried dependences, once every BS iterations. As the number of loop-carried dependences is small (usually a single induction variable), communication is reduced considerably.
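The amortization argument can be made concrete with a back-of-the-envelope model. It assumes, purely for illustration, that every synchronization has a fixed cost C_sync (standing for the combined comm-instr and mem-instr costs), that k loop-carried values are exchanged each time, and that the loop runs N iterations; these symbols are not taken from the thesis.

```latex
\begin{align*}
T^{\text{comm}}_{\text{DOACROSS}} &\approx N \cdot k \cdot C_{\text{sync}} \\
T^{\text{comm}}_{\text{BDX}} &\approx \left\lceil \frac{N}{BS} \right\rceil \cdot k \cdot C_{\text{sync}}
\end{align*}
```

Under this simplified model, a batch size of BS = 50 would reduce the number of synchronization events, and hence the communication cost, to roughly 1/50 of the per-iteration scheme.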
Some further considerations are required to fully understand BDX's potential and limitations. First, since BDX exploits parallelism by having each thread execute a different stage simultaneously, the maximum number of threads possible in the system is equal to the number of stages into which the loop was partitioned. Notice that stages are built from SCCs, which cannot be further parallelized unless they have no loop-carried dependences. Thus, if stages are DOALL they can be further parallelized, similarly to PS-DSWP [66, 75, 76]. Of course, additional stage parallelization could also be achieved by using speculative execution, which we plan to investigate in the future.
The second issue in BDX is related to the definition of the size of the batch buffer (BS). Notice that for uncountable loops it is not possible to statically determine the size of the batch buffer, though empirical experiments showed that batch buffers of 50 to 100 iterations produce good results (see Section 4.2). DSWP does not suffer directly from this problem, but it also needs to specify the size of its queue. Nevertheless, as mentioned before, it suffers considerably from the overhead of maintaining its queue (see Section 4.3). In general, parallelizing loops that are small or that have small trip-counts does not pay off for any parallelization algorithm, and should be avoided. The third issue is load balancing. The potential speedup produced by BDX, DOACROSS and DSWP is limited by the execution time of the largest stage, since they are all pipelining algorithms. Moreover, load balancing is a typical problem in parallel programming, and is not specific to any of these algorithms. Therefore, the amount of parallelism created by all three algorithms is limited by the same main factors: number of stages, size of the loop/stages and load balancing. This suggests that an efficient mechanism to handle inter-thread/stage communication will be a key factor to differentiate the performance of each algorithm.
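As a rough illustration of the load-balancing limit, consider an idealized pipeline that ignores communication and synchronization costs altogether (an assumption made only for this estimate). If stage s accounts for a fraction f_s of the sequential loop time and each stage is executed sequentially, the speedup cannot exceed the reciprocal of the largest fraction. The 16%/84% split used in the example is the one reported for Ferret (Large input) in Table 4.3.

```latex
\[
  \text{Speedup} \;\le\; \frac{\sum_{s} f_s}{\max_{s} f_s} \;=\; \frac{1}{\max_{s} f_s},
  \qquad
  f = (0.16,\ 0.84) \;\Rightarrow\; \text{Speedup} \le \frac{1}{0.84} \approx 1.19 .
\]
```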
3.3 BDX Code Generation
In this section we present the algorithm used to perform the code generation of BDX and discuss a few implementation details.
Algorithm 1 shows the BDX code generation algorithm. As mentioned previously, the idea of the algorithm is to split the target loop into a list of stages, each of these stages being a collection of SCCs of the target loop's PDG. Note that each of these stages may contain one or more SCCs. In general, each stage has at least one loop-carried dependence. However, there are some cases where a stage does not have a loop-carried dependence; in this case we call that stage a parallel stage and the loop is parallelized using the BDX variant called PS-BDX. Of course, at least one stage must have at least one loop-carried dependence, otherwise the loop would be an embarrassingly parallel DOALL loop, for which parallelization is trivial.
The algorithm's input is the Control Flow Graph (CFG) of the loop region to be parallelized. The first step in Algorithm 1 (Line 3) is to construct the PDG of the loop. As mentioned before, this graph contains both the data and the control dependences of the loop to be parallelized, and therefore it has all the information needed to determine which portions of the loop may execute in parallel with each other. Note that this graph also contains all information regarding loop-independent and loop-carried dependences.
The second step in the algorithm (Line 4) is to construct what we call the “DAG of SCCs”, which is a digraph constructed from the PDG, where each SCC is collapsed into a single node with all incoming and outgoing edges adjusted properly. For instance, it is common to have an SCC containing the instruction that increments the induction variable as well as the instruction(s) that compute the loop execution condition. During the construction of the “DAG of SCCs”, all nodes of the PDG representing these instructions will be grouped into a single node.
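The collapsing step can be sketched with Tarjan's SCC algorithm on a small adjacency-matrix graph. This is a minimal illustration, not the thesis' CollapseSCCs implementation: the Digraph type, the MAXN limit and the function names are assumptions made for this example. Edges between nodes of the same SCC are kept as a self-edge of the collapsed node, so a collapsed node with a self-edge carries a loop-carried dependence.

```c
#define MAXN 16  /* maximum number of PDG nodes in this toy example */

typedef struct {
    int n;                /* number of nodes */
    int adj[MAXN][MAXN];  /* adj[u][v] != 0 means there is an edge u -> v */
} Digraph;

static int next_index, scc_count, sp;
static int index_of[MAXN], low[MAXN], on_stack[MAXN], stack[MAXN];

/* Recursive step of Tarjan's algorithm, started at node u. */
static void strongconnect(const Digraph *g, int u, int *scc_of) {
    index_of[u] = low[u] = next_index++;
    stack[sp++] = u;
    on_stack[u] = 1;
    for (int v = 0; v < g->n; v++) {
        if (!g->adj[u][v])
            continue;
        if (index_of[v] < 0) {               /* v not visited yet */
            strongconnect(g, v, scc_of);
            if (low[v] < low[u]) low[u] = low[v];
        } else if (on_stack[v] && index_of[v] < low[u]) {
            low[u] = index_of[v];            /* back edge into the current SCC */
        }
    }
    if (low[u] == index_of[u]) {             /* u is the root of an SCC */
        int v;
        do {
            v = stack[--sp];
            on_stack[v] = 0;
            scc_of[v] = scc_count;
        } while (v != u);
        scc_count++;
    }
}

/* Fills scc_of[] and the collapsed graph; returns the number of SCCs. */
int collapse_sccs(const Digraph *g, int *scc_of, Digraph *dag) {
    next_index = scc_count = sp = 0;
    for (int u = 0; u < g->n; u++) { index_of[u] = -1; on_stack[u] = 0; }
    for (int u = 0; u < g->n; u++)
        if (index_of[u] < 0)
            strongconnect(g, u, scc_of);

    dag->n = scc_count;
    for (int a = 0; a < MAXN; a++)
        for (int b = 0; b < MAXN; b++)
            dag->adj[a][b] = 0;
    for (int u = 0; u < g->n; u++)
        for (int v = 0; v < g->n; v++)
            if (g->adj[u][v])
                dag->adj[scc_of[u]][scc_of[v]] = 1;  /* intra-SCC edges become self-edges */
    return scc_count;
}
```

For a three-node graph in which the pointer update and the loop condition depend on each other and both feed the loop body, the two former nodes collapse into one node with a self-edge, while the body remains a separate node without one.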
As the name suggests, the “DAG of SCCs” is a Directed Acyclic Graph (DAG) and, by definition, it does not contain any cycles. This characteristic is quite important for the third step of Algorithm 1 (pipeline creation), because it enables us to create a precise topological order of the nodes. For instance, we can take the root node of the graph and consider it as the first stage in the pipeline; then all nodes that have direct incoming edges from the first stage form the second stage, and so on. The goal of the third step of Algorithm 1 (Line 5) is to implement this and create a sequence of stages by grouping together adjacent nodes of the “DAG of SCCs”. Note that each of these stages may have self-edges, which represent loop-carried variables. They may also have edges coming from preceding stages; however, stages must not be grouped when a stage has an incoming edge from a further stage (a stage closer to the end of the pipeline), since this would form a cycle. In other words, edges are only allowed in one direction: from the beginning to the end of the pipeline.

Algorithm 1: Batched DOACROSS Algorithm
Input : Control Flow Graph of the loop region to be parallelized.
Output: Control Flow Graph of the parallelized loop.
 1  Function BDX(Digraph cfg) is
        /* Store loop information such as increment step and condition */
 2      LoopInfo li = ExtractLoopInfo(cfg);
 3      Digraph pdg = ConstructPDG(cfg);
 4      Digraph dag = CollapseSCCs(pdg);
        /* HInfo contains info about which SCCs to group together. */
 5      Digraph pipeline = ConstructPipeline(dag, HInfo);
        /* Iterate over stages in the order they appear on the pipeline */
 6      foreach stage in pipeline.nodes() do
            /* Transform the stage into a loop. */
 7          promote_stage_to_loop(stage, li, BDXBatchSize);
            /* Add locks to control which thread can execute the stage */
 8          add_stage_synchronization(stage, pipeline.nodes().size());
            /* Iterate over loop-carried dependences (self-edges of the stage) */
 9          foreach edge in stage.self_edges() do
10              create_shared_var(edge.label);
11              add_copy_from_shared(stage, edge.label);
12              replace_use_local(stage, edge.label);
13              add_copy_to_shared(stage, edge.label);
14          end
            /* Iterate over loop-independent, incoming edges. */
15          foreach edge in stage.incoming_edges() do
16              replace_to_use_buffer(stage, edge);
17          end
            /* Iterate over loop-independent, outgoing edges. */
18          foreach edge in stage.outgoing_edges() do
19              create_buffer(stage, edge, BDXBatchSize);
20              replace_to_use_buffer(stage, edge);
21          end
22      end
        /* Adjust surrounding loop to execute batches repeatedly. */
23      adjust_surrouding_loop(pipeline, li);
        /* Reconstruct the Loop CFG using the PDG. */
24      return ReconstructCFG(pipeline);
25  end
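The layering idea described for the pipeline creation step (roots first, then the nodes fed by them, and so on) can be sketched as a level assignment over the collapsed graph. This is a simplified sketch: it places each node one level after its latest predecessor and ignores the grouping heuristics that Algorithm 1 receives through HInfo. The function name assign_stages and the adjacency-matrix representation are assumptions of this example.

```c
#define MAXN 16  /* maximum number of collapsed nodes in this toy example */

/*
 * Assigns a pipeline level to every node of a DAG given as an adjacency
 * matrix (self-edges are allowed and ignored, since they only mark
 * loop-carried dependences). Returns the number of levels, or -1 if the
 * graph contains a cycle between distinct nodes.
 */
int assign_stages(int n, int adj[MAXN][MAXN], int stage_of[MAXN]) {
    int indeg[MAXN] = {0};
    int queue[MAXN];
    int head = 0, tail = 0, stages = 0;

    for (int u = 0; u < n; u++)
        for (int v = 0; v < n; v++)
            if (u != v && adj[u][v])
                indeg[v]++;

    for (int v = 0; v < n; v++) {
        stage_of[v] = 0;
        if (indeg[v] == 0)
            queue[tail++] = v;  /* root nodes form the first stage */
    }

    while (head < tail) {
        int u = queue[head++];
        if (stage_of[u] + 1 > stages)
            stages = stage_of[u] + 1;
        for (int v = 0; v < n; v++) {
            if (u == v || !adj[u][v])
                continue;
            /* v must run after u, so it goes at least one level later */
            if (stage_of[u] + 1 > stage_of[v])
                stage_of[v] = stage_of[u] + 1;
            if (--indeg[v] == 0)
                queue[tail++] = v;
        }
    }
    return (tail == n) ? stages : -1;
}
```

Because every edge goes from a lower level to a higher one, the resulting order satisfies the one-direction rule stated above: no stage receives data from a stage closer to the end of the pipeline.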
Since the BDX algorithm is based on the cooperation of batches and threads to execute all iterations of the original loop, each stage needs to be transformed into a loop that executes a fixed number of iterations; of course, this number needs to be smaller than the total number of iterations of the original loop. This is similar to what is done in Loop-Tiling [1]. Implicit here is the number of threads that will execute the parallel loop, which should be equal to the number of stages into which the loop was split. Therefore, if we have NS stages there will be NS threads, and each of them will execute a different stage at any time. When a thread finishes executing a batch for a stage, it starts the execution of the batch for the next stage, and so on.
Line 6 of Algorithm 1 starts the more complicated part of the algorithm, which applies all the transformations needed to create each stage. This involves batching (Line 7), synchronization (Line 8), loop-carried dependence communication (Lines 9 − 14) and loop-independent dependence buffering (Lines 15 − 21). We assume here that stages are iterated in the order in which they appear in the pipeline.
The function call at line 7 aims to transform a stage into a smaller version of the original loop that iterates for the size BS of the batch buffer; this is specified by the parameter BDXBatchSize in Algorithm 1. Effectively, this step surrounds the stage with a loop that executes from 0 up to BDXBatchSize - 1 or until the loop ending condition is met.
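For a countable loop, the effect of this transformation (together with the adjustment of the surrounding loop done at line 23 of Algorithm 1) resembles strip-mining. The sketch below shows it on a simple array sum; the function names and the batches counter are hypothetical and only serve to make the batching visible, and batch_size is assumed to be greater than zero.

```c
#include <stddef.h>

/* Serial reference: the original loop, one iteration at a time. */
long sum_serial(const int *a, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* Batched version: the stage becomes an inner loop of at most batch_size
 * iterations, and the surrounding loop advances in steps of batch_size. */
long sum_batched(const int *a, size_t n, size_t batch_size, size_t *batches) {
    long total = 0;
    *batches = 0;
    for (size_t base = 0; base < n; base += batch_size) {
        /* runs from 0 up to batch_size - 1, or until the loop condition fails */
        for (size_t j = 0; j < batch_size && base + j < n; j++)
            total += a[base + j];
        (*batches)++;
    }
    return total;
}
```

With 10 iterations and a batch size of 4, the inner loop runs three times, and the last batch stops early after 2 iterations because the original loop condition is no longer satisfied.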
The function call at line 8 adds synchronization instructions around the stage so that no two threads execute the same stage in parallel, except when the stage is a parallel stage. A straightforward way to implement this synchronization is to use an array of flags, where each position holds the ThreadId of the thread that is allowed to execute the corresponding stage.
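One possible shape for such an array of flags, using C11 atomics, is sketched below. The names stage_owner, stage_acquire and stage_release, the round-robin hand-off and the fixed NUM_STAGES are assumptions of this example rather than details given in the text.

```c
#include <stdatomic.h>

#define NUM_STAGES 2

/* stage_owner[s] holds the ThreadId of the thread allowed to execute stage s. */
static atomic_int stage_owner[NUM_STAGES];

void stage_sync_init(void) {
    for (int s = 0; s < NUM_STAGES; s++)
        atomic_store(&stage_owner[s], 0);  /* thread 0 starts every stage */
}

/* Non-blocking check: returns 1 if thread tid may execute the stage now. */
int stage_try_acquire(int stage, int tid) {
    return atomic_load_explicit(&stage_owner[stage], memory_order_acquire) == tid;
}

/* Busy-waits until thread tid is allowed to execute the stage. */
void stage_acquire(int stage, int tid) {
    while (!stage_try_acquire(stage, tid))
        ;
}

/* Hands the stage over to the next thread in round-robin order. */
void stage_release(int stage, int tid, int num_threads) {
    atomic_store_explicit(&stage_owner[stage], (tid + 1) % num_threads,
                          memory_order_release);
}
```

The release/acquire pair also orders ordinary memory accesses, so values written by a thread before stage_release are visible to the thread that subsequently passes stage_acquire.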
After a stage has been transformed into a loop and has had synchronization directives added, what remains is to add instructions to handle the communication of dependences from one stage to another (the loop-independent ones) and of the loop-carried dependences, which need to be communicated from one stage to the same stage executing on a different thread.
At line 9 we handle self-edges, that is, edges that go from a stage to itself and represent loop-carried dependences. As explained in the previous section, these dependences are communicated through a variable that is shared among all threads. Therefore, at line 10, a function is called that creates a shared variable for each self-edge of the stage. At line 11 instructions are added, just before the first instruction of the stage, to copy data from these shared variables to local private variables. This is done so that, when a thread is about to start the execution of a new batch of iterations, it copies the values of all loop-carried variables from the shared variables to local variables, which may happen to be stored in registers. After that, at line 12, the stage is modified to use these local copies instead of the original variables of the loop.
At line 13 instructions that do the opposite are inserted just after the last instruction of the stage. Once a thread finishes executing a batch of iterations, it needs to copy the contents of the loop-carried dependences from the local variables to shared memory, so that the next thread that executes that stage has access to the most recent data and can continue execution from the last iteration run by the previous thread.
At line 15 the algorithm iterates over the incoming edges of the stage, that is, edges that come from loop-independent dependences. These dependences are data produced by a previous stage, not necessarily the immediately preceding one, but some stage earlier in the pipeline. The function call at line 16 changes the stage so that, instead of using the original variables for these dependences, it copies the data from a local buffer of the stage that produces that dependence. This buffer is indexed using the iteration number. For instance, if the consumer stage executing iteration i needs to consume variable A, which has been produced by a previous stage, it will index the producer's buffer for variable A at position i to fetch the data. At the end of this call, all uses of loop-independent variables produced by a previous stage will have been replaced by accesses to local buffers.
In an analogous way, at line 18, we iterate over all edges that are outgoing from the stage, that is, loop-independent dependences for which this stage acts as a producer. Those represent writes to variables that will be read by a subsequent stage. It is important to note that these writes need to match a read for the same iteration. For instance, whenever a stage is executing a batch of iterations it will produce a different value for the variable in each iteration; the consumer stage will also execute the same batch of iterations and will read the respective data produced in each iteration of the producer stage. The function call at line 19 is responsible for creating the buffer that stores these variables, and the call at line 20 transforms the stage so that, instead of writing to the original variables of the dependences, it writes to the stage's local buffer specific to each variable.
It is important to note that the local buffers must have enough space to store values for each iteration of the producer stage. In terms of implementation, this means that the buffer should have a size equal to the number of iterations in the batch. Also note that these buffers are specific to the producer stage; if another stage later produces the same variable, it will write to its own local buffer.
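The matching of writes and reads through a batch-sized local buffer can be sketched as follows. The function names, the squared value used as the produced data, and the batch size of 8 are all hypothetical; position i of the buffer holds the value for the i-th iteration of the current batch.

```c
#define BDX_BATCH_SIZE 8  /* number of iterations per batch (arbitrary here) */

/* Producer stage: executes the batch that starts at iteration first_iter and
 * stores the value produced by each iteration at the matching buffer slot.
 * Returns how many iterations were executed in this batch. */
int producer_stage(int buffer[BDX_BATCH_SIZE], int first_iter, int total_iters) {
    int i;
    for (i = 0; i < BDX_BATCH_SIZE && first_iter + i < total_iters; i++) {
        int iter = first_iter + i;
        buffer[i] = iter * iter;  /* stand-in for the variable produced */
    }
    return i;
}

/* Consumer stage: executes the same batch and, for its i-th iteration,
 * reads slot i instead of the original variable. */
long consumer_stage(const int buffer[BDX_BATCH_SIZE], int count) {
    long sum = 0;
    for (int i = 0; i < count; i++)
        sum += buffer[i];
    return sum;
}
```

Since the buffer has exactly one slot per iteration of the batch, the producer never overwrites a value before the consumer of the same batch has had the chance to read it, provided both stages of a batch run on the same thread, as in BDX.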
Finally, at line 23, the induction variable is adjusted to change the loop from executing iterations one by one to executing in steps equal to BDXBatchSize. At line 24 the CFG of the new loop is created and returned.
To conclude Algorithm 1, one last thing remains to be done: the insertion of code to create threads. In our implementation, we insert the transformed code into a function that represents the parallel loop and then make each new thread execute this new function. The original code of the serial loop must also be replaced with the code produced by Algorithm 1, so that the main thread also participates in the execution of the loop. Therefore, if the loop was partitioned into NS stages, NS − 1 new threads are created to assist the main thread in executing the loop in parallel. In our implementation, we create these new threads at the beginning of the program's execution. Since the parallel loop might be executed many times, we also added code to the new threads to activate and deactivate them whenever a new visit to the parallel loop happens. This toggle mechanism was implemented using a shared flag variable on which the new threads busy-wait to check whether or not they should start the execution of the loop again.
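A toggle mechanism of this kind could look like the sketch below, which uses POSIX threads and C11 atomics. It is an illustration under assumptions, not the thesis' code: instead of a single boolean flag it uses a visit counter (loop_epoch) so that a helper cannot mistake an old activation for a new one, and parallel_loop_body is a placeholder for the function that holds the transformed loop.

```c
#include <pthread.h>
#include <stdatomic.h>

static atomic_int loop_epoch;     /* bumped by the main thread on every visit */
static atomic_int helpers_done;   /* helpers that finished the current visit */
static atomic_int shutting_down;  /* set once, when the program is about to exit */
static atomic_int work_done;      /* placeholder for the work of the parallel loop */

static void parallel_loop_body(void) {
    atomic_fetch_add(&work_done, 1);
}

/* Entry point of a helper thread, created once at program start. */
void *helper_main(void *arg) {
    (void)arg;
    int seen_epoch = 0;
    for (;;) {
        /* Deactivated: busy-wait until the main thread visits the loop again. */
        while (atomic_load(&loop_epoch) == seen_epoch)
            if (atomic_load(&shutting_down))
                return NULL;
        seen_epoch++;
        parallel_loop_body();                /* activated: take part in the loop */
        atomic_fetch_add(&helpers_done, 1);
    }
}

/* Called by the main thread at each visit to the parallel loop. */
void visit_parallel_loop(int num_helpers) {
    atomic_store(&helpers_done, 0);
    atomic_fetch_add(&loop_epoch, 1);        /* activate the helpers */
    parallel_loop_body();                    /* the main thread participates too */
    while (atomic_load(&helpers_done) != num_helpers)
        ;                                    /* wait for all helpers to finish */
}

void stop_helpers(void) {
    atomic_store(&shutting_down, 1);
}
```

The main thread waits for every helper before returning from a visit, so the counter never runs more than one step ahead of a helper; busy-waiting keeps the activation latency low at the price of occupying a core while the helpers are idle.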
3.3.1 BDX in Clang + OpenMP
The code generation algorithm just described is being implemented in Clang+LLVM by another student of our research laboratory. The idea is to extend the specification and an implementation of OpenMP to enable the programmer to choose between different algorithms for loop parallelization; in particular, the user will also be able to choose BDX and PS-BDX.
Listing 3.1 shows the example loop (see Figure 2.1a) annotated with an OpenMP pragma. Notice that in this version the user employs the new clause use to tell the compiler to parallelize the loop using BDX or PS-BDX with a batch size of 50 iterations.
The latest versions of the OpenMP Specification [64] have gradually added support for the parallelization of DOACROSS loops by adding new clauses such as ordered and sections. Using these clauses, the user can specify regions of the loop (which may be understood as stages) that should be executed in loop iteration order, and convey important data dependence information to the compiler. We believe that this corroborates this thesis' view on the importance of parallelizing DOACROSS loops and of letting the programmer experiment with different parallelization algorithms.
    #pragma omp parallel for use(bdx: 50)
    while (ptr = ptr->next) {
        ptr->val = ptr->val + 1;
    }
Listing 3.1: Illustration of example loop annotated with OpenMP+BDX
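For comparison with the proposed use clause, the DOACROSS support that standard OpenMP already offers (since version 4.5) can be sketched with the ordered(1) clause and the depend(sink)/depend(source) forms of the ordered directive. The running-sum loop and the function name below are made up for this illustration; the code needs an OpenMP-enabled compiler (for example, GCC with -fopenmp) to run in parallel, and it computes the same result serially when the pragmas are ignored.

```c
/* out[i] = in[0]^2 + ... + in[i]^2, with a cross-iteration dependence on out[i-1]. */
void running_sum_doacross(const int *in, long *out, long n) {
    if (n <= 0)
        return;
    out[0] = (long)in[0] * in[0];
    #pragma omp parallel
    {
        #pragma omp for ordered(1)
        for (long i = 1; i < n; i++) {
            /* This part has no cross-iteration dependence and may overlap. */
            long squared = (long)in[i] * in[i];

            /* Wait until iteration i - 1 has signalled completion. */
            #pragma omp ordered depend(sink: i - 1)
            out[i] = out[i - 1] + squared;
            /* Signal that iteration i has produced its value. */
            #pragma omp ordered depend(source)
        }
    }
}
```

Here the dependence is synchronized at every iteration, which is the per-iteration communication pattern that BDX aims to amortize by batching.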
Chapter 4
Experimental Results
In this chapter, we show the results of experiments we have carried out to characterize the performance of BDX and PS-BDX in comparison to that of DOAX, DSWP and PS-DSWP. In the first section, we present the methodology used in the experiments. Next, in Section 4.2, we make a thorough analysis of BDX's sensitivity to its main parameters: stage size and batch size. Then, in Section 4.3, we show results quantifying the fraction of the parallelized loops' execution time spent in handling communication. Next, in Section 4.4, we show how features of the underlying architecture can affect the performance of parallelized programs. Subsequently, we show that batching is an effective way to mitigate inter-thread communication overhead and reduce false sharing. Finally, we show the best speedup obtained by each algorithm and analyze the performance improvements obtained for the applications.
4.1 Methodology
To analyze the performance of the algorithms across different modern architectures, all experiments were run on two target architectures (Intel and ARM); detailed information about these architectures is shown in Table 4.1. Programs were manually parallelized and compiled with GCC 4.8 (Cross GCC 4.6) with optimizations set at -O3 and hardware floating point support enabled. Profiling information was collected using Intel VTune, Linux Perf and ARM Streamline.
                     Intel                   ARM
Processor            Intel Ivy Bridge        ARM A9 MPCore
Core Count           4 Cores (8 HT Cores)    4 Cores
Clock Frequency      3.40GHz                 1.4GHz
Cache L1I and L1D    32KB, 32KB              32KB, 32KB
Cache L2             256KB                   128KB
Cache L3             8MB                     N/A
RAM Memory           8GB                     1GB
Operating System     Ubuntu 14.04            Android 4.1.2
Table 4.1: Target architectures detail.
Program     Stages      SccCar      InScc   SEQ   BDX   DOAX   DSWP
Ferret      S-S         0†-0†       3       140   148   177    462
Bodytrack   S-S         0†-0ψ,†     2       513   614   641    903
Bzip2       S-S         1†-1†       2       24    46    64     104
            S-S         1†-1†       2       21    51    55     98
Hmmer       P-S         0-1         1       67    106   91     135
Mcf         S-P         1-0         3       31    36    63     118
H263Dec     P-S         0-1         10      95    139   131    248
            P-S         0-1         6       109   156   146    287
KS          S-P         4-0         4       147   198   183    317
Otter       S-P-S       1-0-3       1       124   168   152    172
Rotate      S-P-S       1-0-1       1       237   307   297    264
Rot-CC      S-P-S       1-0-1       1       206   300   298    267
Ray-Rot     S-P-P-P-S   1-0-0-0-1   1       291   322   365    318
RGBYUV      S-P-P-S     1-0-0-1     1       128   320   198    155
† This stage performs I/O operations.
ψ The state of this stage is stored in a complex object.
‡ Function inline was applied in this loop.
Table 4.2: Characteristics of Parallelized Programs and Their Loops. The SccCar column shows how many loop-carried dependences each stage has. The InScc column shows the maximum number of loop-independent dependences between two stages. The number of x86 instructions of the parallelized loop is shown, for each algorithm and loop, in the last four columns.
Program     Input           Loop*    % Exec   Visits/TC       L-Balancing
Ferret      Large           Loop-1   99%      1 / 256         16% 84%
            Native          Loop-1   99%      1 / 3500        10% 90%
Bodytrack   Large           Loop-1   92%      1 / 4           12% 88%
            Native          Loop-1   92%      1 / 260         15% 85%
401.Bzip2   Ref-pro         Loop-1   79%      3 / 58721       10% 90%
                            Loop-2   21%      3 / 58721       90% 10%
            Ref-sou         Loop-1   84%      3 / 58721       10% 90%
                            Loop-2   16%      3 / 58721       90% 10%
            Ref-tex         Loop-1   92%      3 / 58721       10% 90%
                            Loop-2   8%       3 / 58721       90% 10%
456.Hmmer   Ref-Swi         Loop-1   94%      ~45M / 300      39% 61%
            Ref-Ret         Loop-1   87%      ~277M / 100     40% 60%
            Train           Loop-1   92%      ~40M / 100      40% 60%
            Test            Loop-1   57%      ~15M / 10       65% 35%
429.Mcf     Ref             Loop-1   47%      ~21M / 300      90% 10%
            Train           Loop-1   34%      ~750K / 300     90% 10%
            Test            Loop-1   45%      ~234K / 300     85% 15%
H263Dec     Large           Loop-1   58%      ~69M / 704      44% 56%
                            Loop-2   34%      ~42M / 576      59% 41%
            Medium          Loop-1   57%      ~23M / 704      43% 57%
                            Loop-2   33%      ~14M / 576      59% 41%
            Small           Loop-1   57%      ~2.3M / 704     45% 55%
                            Loop-2   33%      ~1.4M / 576     62% 38%
KS          KL-5            Loop-1   99%      7750 / 5271     8% 92%
Otter       Two-Inv         Loop-1   32%      990k / 327      2% 94% 4%
Rotate      Berlin-Garten   Loop-1   100%     1 / 9           15% 63% 22%
Rot-CC      Berlin-Garten   Loop-1   100%     1 / 9           16% 75% 9%
Ray-Rot     Scene           Loop-1   100%     1 / 24          4% 3% 82% 7% 4%
RGBYUV      Berlin-Garten   Loop-1   100%     1 / 10          2% 88% 9% 1%
* Loop-1/Loop-2 refers to the order they are presented in Table 4.2.
Table 4.3: Runtime information about the loops parallelized and their stages
Experiments were done on a set of 12 benchmark programs (14 loops), ranging from audio decoding to image search: Ferret and Bodytrack are from Parsec-3.0 [11], H263Dec is from Mediabench-2 [33], 401.Bzip2, 429.Mcf and 456.Hmmer are from SPEC CPU 2006 [41], KS and Otter are from the Liberty Group [78], and Rotate, Rot-CC, Ray-Rot and RGBYUV are from Starbench [4]. Table 4.2 shows details about the programs and loops that were parallelized. The first columns of this table show the program name and information about how the loop was partitioned into stages. Column Stages shows into how many stages the loop was partitioned and the type of each stage (S-Sequential, P-Parallel). The next column, SccCar, shows the number of loop-carried variables present at each stage, and column InScc shows the number of inter-SCC dependences. The last four columns show, for the serial version and for each parallelization algorithm, the number of x86 instructions of the loop; instructions of callee functions were not included.
These programs were selected based on several constraints. First, we looked for programs that have hot loops¹ which could be parallelized using the pipeline model. Furthermore, we also considered programs that could be ported to an ARM architecture and that have feasible memory footprints and execution times on both desktop and mobile architectures. This was done because we want to evaluate the sensitivity of the algorithms across different architectures and platforms.
Many programs were parallelized using two stages. This is a consequence of the small number of (source) instructions in these loops, and of the heavy use of pointers and arrays, which hinders static dependence analysis. Several stages are sequential because they have I/O operations in their loop bodies. Nevertheless, eight of these programs contain at least one parallel stage. One key feature found in almost all loops deserves special attention: the number of loop-carried dependences is typically smaller than the number of inter-SCC dependences. This insight is central to the design of BDX. By keeping inter-SCC dependences local and only communicating loop-carried dependences, BDX reduces the number of required communication operations.
In our experiments, we found that the intrusiveness of the algorithm (i.e., how much it modifies the source of the original loop) is an important aspect to consider when parallelizing fine-grained and/or imbalanced loops, particularly when the target is a mobile architecture such as the ARM A9 MPCore. The last four columns of Table 4.2 list the number of instructions in the body of the loop for the sequential version and for each studied algorithm. As an estimate of the intrusiveness of an algorithm, we consider the difference between the number of instructions of the loop parallelized using that algorithm and the number of instructions in the sequential loop body (column SEQ). That is, the more instructions the algorithm adds, the more intrusive it is. Hence, as shown in Table 4.2, BDX and DOAX are the algorithms that add the fewest instructions to the original loop and therefore require the least changes to the original code. Usually DOAX requires fewer instructions because it does not need instructions to manipulate buffers or update buffer indexes. However, it requires instructions to propagate synchronization flags in both directions of every conditional branch. DSWP considerably increases the loop body due to the instructions needed to manage (synchronize) the access to the inter-stage queues and to copy data. This overhead is even more noticeable for coarse-grained C++ programs that need to communicate complex objects through the queue (this is the case of Ferret and Bodytrack).
¹ Loops which account for a large share of the program's execution time.
Table 4.3 shows runtime characteristics of all programs/loops we parallelized. The first three columns identify the loop that was parallelized; Loop-1/Loop-2 refers to the order in which the loops were presented in Table 4.2. Column % Exec shows the percentage of the program's execution time that the loop accounts for, while column Visits/TC presents the number of times that the loop is executed and its trip-count (i.e., average number of iterations). The last column, L-Balancing, shows the percentage of the loop's total execution time that each stage of the parallelized loop represents.
The motivation for showing the data in Table 4.3 is twofold. First, it helps to understand the results that we are going to present in the next sections. For instance, one can observe that there is a high variance in execution time coverage, load balancing and loop trip-count across different inputs of the same program. Thus, one can hardly expect the same speedup across different inputs. For some programs (e.g., Hmmer, Mcf), even the direction of the imbalance changes with different inputs. Second, these programs stress distinct aspects of the parallelizing algorithms. Some loops are well balanced and others are imbalanced; some loops have high trip-counts or loop visits, others do not. Being able to produce speedups (or at least not hurt performance) under these conditions is important when one considers using these algorithms for automatic parallelization.
Each algorithm was applied by strictly following the descriptions presented by its authors [22,66]. Moreover, for the sake of comparing different batching implementations, we also show results for a “DOACROSS-Unrolled” algorithm, which basically consists of statically unrolling the loop and then parallelizing it using traditional DOAX. We will refer to this DOAX version simply as DOAX-U. Whenever we refer to the DOAX batch size or unroll factor, we are referring to the number of times that the loop was unrolled in DOAX-U. We have also experimented with DSWP using several queue implementations and optimizations; the results we show were obtained using the best portable software queue implementation that we found (MCRingBuffer [57]) and variable queue sizes (which change across benchmarks).
4.2 Sensitivity Analysis
In this section, we analyze the performance of BDX as a function of its parameters and
loop properties. The goal is to provide quantifiable results that will help a potential user
to choose BDX parameters and provide insights on how applicable the algorithm will be
in practice.
4.2.1 Batch Size and Stage Size
Figure 4.1a shows BDX performance as a function of Batch Size and Stage Size.
Figure 4.1b is a close-up of the bottom-left quadrant. Colors close to red represent
larger speedups. For this experiment we used a synthetic benchmark with two perfectly
balanced stages. Its loop does not have any loop-carried or loop-independent dependences;
thus, both stages only execute their payload, which is simulated using an empty
for loop. Although the loop does not have loop-carried variables, synchronization
directives were still added to account for the overhead they would have in a real-world
benchmark. The loop executes 1,000,000 iterations and the results reported are an average
of 10 executions.
Figure 4.1: Performance speedup of BDX as a function of batch size and stage size. Colors
closer to red represent larger speedups. (a) Speedup Variation (x axis: Batch Size, up to
500; y axis: Stage Size, up to 100). (b) Close-up on bottom-left corner (x axis: Batch
Size, 10 to 100; y axis: Stage Size, 2 to 20).
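The structure of such a two-stage batched loop can be sketched in a few lines. This is
only an illustration of the ordering that batching imposes, not the implementation used
in the experiments: the function names are hypothetical, a condition variable stands in
for the synchronization flags, and the toy stages carry a running value each (unlike the
empty payload above) so that the result can be checked against a sequential run. Because
of Python's global interpreter lock the sketch demonstrates correctness of the ordering
only, not speedup.

```python
import threading


def run_batched(n_iters, batch_size, stages, n_threads=2):
    """Run every stage over iterations 0..n_iters-1, one batch at a time.

    Batches are dealt round-robin to the threads. Before a thread runs
    stage s on batch b it waits until stage s has finished batch b-1, so
    each stage still visits the iterations in their original order.
    """
    n_batches = (n_iters + batch_size - 1) // batch_size
    finished = [0] * len(stages)  # finished[s]: batches completed by stage s
    cond = threading.Condition()  # stands in for the synchronization flags

    def worker(tid):
        for b in range(tid, n_batches, n_threads):
            lo, hi = b * batch_size, min((b + 1) * batch_size, n_iters)
            for s, stage in enumerate(stages):
                with cond:
                    cond.wait_for(lambda: finished[s] == b)
                for i in range(lo, hi):
                    stage(i)
                with cond:
                    finished[s] = b + 1
                    cond.notify_all()

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def batched_checksum(n_iters, batch_size):
    """Two toy stages, each with its own loop-carried variable (hypothetical)."""
    buf = [0] * n_iters  # per-iteration buffer passing values between stages
    carried = {"total": 0, "digest": 0}

    def stage0(i):
        carried["total"] += i
        buf[i] = carried["total"]

    def stage1(i):
        carried["digest"] = (carried["digest"] * 31 + buf[i]) % 1_000_003

    run_batched(n_iters, batch_size, [stage0, stage1])
    return carried["digest"]


def sequential_checksum(n_iters):
    """Reference result: the same two computations in a plain serial loop."""
    total = digest = 0
    for i in range(n_iters):
        total += i
        digest = (digest * 31 + total) % 1_000_003
    return digest


print(batched_checksum(1000, 40) == sequential_checksum(1000))
```

With a larger batch size each thread waits on the shared state less often per iteration,
which is the effect the experiment in this section measures.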
A first observation is that BDX produces speedups in most of the scenarios analyzed,
as shown by the predominance of shades of red. Notice that the largest possible speedup
in this case is 2x, which is obtained with large batches and/or large stages.
We also notice that for tiny stages, around 1 to 3 instructions, the performance of
the parallel loop is very poor and a large batch size must be used. Nevertheless, even for
small stages, about 4 to 10 instructions, a small batch size (> 25) is enough to prevent
slowdowns and even produce speedups.
By analyzing Figure 4.1a (and the raw data provided in Appendix A) we see that
stage size has more impact on performance than batch size. This is the reason for the
predominance of blue across the whole range of batch sizes at the bottom of the
figure.
One important conclusion of this experiment is that BDX can produce good speedups
if the target loop has a sufficient number of iterations to accommodate large batches
and supports a well-balanced partitioning into stages. Overall, for BDX to produce good
speedups, stage sizes should be larger than 40 instructions and the loop should have
enough iterations to support batch sizes larger than 100 iterations. Consistent slowdowns
were only observed in extreme cases where the stage size was tiny (less than three
instructions) or the batch size was quite small (less than 10 iterations).
4.2.2 Effect of the number of Loop-Independent dependences
Figure 4.2a shows BDX performance as a function of the number of loop-independent
dependences (variables) and the number of times each variable is used.
For this experiment we used several synthetic benchmarks, all of them with two perfectly
balanced stages. The only difference between these benchmarks is the number
of loop-independent variables and the number of times that each variable is used. As before,
all stages have a simulated payload and are surrounded by synchronization directives. In
this scenario the stage size was fixed at 10 and the batch size at 40; these numbers were
chosen based on the results presented in the previous section, looking for values that
produced stable speedups. As before, each loop executes 1,000,000 iterations and the
results reported are an average of 10 executions.
Figure 4.2: Performance speedup of BDX as a function of the number of loop-independent
dependences and uses. Colors closer to red represent larger speedups. (a) Speedup
Variation (x axis: Number of Variables, 2 to 20; y axis: Number of Uses, 2 to 20).
(b) Close-up on bottom-left corner (x axis: Number of Variables, 2 to 10; y axis: Number
of Uses, 2 to 10).
For instance, the point (8, 10) is filled with a color representing the speedup of a
parallelized benchmark that contains 8 loop-independent variables, each of which is
used 10 times inside each stage. As before, colors close to red represent larger speedups.
Figure 4.2b is a close-up on the region of at most 10 variables and 10 uses.
The first thing to notice is that the figure is quite symmetric with respect to the
diagonal from bottom-left to upper-right. This indicates that increasing the number of
uses of a variable, or increasing the number of variables while keeping the same usage,
affects performance in a comparable way.
Another observation is that the baseline configuration of these benchmarks produced
a speedup of about 1.25x, as shown in the previous section and Appendix A. However,
in this experiment, by just increasing the number of variables and/or uses, speedups as
large as 1.80x can be obtained. This suggests that more variables and/or uses have the
same effect on performance as a larger number of instructions operating on the same
variables (or even registers), and can therefore help amortize communication overheads.
We can also observe that, contrary to what one might expect, a substantial number
of variables or variable uses is not a problem for BDX. Admittedly, for a small number of
variables (around 6 to 8) performance is reduced, but for more variables performance
increases considerably. The explanation is that in the serial version of the program
the compiler can keep some of these variables in registers, which produces very fast serial
execution; BDX therefore performs relatively worse, since these variables are now
stored in memory. However, when the number of variables increases, the compiler cannot
keep all variables in registers for the serial version and/or must insert instructions to
handle variable spills, which reduces the performance of the serial version but, on the
other hand, improves the relative performance of BDX. The bottom line is that when the
loop uses more variables than the number of available registers, and the compiler
therefore needs to insert spill code, the performance of the serial program is degraded.
This favors the relative performance of BDX, which always uses memory to enable
inter-stage and inter-thread communication.
This result is important because it could also be argued that, since BDX uses buffers
to store these variables, accessing them could produce large overheads due to
indexing and locality. The results do not indicate this and in fact support the idea that
the more variables the loop uses/reuses, the better the speedup. Overall, we see that
above 5 variables with 5 uses each the stage is sufficiently large to amortize
communication overhead.
Figure 4.3: Performance speedup of BDX as a function of the number of loop-dependent
dependences and uses. Colors closer to red represent larger speedups. (a) Speedup
Variation (x axis: Number of Variables, 2 to 16; y axis: Number of Uses, 2 to 20).
(b) Close-up on bottom-left corner (x axis: Number of Variables, 2 to 10; y axis: Number
of Uses, 2 to 10).
4.2.3 Effect of the number of Loop-carried dependences
Figure 4.3a shows BDX performance as a function of the number of loop-dependent de-
pendences (variables) and the number of times each variable is used. For instance, the
point (6, 8) is filled with a color representing the speedup of a parallelized benchmark that
contains 6 loop-dependent variables (half in each stage, for all cases), where each of them
is used 8 times inside each stage. As before, colors close to red represent larger speedups.
Figure 4.3b is a close-up on the region of at most 10 variables and 10 uses.
For this experiment several synthetic benchmarks, all of them with two stages, per-
fectly balanced, were used. The only difference between these benchmarks is the number
of loop-dependent variables and the number of times that each variable is used. As be-
fore all stages have a simulated payload and are surrounded by synchronization directives.
Notice that in this experiment there is a real need for the synchronization directives, since
the stages have loop-carried dependences. Furthermore, instructions were inserted before
and after each stage to copy from/to the shared global variables. As in the previous ex-
periment, stage size was fixed at 10 and the batch size at 40. As before, the loop executes
1,000,000 iterations and the results reported are an average of 10 executions.
Figure 4.4: Communication and computation execution time partitioning. Each bar shows
the split of execution time (0% to 100%) between Computation and Communication for BDX,
DOAX-U and DSWP on Ferret-Simlarge, Bodytrack-Simlarge, Bzip2-Program, Rotate, RotCC,
RgbYuv, MCF-Train, RayRot, Hmmer-Train, H263Dec-Medium and H263Dec-Small. Benchmarks are
grouped by Imbalanced Stages / Balanced Stages and Coarse Grained / Fine Grained.
Results for this experiment also show a symmetric pattern like the one observed in
Figure 4.2a. Nevertheless, we note that there is a larger overhead for a small number
of variables/dependences as compared to the results of the previous section; this is
evident from the larger blue area near the bottom-left corner of the grid. Moreover, if we
compare the top-right part of Figure 4.2a and Figure 4.3a we notice that the overall
speedups obtained in this section were smaller: there is a predominance of pale colors in
the latter figure.
The larger overhead for a small number of variables/uses results from the execution
of instructions to fetch volatile shared variables and to store them in local variables.
As a consequence, a larger number of uses of the variable is necessary to enlarge the
stage size and amortize this communication overhead. Situations where there are several
loop-carried dependences but few uses of them may suggest that the static dependence
analysis was not precise enough, and thus an improvement in this direction would improve
the performance of the parallel loop. Another observation worth mentioning at this point
is that the number of loop-carried dependences is usually smaller than the number of
loop-independent dependences.
Overall, by looking at Figure 4.3a one can say that good speedups can be achieved
when at least 6 uses of 6 loop-dependent variables are present in the loop body.
One conclusion from this set of sensitivity experiments is that increasing the size of
the stages, be it by using a larger batch size or a larger number of uses of
loop-independent or loop-dependent variables, helps to speed up the parallel loop.
4.3 Communication Overhead - Quantitative Analysis
Our first experiment in this section aims at measuring the impact of communication and
synchronization of each algorithm on the performance of parallelized applications.
Figure 4.4 shows, for each parallelized loop and algorithm, the partitioning of the
execution time (i.e., cycle count) between computation and communication. The fraction at
the top of each bar represents the percentage of the Central Processing Unit (CPU) time of
the loop spent executing instructions to communicate data and/or perform synchronization
between threads. The fraction at the bottom represents the percentage of CPU time that
the parallel version took to execute instructions corresponding to the original loop body,
that is, instructions not related to the parallelization algorithm. This information was
collected from the smallest execution time that each algorithm took to execute the
parallelized loops. As the figure shows, communication management is a prominent issue in
all algorithms. This experiment not only confirms that, but also enables one to compare,
side by side, the overhead imposed on each algorithm when dealing with
communication/synchronization. Notice from the graph that the smallest communication
fraction was reached by the BDX-parallelized version of H263Dec, which spent
approximately 30% of CPU time handling communication. Overall, BDX consistently presented
the smallest communication overhead for all programs, followed by DOAX-U. Next, we comment
on the most relevant details about each program.
For imbalanced coarse-grained benchmarks (Ferret, Bodytrack, Bzip2, Rotate, RotCC
and RgbYuv) the communication fraction is relatively smaller and approaches 50% of
the total. This can be explained by the fact that these programs are highly imbalanced.
Therefore, while one thread is executing useful work on the “heavy” stage the other thread
is essentially idle, but nonetheless keeping the CPU busy checking for synchronization.
The net effect is that the time spent in communication (mostly for synchronization) was
almost the same as the time used to perform effective computation.
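The "approaches 50%" figure can be reproduced with a small idealized calculation. The
sketch below assumes (this is a simplification, not a measurement from the thesis) that
the pace of the parallel loop is set by its largest stage and that every thread stays
busy, either computing or spinning on a synchronization flag, for the whole run. The
function name and the 95%/5% stage split are hypothetical.

```python
def busy_wait_fraction(stage_shares):
    """Fraction of total CPU time spent waiting instead of computing.

    Assumes one thread per stage, a steady-state pace dictated by the
    largest stage, and threads that spin while they wait (hypothetical
    idealization that ignores any other communication cost).
    """
    useful = sum(stage_shares)                         # work of one serial pass
    occupied = len(stage_shares) * max(stage_shares)   # CPU time kept busy
    return 1.0 - useful / occupied


# Hypothetical two-stage loops: one highly imbalanced, one perfectly balanced.
print(round(busy_wait_fraction([95, 5]), 3))
print(round(busy_wait_fraction([50, 50]), 3))
```

Under these assumptions a highly imbalanced two-stage loop spends close to half of its
CPU time waiting, while a perfectly balanced one spends none.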
Another aspect that deserves attention is that BDX and DOAX-U seemed to be
more sensitive to highly imbalanced stages. This was particularly evident in Ferret. For
the BDX and DOAX-U parallelized versions of this benchmark, we observed that the
communication time among all threads was frequently larger than the communication
time of the version parallelized with DSWP. A careful analysis led us to conclude that in
these cases DSWP does a decent job of hiding communication latency, as DSWP parallelized
loops can keep one of the threads/cores busy most of the time by executing the same (in
this case large) stage. In BDX and DOAX-U, however, the code for all stages is present
in all threads and thus puts more pressure on the caches and other hardware units that
exploit code locality.
For programs with fine granularity loops (Hmmer, MCF, RayRot and H263Dec) the
execution time of the DSWP parallelized version was dominated by the cost of executing
instructions to manage the queue. Hmmer-Test was the only case where DSWP did
relatively well. However, for this input the target loop has a constant trip count of only
10 iterations, which leaves little room for creating large batches and exploiting the
potential of BDX. The Hmmer-Train input provides a much larger trip count and is more
representative of what happens in fine granularity loops.
Overall, among all algorithms, BDX presented the largest share of computation in
almost all benchmarks, with the biggest differences occurring in H263Dec, Hmmer and
MCF. BDX was followed by DOAX-U; since DOAX-U is an unrolled version of the original
loop, it has the potential to expose more optimizations to the compiler (and in fact we
have measured that it improved, for instance, ILP in a few cases). Nevertheless, as shown
in the next subsection, this can be adversely affected by the increased code size of
DOAX-U.
Figure 4.5: ARM & Intel speedups over sequential execution for BDX and DOAX-U loops.
Panels: H263Dec-Small, MCF-Train and Ferret-Simlarge, for BDX (top row) and DOAX-U
(bottom row). x axis: batch size (1x, 2x, 5x, 10x, 50x); y axis: speedup; one line each
for Intel and ARM.
4.4 Fast Computation vs Fast Communication
As shown in the previous section, communication/synchronization overhead accounts for
a large fraction of the parallelized loop execution time. Therefore, since different
architectures can have very different characteristics in terms of computational power and
communication latency, one should analyze the impact that different architectures have
on the execution of loop parallelization algorithms. In this section, we present results
to help answer that question. In this experiment, we only used BDX and DOAX-U, as they
showed the best computation/communication results of all three algorithms. Figure 4.5
presents the speedup (y axis) reached when using BDX (top row) and DOAX-U (bottom
row) to parallelize programs, considering different batch sizes (x axis), for both
ARM (red line) and Intel (blue line) targets. For the experiment, three programs were
selected to represent the Coarse-Imbalanced (Ferret-Simlarge), Fine-Imbalanced (MCF-Train)
and Fine-Balanced (H263Dec-Small) benchmark categories.
In Figure 4.5, for the fine-granularity representatives and small batch sizes, the
speedup of both algorithms on the ARM architecture was greater than its counterpart
on Intel. This is probably due to the differences in the communication and computation
capabilities of these two architectures, which show up more clearly in fine-granularity
loops. The computational performance of a typical ARM core (measured by its average CPI)
is lower than that of an Intel core [13]. On the other hand, the latency to forward cache
lines during a contested access (communication) between cores is larger in the Intel
architecture than on an ARM target [7, 45]. Therefore, for small batch sizes, Intel cores
spend more time doing synchronization than ARM cores. However, for bigger batch sizes the
powerful Intel processor plays a more important role in the execution time than the cache
coherency system, while an ARM core takes more time to execute the whole iteration batch.
In other words, with smaller batch sizes the program needs to communicate more often, and
so ARM cores perform better. On the other hand, when the number of communication
operations decreases and the loop body is larger (coarse grained), instruction execution
capability is more important and the Intel core performs better.
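The trade-off described above can be illustrated with a toy cost model. It assumes
(hypothetically) that every iteration costs a fixed number of compute cycles, that every
batch pays a fixed communication cost, and that the threads otherwise overlap perfectly.
The two parameter sets are invented purely to show the trend; they are not measurements
of any ARM or Intel processor.

```python
def batched_speedup(batch_size, compute_per_iter, comm_per_batch, n_threads=2):
    """Idealized speedup over the serial loop on the same machine.

    useful work per batch = compute_per_iter * batch_size
    overhead per batch    = comm_per_batch (paid once, regardless of size)
    """
    useful = compute_per_iter * batch_size
    return n_threads * useful / (useful + comm_per_batch)


# Hypothetical cost pairs (in cycles), chosen only to illustrate the shape.
slow_core_cheap_comm = {"compute_per_iter": 40, "comm_per_batch": 40}
fast_core_costly_comm = {"compute_per_iter": 10, "comm_per_batch": 60}

for b in (1, 2, 5, 10, 50):
    slow = batched_speedup(b, **slow_core_cheap_comm)
    fast = batched_speedup(b, **fast_core_costly_comm)
    print(f"batch {b:>2}: slow core {slow:.2f}x, fast core {fast:.2f}x")
```

In this model only the ratio between communication cost and per-iteration compute cost
matters: the machine with the lower ratio gets the better relative speedup at small batch
sizes, and both curves approach the thread count as the batch grows.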
One can also measure batch size in terms of loop body size. That is, smaller batch
sizes relate to smaller loop bodies and larger batch sizes correspond to coarse granularity
loops. When analyzing the results of Figure 4.5, one can conclude that, in fact, the
number of loops that BDX and DOACROSS can successfully speed up is much larger on the
ARM architecture than on Intel. Another way to interpret these results is the following.
Given that loop parallelism demands a large amount of communication, architectures
with expensive inter-core communication, like Intel, considerably limit the applicability
of these algorithms, particularly for small loop bodies. On the other hand, loops
containing large loop bodies do sufficient work at each iteration to “amortize” the
communication overhead, and thus batching does not have much effect.
However, for larger batch sizes, Intel's powerful processor considerably reduces the batch
execution time, while an ARM core takes more time to execute the same iteration batch.
Moreover, for larger batch sizes communication frequency decreases, further reducing its
overhead.
Finally, although the overall results of both algorithms were similar, one difference was
very evident in our experiments, mainly on ARM: the DOAX-U implementation did not perform
well when large unroll factors were applied to coarse grained loops. In
fact, as we increased the batch size (see Ferret for instance) the speedup decreased. The
reason is that in DOAX-U, as the loop body increases (bigger batch sizes), more pressure
is put onto the instruction and unified caches, thus limiting speedup. On the other hand,
unlike DOAX-U, BDX does a form of dynamic unrolling, making its performance better for
this type of loop.
4.5 Reducing Communication Overhead
Now that we have analyzed the communication/computation trade-off in two different
architectures, one should devise ways to reduce communication overhead in such
algorithms. Figure 4.6 shows, again for BDX and DOAX-U (blue and red line, respectively),
the number of coherency-initiated cache lines forwarded from core to core (y axis, in log
scale), for several values of batch size (x axis), when executing on Intel (top row) and
ARM (bottom row). Cache line forwarding is needed whenever a thread executing on
a core requires a cache line that is present in a private cache of another core. Usually
the cost of such an operation is much smaller than the cost of fetching data from DRAM;
however, it is still an operation that takes a few dozen cycles, and is usually the cause
of the communication overhead shown in Figure 4.4.
Although we show here only the number of forwarded cache lines, we observed the
same effect for several other architecture counters, such as the number of memory accesses
and L1/L2 cache hits and misses (see footnote 2). Overall, the figures reveal a dramatic
reduction in the number of contested cache line accesses as we increase batch size.
Figure 4.6: Number of core-to-core cache lines forwarded as a function of batch size.
Panels: H263Dec-Small, MCF-Train and Ferret-Simlarge, on Intel (top row) and ARM (bottom
row). x axis: batch size (1x, 2x, 5x, 10x, 50x); y axis: cache lines forwarded (log
scale); one line each for BDX and DOAX-U.
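A back-of-the-envelope count shows why batching reduces this traffic. The sketch assumes
(as an illustration, not as a description of the measured hardware events) that each
stage hands its synchronization flag to another thread once per batch, so the number of
hand-offs is the number of batches times the number of stages. The function name and the
iteration count are arbitrary.

```python
def flag_handoffs(n_iters, batch_size, n_stages):
    """Number of cross-thread flag hand-offs under the one-per-batch assumption."""
    n_batches = (n_iters + batch_size - 1) // batch_size  # ceiling division
    return n_batches * n_stages


# Arbitrary example: one million iterations, two stages.
for batch_size in (1, 2, 5, 10, 50):
    print(batch_size, flag_handoffs(1_000_000, batch_size, 2))
```

Each hand-off is a candidate for a core-to-core cache line transfer, so the count falls
in inverse proportion to the batch size.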
As expected, results for BDX and DOAX-U were similar, which indicates that, regardless
of other effects, both improve performance when using batching. BDX behavior is
very similar across targets, and it reduced the number of communication operations
in all programs. DOAX-U, however, requires a more detailed analysis. Although static
unrolling has the potential to enable additional optimizations, the interaction of the
large loop body with the micro-architecture may also result in unpredictable effects
during program execution. For example, an increased loop body size can put more pressure
on unified cache structures (for instance ARM L2 caches). This is noticeable for loops
which originally already had large bodies (the case of Ferret). Furthermore, as mentioned
before, DOAX-U requires forwarding synchronization flags at every iteration.
4.6 Application Speedup and Overall Results
After analyzing the communication/computation trade-off of each algorithm for
architectures with different memory hierarchies, and analyzing an approach to minimize
communication overhead, this section shows the programs’ final speedups. Figure 4.7 shows
the execution time speedup obtained using each algorithm, with respect to the original
serial implementation, for both target architectures. Results for BDX, DSWP and DOAX-U
are the best speedups obtained after analyzing the parameters of each algorithm. The goal
of this experiment is to compare how well these algorithms perform against each other and
across different architectures.
Figure 4.7: Speedup of BDX, DOAX, DOAX-U and DSWP for all programs on Intel and
ARM. Panels: Speedup on Intel (top) and Speedup on ARM (bottom), y axis from 0 to 2.
Programs: Ferret-Simlarge, Bodytrack-Simlarge, Hmmer-Train, MCF-Train, Bzip2-Program,
H263Dec-Medium, H263Dec-Small, Rotate, RotCC, RayRot and RgbYuv, grouped by Imbalanced
Stages / Balanced Stages and Coarse Grained / Fine Grained.
Footnote 2: The counters monitored were the following. Ivy Bridge: STALLS_L1D_PENDING,
STALLS_L2_PENDING and STALLS_LDM_PENDING. ARM A9 MPCore: Data cache dependent
stall cycles (0x61) and Data Access (0x04).
Programs with coarse grained and highly imbalanced stages did not leave room for
speedups. However, the stages of their loops execute enough work to amortize the overhead
of managing parallelism and, consequently, they did not suffer major slowdowns (minimum
speedup of 0.98). Programs with fine grained and highly imbalanced stages (e.g., MCF)
likewise provided little opportunity for speedup. In this scenario, however, it is very
difficult to prevent slowdowns, since the overhead of the instructions that manage
parallel execution increases due to the small loop body/stages. In this situation, we
observed that it is important for the parallelization algorithm to keep a low instruction
footprint; thus, BDX and DOAX-U are the best options (see Table 4.3) to minimize the
potential for slowdowns, with BDX performing better than DOAX-U overall.
As some examples, consider the results of four programs on the Intel target: Rotate,
RgbYuv, RotCC and RayRot (see Table 4.2 and Table 4.3 for their characteristics).
For these programs, the performance of the four algorithms was about the same. The
program Rotate has three stages and is reasonably balanced (15%-63%-22%); as a result, it
was parallelized using three threads and produced a speedup of 1.4x. The major limitation
to speedup in this program was stage imbalance. Notice that the program RotCC also has
three stages but is a bit more imbalanced (16%-75%-9%); it was parallelized the same
way, but produced a speedup of 1.2x. The program RgbYuv has four stages (2%-88%-
9%-1%) and was parallelized using four threads. However, since its largest stage (the
second one) dominates the loop, the performance was about the same as the serial version.
On the other hand, the program RayRot has five stages (3%-3%-82%-7%-3%) and was
parallelized with five threads, but its largest stage (the third one) is still very small
(in absolute numbers) and therefore the final performance was 0.9x of the serial version,
a slowdown of 10%.
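The role of stage imbalance in these four programs can be checked against the usual
pipeline bound: with one thread per stage, throughput is limited by the largest stage,
so the speedup cannot exceed the total work divided by the share of the largest stage.
The sketch below applies that bound to the stage percentages quoted above. It is a
standard idealization that ignores all communication cost, not a formula taken from this
work, and the function name is illustrative.

```python
def pipeline_speedup_bound(stage_shares):
    """Upper bound on pipeline speedup with one thread per stage."""
    return sum(stage_shares) / max(stage_shares)


# Stage shares (percent) as quoted in the text.
stage_shares = {
    "Rotate": [15, 63, 22],
    "RotCC": [16, 75, 9],
    "RgbYuv": [2, 88, 9, 1],
    "RayRot": [3, 3, 82, 7, 3],
}

for name, shares in stage_shares.items():
    print(f"{name}: at most {pipeline_speedup_bound(shares):.2f}x")
```

Under this bound Rotate could reach at most about 1.59x and RotCC about 1.33x, which is
consistent with the measured 1.4x and 1.2x. For RgbYuv and RayRot the bound is already
close to 1x (about 1.14x and 1.20x), so any overhead is enough to erase the gain.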
As shown, BDX performed as well as, and in many cases better than, all other approaches,
even for fine grained loops (like Hmmer, MCF and H263Dec). Moreover, the performance
of BDX was the most stable across all benchmarks, and when no algorithm provided a
speedup the slowdown resulting from BDX was the smallest.
When considering only the results of fine grained benchmarks for the Intel architecture,
one can see that DSWP did not perform well. Even worse was the performance of DOAX.
Nevertheless, DOAX-U significantly improved on the performance of the original DOAX,
showing speedups in several cases.
When considering all the results for the ARM target, we observed that the minimum
speedup of BDX was 0.95, for MCF-Train. On ARM, DSWP performance improved when compared
to its performance on the Intel core; however, it still performed worse than BDX and
DOAX-U. Overall, the best performing algorithm was BDX, followed by DOAX-U and
then DSWP.
Figure 4.8 shows the application speedup for programs parallelized using PS-BDX
and PS-DSWP. Results are shown only for programs that have at least one parallel stage.
Consecutive parallel stages were grouped together to form a single larger stage. The
speedup in this experiment was much larger than that shown in Figure 4.7; the reason
is that PS-BDX and PS-DSWP can further improve upon BDX/DSWP by exploiting the
DOALL parallelism present in some stages of the pipeline. The speedups of PS-BDX
were larger than those of PS-DSWP because the latter suffers from the same problem as
DSWP, namely a complex inter-core/thread communication mechanism.
Figure 4.8: Application speedup of programs parallelized with PS-BDX and PS-DSWP.
Bars: BDX and DSWP with 2, 4 and 8 threads for Ferret, KS, Otter, Rotate, RotCC, RayRot
and RgbYuv; y axis: Application Speedup (0 to 5).
          Highly Imbalanced   Well Balanced   Fine Grain   Coarse Grain
BDX               -                 X             X             X
DSWP              X                 -             -             X
DOAX-U            -                 X             X             -
Intel             X                 X             -             X
ARM               X                 X             X             -
Table 4.4: Which algorithm / architecture is best for each type of loop.
Overall, the only slowdown observed in this experiment was in the two-thread DSWP
version of KS. Five out of seven programs (KS, Rotate, RotCC, RayRot and RgbYuv)
scaled well up to four threads. However, using eight threads provided only marginal
improvement over the four-thread version for most of the benchmarks: Ferret, KS, RayRot
and RgbYuv. Analyzing Rotate and RotCC, we observed that the parallel stage in each
program is not sufficiently larger than the combination of the serial ones to accommodate
eight parallel threads executing useful work. The parallel stage in Rotate accounts for 63%
of the execution, which, split among eight threads, is not larger than the sum of the
execution time of the serial stages (37%). The same reasoning holds for Ferret and RotCC,
where the parallel stage execution time is not sufficiently larger (> 8x) than the
combined execution time of the serial ones.
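The arithmetic behind the Rotate example can be written out with Amdahl's law. This is a
deliberately pessimistic simplification that the text above does not make: it treats the
37% spent in serial stages as not overlapping with the parallel stage, whereas a pipeline
overlaps them in part. The function names are illustrative; the 63%/37% split is the one
quoted for Rotate.

```python
def parallel_share_per_thread(parallel_pct, n_threads):
    """Share of the original execution time each parallel thread works on."""
    return parallel_pct / n_threads


def amdahl_speedup(parallel_fraction, n_threads):
    """Amdahl's law: serial part unchanged, parallel part divided by n_threads."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n_threads)


# Rotate: parallel stage 63% of the execution, serial stages 37%.
for n in (2, 4, 8):
    share = parallel_share_per_thread(63, n)
    print(f"{n} threads: {share:.3f}% per thread, "
          f"estimated speedup {amdahl_speedup(0.63, n):.2f}x")
```

With eight threads each one handles under 8% of the original execution time, far less
than the 37% spent in the serial stages, and the estimated gain from four to eight
threads is smaller than the gain from two to four.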
The conclusion from implementing these algorithms and analyzing their results is that
PS-BDX is considerably simpler than PS-DSWP and yet may produce improved results.
While the implementation of PS-DSWP often requires a Reorder Buffer to enable
communication from parallel stages to sequential stages, in PS-BDX this orchestration is
implemented implicitly by the synchronization flags. Moreover, when only two threads
are used in the pipeline, PS-BDX will certainly produce better results, because PS-DSWP
will execute exactly as DSWP while PS-BDX will let the two threads overlap the execution
of the parallel stage.
Table 4.4 shows a summary of the results presented considering two main factors:
stage balance and loop/stage granularity. DSWP performs better for imbalanced loops
since there is not much room for parallelism anyway, and DSWP keeps the communication
latency out of the critical path. On the other hand, BDX and DOAX-U are better suited
for balanced loops; in this scenario, there is room to exploit parallelism, while batching
can hide communication latency. DSWP does not perform well for well-balanced loops,
as the overhead to manage the inter core queue affects all stages.
Now consider loop body granularity. For fine grained loops, DSWP performs poorly
due to the overhead to manage its software queue. BDX and DOAX-U are able to speed
up fine-grained loops because they use batching to mitigate the costs of communication.
Nevertheless, DOAX-U is not suitable for coarse grained benchmarks as large static unroll
factors can cause adverse effects on the cache hit ratio. For coarse grained benchmarks,
BDX or DSWP are better options when compared to DOAX-U, since they do not increase
the loop body size and do a better job in amortizing communication.
As shown in Table 4.4, load balancing did not play a significant role when varying
the target architecture. However, ARM is a better target for fine grained loops since its
communication latency is low, while Intel works better for large loops because, in this
scenario, a program can take advantage of its powerful cores and amortize communication
costs.
Overall, the above experiments support three conclusions:
(1) Inter-core communication is central to loop parallelization. Nevertheless, neither
ARM A9 MPCore nor Intel Ivy Bridge has an ideal combination of core performance
and inter-core communication latency to support efficient parallelization of
loops using DOAX and DSWP;
(2) Overall, BDX seems to inherit the best features of the other two algorithms, by
combining the low communication overhead of DOAX-U, with the ability to deal
with large loop bodies of DSWP;
(3) This study endorses the idea that some interesting speedups could emerge if hard-
ware mechanisms are provided to support efficient inter-core communication.
Chapter 5
Conclusions
Parallelization of cross-iteration dependent loops has been considered a very hard problem.
Many variations of DOACROSS and, more recently, Decoupled Software Pipelining (DSWP) have
been proposed to address this problem. Although these algorithms have offered solutions
for many loops, they still suffer from synchronization overheads.
We have shown that two of the most widespread modern computer architectures (Intel
Ivy Bridge and ARM A9 MPCore) do not provide a well-balanced combination of inter-core
communication latency and instruction execution capabilities. For
small loops, the ARM A9 MPCore, due to its relatively small inter-core communication
latency, provides more ground for loop parallelization than Ivy Bridge. However, the
Ivy Bridge cores are sufficiently faster than ARM A9 cores to make a substantial
difference for the parallelization of coarse loops. The net result is that neither of the
two architectures supports the efficient parallelization of both fine and coarse loops.
We proposed a new loop parallelization algorithm, called BDX, and one extension to
this base algorithm, called PS-BDX, that capitalize on the best features of DSWP
and DOACROSS. Our results showed that this algorithm, due to its low code footprint
and efficient communication, can successfully speed up loops that DOACROSS
and DSWP could not.
Another observation is that most hot loops found in the regular C/C++ applications
that we studied are very small, composed of only a couple of instructions, and make
extensive use of pointers. The intuition behind this is that these code regions were
already hand-tuned by the programmer into their most efficient version. In this scenario,
it is quite difficult to produce speedups, or even to prevent slowdowns, because
parallelization adds several new instructions to the loop and frequently restructures the
memory access pattern. With that in mind, BDX was designed to keep the modifications
to the original loop to a minimum and to use a simple communication scheme.
We do not claim that BDX is the best option for all loops. On the contrary, we argue
that there is no silver bullet for this problem. However, we designed BDX to use a very
lightweight communication mechanism and thus to handle communication effectively
even in very small loops. Our results show that BDX performs better than (or
at least on par with) current loop parallelization algorithms, and for that reason we plan
to release an extension of the Clang compiler that enables the user to parallelize a loop
by simply annotating it with OpenMP pragmas.
Bibliography
[1] Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. Compilers Prin-
ciples, Techniques, & Tools. 2007.
[2] A. Aiken and A. Nicolau. Optimal loop parallelization. In Proceedings of the ACM
SIGPLAN 1988 conference on Programming Language design and Implementation,
PLDI ’88, pages 308–317, New York, NY, USA, 1988. ACM.
[3] Vicki H Allan, Reese B Jones, Randall M Lee, and Stephen J Allan. Software pipelin-
ing. ACM Computing Surveys (CSUR), 27(3):367–432, 1995.
[4] Michael Andersch, Ben Juurlink, and Chi Ching Chi. A benchmark suite for evaluat-
ing parallel programming models. In Proceedings 24th Workshop on Parallel Systems
and Algorithms, 2011.
[5] James P. Anderson. Program structures for parallel processing. Commun. ACM,
8(12):786–788, December 1965.
[6] Jennifer M. Anderson, Lance M. Berc, Jeffrey Dean, Sanjay Ghemawat, Monika R.
Henzinger, Shun-Tak A. Leung, Richard L. Sites, Mark T. Vandevoorde, Carl A.
Waldspurger, and William E. Weihl. Continuous profiling: Where have all the cy-
cles gone? In Proceedings of the Sixteenth ACM Symposium on Operating Systems
Principles, SOSP ’97, pages 1–14, New York, NY, USA, 1997. ACM.
[7] ARM. Cortex-A9 MPCore Technical Reference Manual. 2012.
[8] Carl Baker, Henry; Hewitt. Laws for communicating parallel processes. MIT Artificial
Intelligence Laboratory Working Papers, page 18, Nov. 1976.
[9] Utpal Banerjee. Speedup of Ordinary Programs. PhD thesis, Champaign, IL, USA,
1979. AAI8008967.
[10] V. R. Basili and J. C. Knight. A language design for vector machines. In Proceedings
of the Conference on Programming Languages and Compilers for Parallel and Vector
Machines, pages 39–43, New York, NY, USA, 1975. ACM.
[11] Christian Bienia. Benchmarking Modern Multiprocessors. PhD thesis, Princeton
University, January 2011.
BIBLIOGRAPHY 58
[12] Emily Blem, Jaikrishnan Menon, and Karthikeyan Sankaralingam. Power struggles:
Revisiting the risc vs. cisc debate on contemporary arm and x86 architectures. In High
Performance Computer Architecture (HPCA2013), 2013 IEEE 19th International
Symposium on, pages 1–12. IEEE, 2013.
[13] Emily Blem, Jaikrishnan Menon, and Karthikeyan Sankaralingam. Power struggles:
Revisiting the risc vs. cisc debate on contemporary arm and x86 architectures. In
Proceedings of the 2013 IEEE 19th International Symposium on High Performance
Computer Architecture (HPCA), HPCA ’13, pages 1–12, Washington, DC, USA,
2013. IEEE Computer Society.
[14] Shekhar Borkar and Andrew A Chien. The future of microprocessors. Communica-
tions of the ACM, 54(5):67–77, 2011.
[15] Simone Campanoni, Timothy Jones, Glenn Holloway, Vijay Janapa Reddi, Gu-Yeon
Wei, and David Brooks. Helix: automatic parallelization of irregular programs for
chip multiprocessing. In Proceedings of the Tenth International Symposium on Code
Generation and Optimization, CGO ’12, 2012.
[16] Simone Campanoni, Timothy Jones, Glenn Holloway, Gu-Yeon Wei, and David
Brooks. The helix project: Overview and directions. In Proceedings of the 49th
Annual Design Automation Conference, DAC ’12, pages 277–282, New York, NY,
USA, 2012. ACM.
[17] Simone Campanoni, Timothy M Jones, Glenn Holloway, Gu-Yeon Wei, and David
Brooks. Helix: making the extraction of thread-level parallelism mainstream. IEEE
Micro, 32(4):0008–18, 2012.
[18] W. R. Chen, W. Yang, and W. C. Hsu. A lock-free cache-friendly software queue
buffer for decoupled software pipelining. In 2010 International Computer Symposium
(ICS2010), pages 997–1006, Dec 2010.
[19] Doreen Y Cheng. A survey of parallel programming languages and tools. Computer
Sciences Corporation, NASA Ames Research Center, Report RND-93-005 March,
1993.
[20] E. F. Codd, E. S. Lowry, E. McDonough, and C. A. Scalzi. Multiprogramming
stretch: Feasibility considerations. Commun. ACM, 2(11):13–17, November 1959.
[21] Melvin E. Conway. A multiprocessor system design. In Proceedings of the November
12-14, 1963, Fall Joint Computer Conference, AFIPS ’63 (Fall), pages 139–146, New
York, NY, USA, 1963. ACM.
[22] Ron Cytron. Doacross: Beyond vectorization for multiprocessors. In ICPP, pages
836–844, 1986.
[23] Ronald Gary Cytron. Compile-time Scheduling and Optimization for Asynchronous
Machines (Multiprocessor, Compiler, Parallel Processing). PhD thesis, Champaign,
IL, USA, 1984. AAI8502121.
BIBLIOGRAPHY 59
[24] D. César, R. Auler, R. Dalibera, S. Rigo, E. Borin, and G. Araújo. Modeling vir-
tual machines misprediction overhead. In 2013 IEEE International Symposium on
Workload Characterization (IISWC), pages 153–162, Sept 2013.
[25] Jack B. Dennis. First version of a data flow procedure language, pages 362–376.
Springer Berlin Heidelberg, Berlin, Heidelberg, 1974.
[26] Evelyn Duesterwald and Vasanth Bala. Software profiling for hot path prediction:
Less is more. SIGPLAN Not., 35(11):202–211, November 2000.
[27] R. Duncan. A survey of parallel computer architectures. Computer, 23(2):5–16, Feb
1990.
[28] Hadi Esmaeilzadeh, Emily Blem, Renée St Amant, Karthikeyan Sankaralingam, and
Doug Burger. Power challenges may end the multicore era. Communications of the
ACM, 56(2):93–102, 2013.
[29] Hadi Esmaeilzadeh, Emily Blem, Renee St Amant, Karthikeyan Sankaralingam, and
Doug Burger. Dark silicon and the end of multicore scaling. In ACM SIGARCH
Computer Architecture News, volume 39, pages 365–376. ACM, 2011.
[30] Hadi Esmaeilzadeh, Emily Blem, Renée St Amant, Karthikeyan Sankaralingam, and
Doug Burger. Power limitations and dark silicon challenge the future of multicore.
ACM Transactions on Computer Systems (TOCS), 30(3):11, 2012.
[31] Jeanne Ferrante, Karl J. Ottenstein, and Joe D. Warren. The program dependence
graph and its use in optimization. volume 9, pages 319–349, New York, NY, USA,
July 1987. ACM.
[32] M. J. Flynn. Very high-speed computing systems. Proceedings of the IEEE,
54(12):1901–1909, Dec 1966.
[33] Jason E. Fritts, Frederick W. Steiling, Joseph A. Tucek, and Wayne Wolf. Media-
bench ii video: Expediting the next generation of video systems research. Micropro-
cess. Microsyst., 33(4), 2009.
[34] John Giacomoni, Tipp Moseley, and Manish Vachharajani. Fastforward for efficient
pipeline parallelism: A cache-optimized concurrent lock-free queue. In Proceedings
of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Pro-
gramming, PPoPP ’08, pages 43–52, New York, NY, USA, 2008. ACM.
[35] S. Gill. Parallel programming. The Computer Journal, 1(1):2, 1958.
[36] P. A. Gilmore. Structuring of parallel algorithms. J. ACM, 15(2):176–192, April
1968.
[37] Rehan Hameed, Wajahat Qadeer, Megan Wachs, Omid Azizi, Alex Solomatnikov,
Benjamin C Lee, Stephen Richardson, Christos Kozyrakis, and Mark Horowitz. Un-
derstanding sources of inefficiency in general-purpose chips. In ACM SIGARCH
Computer Architecture News, volume 38, pages 37–47. ACM, 2010.
BIBLIOGRAPHY 60
[38] Nikos Hardavellas, Michael Ferdman, Babak Falsafi, and Anastasia Ailamaki. Toward
dark silicon in servers. IEEE Micro, 31(4):6–15, 2011.
[39] John L Hennessy and David A Patterson. Computer architecture: a quantitative
approach. Elsevier, 2011.
[40] John L. Hennessy and David A. Patterson. Computer Architecture, Fifth Edition:
A Quantitative Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA,
USA, 5th edition, 2011.
[41] John L. Henning. Spec cpu2006 benchmark descriptions. SIGARCH Comput. Archit.
News, 34(4):1–17, September 2006.
[42] Jialu Huang, Thomas B. Jablin, Stephen R. Beard, Nick P. Johnson, and David I.
August. Automatically exploiting cross-invocation parallelism using runtime infor-
mation. In Proceedings of the 2013 IEEE/ACM International Symposium on Code
Generation and Optimization (CGO), CGO ’13, pages 1–11, Washington, DC, USA,
2013. IEEE Computer Society.
[43] Jialu Huang, Arun Raman, Thomas B. Jablin, Yun Zhang, Tzu-Han Hung, and
David I. August. Decoupled software pipelining creates parallelization opportuni-
ties. In Proceedings of the 8th annual IEEE/ACM international symposium on Code
generation and optimization, CGO ’10, pages 121–130, New York, NY, USA, 2010.
ACM.
[44] Ali R. Hurson, Joford T. Lim, Krishna M. Kavi, and Ben Lee. Parallelization of doall
and doacross loops - a survey. Advances in Computers, 45:53–103, 1997.
[45] Intel. Intel 64 and IA-32 Architectures Optimization Reference Manual. 2013.
[46] Thomas B Jablin, Yun Zhang, James A Jablin, Jialu Huang, Hanjun Kim, and
David I August. Liberty queues for epic architectures. In Proceedings of the Eigth
Workshop on Explicitly Parallel Instruction Computer Architectures and Compiler
Technology (EPIC), 2010.
[47] Henry Kasim, Verdi March, Rita Zhang, and Simon See. Survey on Parallel Pro-
gramming Model, pages 266–275. Springer Berlin Heidelberg, Berlin, Heidelberg,
2008.
[48] Ken Kennedy and John R. Allen. Optimizing compilers for modern architectures: a
dependence-based approach. Morgan Kaufmann Publishers Inc., San Francisco, CA,
USA, 2002.
[49] Hanjun Kim, Nick P. Johnson, Jae W. Lee, Scott A. Mahlke, and David I. August.
Automatic speculative doall for clusters. In Proceedings of the Tenth International
Symposium on Code Generation and Optimization, CGO ’12, pages 94–103, New
York, NY, USA, 2012. ACM.
BIBLIOGRAPHY 61
[50] P. M. Kogge. Parallel solution of recurrence problems. IBM J. Res. Dev., 18(2):138–
148, March 1974.
[51] D. J. Kuck, R. H. Kuhn, D. A. Padua, B. Leasure, and M. Wolfe. Dependence graphs
and compiler optimizations. In Proceedings of the 8th ACM SIGPLAN-SIGACT
Symposium on Principles of Programming Languages, POPL ’81, pages 207–218,
New York, NY, USA, 1981. ACM.
[52] David J. Kuck. A survey of parallel machine organization and programming. ACM
Comput. Surv., 9(1):29–59, March 1977.
[53] S. Y. Kung, S. C. Lo, S. N. Jean, and J. N. Hwang. Wavefront array processors-
concept to implementation. Computer, 20(7):18–33, July 1987.
[54] L. Lamport. How to make a multiprocessor computer that correctly executes multi-
process programs. IEEE Transactions on Computers, C-28(9):690–691, Sept 1979.
[55] Leslie Lamport. The parallel execution of do loops. Commun. ACM, 17(2):83–93,
February 1974.
[56] Leslie Lamport. Specifying concurrent program modules. ACM Trans. Program.
Lang. Syst., 5(2):190–222, April 1983.
[57] Patrick PC Lee, Tian Bu, and Girish Chandranmenon. A lock-free, cache-efficient
shared ring buffer for multi-core architectures. In Proceedings of the 5th ACM/IEEE
Symposium on Architectures for Networking and Communications Systems, pages
78–79. ACM, 2009.
[58] P.P.C. Lee, T. Bu, and G. Chandranmenon. A lock-free, cache-efficient multi-core
synchronization mechanism for line-rate network traffic monitoring. In Parallel Dis-
tributed Processing (IPDPS), 2010 IEEE International Symposium on, pages 1–12,
2010.
[59] Sanghoon Lee, Devesh Tiwari, Yan Solihin, and James Tuck. Haqu: Hardware-
accelerated queueing for fine-grained threading on a chip multiprocessor. In Proceed-
ings of the 2011 IEEE 17th International Symposium on High Performance Computer
Architecture, HPCA ’11, pages 99–110, Washington, DC, USA, 2011. IEEE Computer
Society.
[60] Meredydd Luff and Simon Moore. Asynchronous remote stores for inter-processor
communication. In In Future Architectural Support for Parallel Programming,
FASPP’12, 2009.
[61] Niel K. Madsen, Garry H. Rodrigue, and Jack I. Karush. Matrix multiplication by
diagonals on a vector/parallel processor. Information Processing Letters, 5(2):41 –
45, 1976.
BIBLIOGRAPHY 62
[62] Angeles Navarro, Rafael Asenjo, Siham Tabik, and Calin Cascaval. Analytical model-
ing of pipeline parallelism. In Proceedings of the 2009 18th International Conference
on Parallel Architectures and Compilation Techniques, PACT ’09, pages 281–290,
Washington, DC, USA, 2009. IEEE Computer Society.
[63] Allen Newell. On programming a highly parallel machine to be an intelligent tech-
nician. In Papers Presented at the May 3-5, 1960, Western Joint IRE-AIEE-ACM
Computer Conference, IRE-AIEE-ACM ’60 (Western), pages 267–282, New York,
NY, USA, 1960. ACM.
[64] OpenMP Architecture Review Board. OpenMP application program interface version
4.5, November 2015.
[65] Ascher Opler. Procedure-oriented language statements to facilitate parallel process-
ing. Commun. ACM, 8(5):306–307, May 1965.
[66] Guilherme Ottoni, Ram Rangan, Adam Stoler, and David I. August. Automatic
thread extraction with decoupled software pipelining. In Proceedings of the 38th an-
nual IEEE/ACM International Symposium on Microarchitecture, MICRO 38, pages
105–118, Washington, DC, USA, 2005. IEEE Computer Society.
[67] Peter S. Pacheco. An introduction to parallel programming. Morgan Kaufmann
Publishers Inc., 2011.
[68] David A. Padua and Michael J. Wolfe. Advanced compiler optimizations for super-
computers. Commun. ACM, 29(12):1184–1201, December 1986.
[69] David A Patterson and John L Hennessy. Computer Organization and Design RISC-
V Edition: The Hardware Software Interface. Morgan kaufmann, 2017.
[70] Keshav Pingali, Donald Nguyen, Milind Kulkarni, Martin Burtscher, M. Amber Has-
saan, Rashid Kaleem, Tsung-Hsien Lee, Andrew Lenharth, Roman Manevich, Mario
Méndez-Lojo, Dimitrios Prountzos, and Xin Sui. The tao of parallelism in algo-
rithms. In Proceedings of the 32nd ACM SIGPLAN conference on Programming
language design and implementation, PLDI ’11, pages 12–25, New York, NY, USA,
2011. ACM.
[71] Prakash Prabhu, Soumyadeep Ghosh, Yun Zhang, Nick P. Johnson, and David I.
August. Commutative set: a language extension for implicit parallel programming.
In Proceedings of the 32nd ACM SIGPLAN conference on Programming language
design and implementation, PLDI ’11, pages 1–11, New York, NY, USA, 2011. ACM.
[72] Vaughan R. Pratt and Larry J. Stockmeyer. A characterization of the power of vector
machines. Journal of Computer and System Sciences, 12(2):198 – 221, 1976.
[73] Thomas Preud’homme, Julien Sopena, Gael Thomas, and Bertil Folliot. Batchqueue:
Fast and memory-thrifty core to core communication. In Proceedings of the 2010
22Nd International Symposium on Computer Architecture and High Performance
BIBLIOGRAPHY 63
Computing, SBAC-PAD ’10, pages 215–222, Washington, DC, USA, 2010. IEEE
Computer Society.
[74] Rochit Rajsuman. System-on-a-Chip: Design and Test. Artech House, Inc., Nor-
wood, MA, USA, 1st edition, 2000.
[75] Arun Raman, Hanjun Kim, Thomas R. Mason, Thomas B. Jablin, and David I.
August. Speculative parallelization using software multi-threaded transactions. In
Proceedings of the fifteenth edition of ASPLOS on Architectural support for program-
ming languages and operating systems, ASPLOS XV, pages 65–76, New York, NY,
USA, 2010. ACM.
[76] Arun Raman, Hanjun Kim, Taewook Oh, Jae W. Lee, and David I. August. Paral-
lelism orchestration using dope: the degree of parallelism executive. In Proceedings
of the 32nd ACM SIGPLAN conference on Programming language design and imple-
mentation, PLDI ’11, pages 26–37, New York, NY, USA, 2011. ACM.
[77] Arun Raman, Ayal Zaks, Jae W. Lee, and David I. August. Parcae: a system for
flexible parallel execution. In Proceedings of the 33rd ACM SIGPLAN conference on
Programming Language Design and Implementation, PLDI ’12, pages 133–144, New
York, NY, USA, 2012. ACM.
[78] Easwaran Raman, Guilherme Ottoni, Arun Raman, Matthew J. Bridges, and David I.
August. Parallel-stage decoupled software pipelining. In Proceedings of the 6th annual
IEEE/ACM international symposium on Code generation and optimization, CGO
’08, pages 114–123, New York, NY, USA, 2008. ACM.
[79] Ram Rangan and David I August. Amortizing software queue overhead for pipelined
interthread communication. In Proceedings of the Workshop on Programming Models
for Ubiquitous Parallelism (PMUP), pages 1–5. Citeseer, 2006.
[80] Ram Rangan, Neil Vachharajani, Guilherme Ottoni, and David I. August. Perfor-
mance scalability of decoupled software pipelining. volume 5, pages 8:1–8:25, New
York, NY, USA, September 2008. ACM.
[81] Ram Rangan, Neil Vachharajani, Adam Stoler, Guilherme Ottoni, David I. August,
and George Z. N.q Cai. Support for high-frequency streaming in cmps. In Proceed-
ings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture,
MICRO 39, pages 259–272, Washington, DC, USA, 2006. IEEE Computer Society.
[82] Ram Rangan, Neil Vachharajani, Manish Vachharajani, and David I. August. De-
coupled software pipelining with the synchronization array. In Proceedings of the
13th International Conference on Parallel Architectures and Compilation Techniques,
PACT ’04, pages 177–188, Washington, DC, USA, 2004. IEEE Computer Society.
[83] Sean Rul, Hans Vandierendonck, and Koen De Bosschere. A profile-based tool for
finding pipeline parallelism in sequential programs. Parallel Comput., 36(9):531–551,
September 2010.
BIBLIOGRAPHY 64
[84] Richard M. Russell. The cray-1 computer system. Commun. ACM, 21(1):63–72,
January 1978.
[85] Randolf G Scarborough and Harwood G Kolsky. A vectorizing fortran compiler. IBM
J. Res. Dev., 30(2):163–171, March 1986.
[86] Michael B Taylor. Is dark silicon useful? harnessing the four horsemen of the com-
ing dark silicon apocalypse. In Design Automation Conference (DAC), 2012 49th
ACM/EDAC/IEEE, pages 1131–1136. IEEE, 2012.
[87] William Thies, Vikram Chandrasekhar, and Saman Amarasinghe. A practical ap-
proach to exploiting coarse-grained pipeline parallelism in c programs. In Proceedings
of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, MI-
CRO 40, pages 356–369, Washington, DC, USA, 2007. IEEE Computer Society.
[88] C. D. Thompson and H. T. Kung. Sorting on a mesh-connected parallel computer.
Commun. ACM, 20(4):263–271, April 1977.
[89] Priya Unnikrishnan, Jun Shirako, Kit Barton, Sanjay Chatterjee, Raul Silvera, and
Vivek Sarkar. A practical approach to doacross parallelization. In Proceedings of the
18th international conference on Parallel Processing, Euro-Par’12, 2012.
[90] Neil Vachharajani, Ram Rangan, Easwaran Raman, Matthew J. Bridges, Guilherme
Ottoni, and David I. August. Speculative decoupled software pipelining. In Proceed-
ings of the 16th International Conference on Parallel Architecture and Compilation
Techniques, PACT ’07, pages 49–59, Washington, DC, USA, 2007. IEEE Computer
Society.
[91] Neil Vachharajani, Ram Rangan, Easwaran Raman, Matthew J. Bridges, Guilherme
Ottoni, and David I. August. Speculative decoupled software pipelining. In Proceed-
ings of the 16th International Conference on Parallel Architecture and Compilation
Techniques, PACT ’07, pages 49–59, Washington, DC, USA, 2007. IEEE Computer
Society.
[92] Hans Vandierendonck, Sean Rul, and Koen De Bosschere. The paralax infrastructure:
automatic parallelization with a helping hand. In Proceedings of the 19th interna-
tional conference on Parallel architectures and compilation techniques, PACT ’10,
pages 389–400, New York, NY, USA, 2010. ACM.
[93] Ganesh Venkatesh, Jack Sampson, Nathan Goulding, Saturnino Garcia, Vladyslav
Bryksin, Jose Lugo-Martinez, Steven Swanson, and Michael Bedford Taylor. Con-
servation cores: reducing the energy of mature computations. In ACM SIGARCH
Computer Architecture News, volume 38, pages 205–218. ACM, 2010.
[94] Michael W. Multiprocessor synchronization for concurrent loops. IEEE Parallel
Programming, pages 34–42, 1988.
[95] Douglas Brent West et al. Introduction to graph theory, volume 2. Prentice hall
Upper Saddle River, 2001.
BIBLIOGRAPHY 65
[96] Yuanming Zhang, Kanemitsu Ootsu, Takashi Yokota, and Takanobu Baba. Clus-
tered decoupled software pipelining on commodity cmp. In Parallel and Distributed
Systems, 2008. ICPADS ’08. 14th IEEE International Conference on, pages 681–688,
Dec 2008.
[97] Hongtao Zhong, Mojtaba Mehrara, Steve Lieberman, and Scott Mahlke. Uncovering
hidden loop level parallelism in sequential applications. In 2008 IEEE 14th Interna-
tional Symposium on High Performance Computer Architecture, pages 290–301, Feb
2008.
Appendix A
Speedup by Batch and Stage Size
                                        Batch Size
Stage Size    1   10   20   30   40   50   60   70   80   90  100  200  300  400  500
         1  0.1  0.4  0.5  0.6  0.7  0.6  0.7  0.8  0.8  0.8  0.9  0.7  0.7  0.7  0.7
         2  0.1  0.7  1.0  1.1  1.1  1.3  1.5  1.4  1.4  1.4  1.3  1.1  1.1  1.1  1.2
         3  0.2  0.7  0.9  1.0  1.1  1.1  1.2  1.2  1.2  1.2  1.2  1.2  1.1  1.2  1.2
         4  0.2  0.8  1.0  1.1  1.1  1.2  1.3  1.3  1.2  1.3  1.3  1.2  1.3  1.3  1.3
         5  0.2  0.9  1.1  1.1  1.2  1.3  1.2  1.3  1.3  1.4  1.4  1.3  1.3  1.3  1.3
         6  0.3  1.0  1.2  1.2  1.3  1.3  1.3  1.3  1.3  1.4  1.4  1.3  1.4  1.3  1.4
         7  0.3  1.0  1.1  1.2  1.3  1.4  1.3  1.3  1.4  1.3  1.4  1.4  1.4  1.4  1.4
         8  0.4  0.9  1.0  1.0  1.0  1.0  1.1  1.1  1.0  1.0  1.0  1.5  1.4  1.5  1.4
         9  0.4  0.9  1.0  1.1  1.0  1.0  1.1  1.0  1.0  1.1  1.0  1.2  1.4  1.3  1.2
        10  0.4  1.1  1.1  1.2  1.2  1.3  1.3  1.2  1.3  1.3  1.3  1.4  1.5  1.4  1.4
        20  0.7  1.3  1.3  1.3  1.7  1.5  1.5  1.5  1.5  1.7  1.4  1.4  1.3  1.2  1.3
        30  0.9  1.4  1.5  1.5  1.4  1.6  1.6  1.6  1.5  1.4  1.6  1.5  1.4  1.5  1.6
        40  1.0  1.4  1.5  1.5  1.5  1.5  1.6  1.6  1.6  1.6  1.7  1.5  1.4  1.6  1.4
        50  1.1  1.5  1.5  1.6  1.5  1.6  1.6  1.6  1.6  1.5  1.6  1.6  1.6  1.5  1.5
        60  1.2  1.6  1.5  1.6  1.5  1.8  1.6  1.7  1.6  1.5  1.6  1.5  1.5  1.5  1.5
        70  1.1  1.5  1.6  1.6  1.6  1.7  1.6  1.6  1.8  1.7  1.6  1.5  1.6  1.7  1.6
        80  1.2  1.6  1.7  1.6  1.6  1.7  1.6  1.7  1.7  1.6  1.7  1.5  1.6  1.7  1.6
        90  1.3  1.7  1.7  1.7  1.7  1.7  1.7  1.7  1.8  1.7  1.8  1.6  1.8  1.7  1.7
       100  1.3  1.6  1.7  1.7  1.7  1.7  1.7  1.7  1.8  1.7  1.7  1.7  1.8  1.6  1.6
Table A.1: BDX speedup as a function of batch size and stage size.
Appendix B
Speedup by variable and uses
Rows: number of uses. Columns: number of variables.
Uses   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20
   0 1.3 1.5 1.6 1.8 1.5 1.7 1.5 1.6 1.6 1.4 1.8 1.4 1.5 1.5 1.4 1.6 1.4 1.6 1.4 1.7 1.4
   1 1.4 1.0 1.0 0.9 0.9 0.8 0.8 0.8 0.9 0.7 0.9 0.9 1.0 1.0 1.0 1.0 1.1 1.1 1.1 1.1 1.2
   2 1.4 0.9 0.8 0.8 0.8 0.9 1.1 1.3 1.2 1.1 1.1 1.1 1.3 1.1 1.2 1.3 1.2 1.3 1.3 1.3 1.2
   3 1.4 1.0 0.8 0.9 1.1 1.2 1.2 1.4 1.3 1.3 1.2 1.3 1.2 1.3 1.4 1.4 1.4 1.4 1.4 1.4 1.4
   4 1.4 0.9 0.8 1.1 1.3 1.2 1.3 1.3 1.2 1.4 1.3 1.4 1.5 1.4 1.4 1.4 1.3 1.5 1.4 1.4 1.5
   5 1.4 0.8 1.1 1.2 1.2 1.4 1.4 1.4 1.4 1.4 1.2 1.3 1.4 1.4 1.5 1.5 1.4 1.4 1.6 1.6 1.5
   6 1.4 0.8 1.1 1.3 1.2 1.3 1.3 1.5 1.4 1.4 1.4 1.5 1.5 1.4 1.5 1.5 1.6 1.6 1.6 1.6 1.6
   7 1.4 0.8 1.2 1.4 1.3 1.4 1.4 1.5 1.5 1.5 1.4 1.4 1.5 1.6 1.6 1.6 1.6 1.6 1.6 1.6 1.6
   8 1.4 0.8 1.1 1.3 1.4 1.4 1.4 1.6 1.5 1.4 1.5 1.6 1.6 1.6 1.6 1.6 1.6 1.6 1.6 1.6 1.6
   9 1.4 0.9 1.3 1.4 1.3 1.4 1.5 1.4 1.6 1.5 1.5 1.6 1.6 1.6 1.7 1.7 1.4 1.6 1.6 1.7 1.7
  10 1.5 0.9 1.4 1.4 1.4 1.4 1.4 1.5 1.5 1.5 1.7 1.6 1.6 1.6 1.7 1.7 1.6 1.7 1.7 1.7 1.7
  11 1.4 1.0 1.3 1.2 1.4 1.6 1.6 1.5 1.6 1.5 1.6 1.7 1.6 1.6 1.7 1.7 1.7 1.7 1.6 1.7 1.7
  12 1.4 1.1 1.4 1.6 1.4 1.5 1.5 1.6 1.6 1.6 1.6 1.6 1.6 1.6 1.7 1.7 1.7 1.7 1.7 1.6 1.8
  13 1.4 1.1 1.3 1.4 1.4 1.5 1.5 1.6 1.7 1.7 1.7 1.7 1.7 1.6 1.7 1.7 1.7 1.6 1.7 1.6 1.7
  14 1.4 1.1 1.4 1.5 1.3 1.5 1.6 1.6 1.6 1.6 1.6 1.7 1.7 1.7 1.7 1.7 1.7 1.7 1.7 1.7 1.7
  15 1.4 1.3 1.4 1.4 1.5 1.6 1.6 1.6 1.6 1.7 1.7 1.7 1.7 1.7 1.7 1.8 1.7 1.7 1.7 1.7 1.7
  16 1.4 1.1 1.4 1.5 1.4 1.6 1.6 1.6 1.7 1.6 1.7 1.7 1.7 1.6 1.8 1.7 1.7 1.7 1.7 1.7 1.7
  17 1.4 1.3 1.4 1.6 1.6 1.6 1.7 1.6 1.7 1.6 1.7 1.7 1.7 1.7 1.7 1.5 1.7 1.7 1.8 1.8 1.7
  18 1.4 1.1 1.3 1.5 1.5 1.6 1.6 1.6 1.7 1.7 1.6 1.7 1.8 1.7 1.8 1.7 1.7 1.7 1.7 1.7 1.7
  19 1.4 1.1 1.4 1.4 1.5 1.6 1.6 1.6 1.7 1.7 1.7 1.7 1.7 1.7 1.8 1.8 1.7 1.7 1.7 1.8 1.7
  20 1.4 1.3 1.6 1.6 1.5 1.6 1.7 1.6 1.7 1.6 1.8 1.8 1.7 1.7 1.7 1.8 1.7 1.7 1.7 1.7 1.7
Table B.1: BDX speedup as a function of the number of variables and the number of
variable uses.
Appendix C
Serial Version of the Example Loop
#include <stdio.h>
#include <stdlib.h>

typedef struct __node__ {
  int val;
  struct __node__* next;
} node;

/* Fill a list of `length` nodes with values 0..length-1, starting at *root. */
void createList(node** root, int length) {
  (*root)->val = 0;

  node** prev = &(*root)->next;

  for (int i = 1; i < length; i++) {
    node* n1 = (node*) malloc(sizeof(node));

    n1->val = i;
    *prev = n1;

    prev = &n1->next;
  }

  *prev = NULL;  /* terminate the list; malloc does not zero memory */
}

int main() {
  node* root = (node*) malloc(sizeof(node));

  createList(&root, 30);

  /* The loop advances before printing, so the root node is skipped. */
  node* ptr = root;
  while ((ptr = ptr->next)) {
    printf("Node value => %d\n", ptr->val);
  }

  return 0;
}
Appendix D
BDX Version of the Example Loop
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

/* Token-passing locks: a stage may be entered only by the thread whose id
   equals the lock value; releasing hands the token to the next thread.
   The volatile busy-wait flags follow the original scheme; portable code
   would use C11 atomics instead. */
#define ACQUIRE(lock) while ((lock) != MyThreadId)
#define RELEASE(lock) (lock) = (MyThreadId + 1) % TOTAL_NUM_THREADS

#define NUM_NODES 30
#define TOTAL_NUM_THREADS 2
#define BDX_BATCH_SIZE 50
#define WAIT 0
#define ITERATE 1
#define FINISH 2

typedef struct __node__ {
  int val;
  struct __node__* next;
} node;

volatile node* Sptr;
volatile int bdxThreadStatus[2];
volatile int StageA_Lock;
volatile int StageB_Lock;

pthread_t bdx_threads[TOTAL_NUM_THREADS];

void createList(node** root, int length) {
  (*root)->val = 0;

  node** prev = &(*root)->next;

  for (int i = 1; i < length; i++) {
    node* n1 = (node*) malloc(sizeof(node));

    n1->val = i;
    *prev = n1;

    prev = &n1->next;
  }

  *prev = NULL;  /* terminate the list; malloc does not zero memory */
}

void* BDX_Thread_Source(void* _param) {
  int OriginalLoopCondition = 0, MyThreadId = 1;
  node* BdxBufferPtrA[BDX_BATCH_SIZE];

  ////////////////////////////////////////////////////////
  // Keep threads alive until the program finishes execution
  // bdxThreadStatus == 0, [busy] waiting.
  // bdxThreadStatus == 1, iterate to execute the loop once.
  // bdxThreadStatus == 2, finish this thread.
  while (1) {
    while (bdxThreadStatus[MyThreadId] == WAIT);
    if (bdxThreadStatus[MyThreadId] == FINISH) break;

    /* Iterate over several batches until the original loop condition is false */
    while (!OriginalLoopCondition) {
      int IdxA = 0, IdxB = 0;

      /* Start Stage A */
      ACQUIRE(StageA_Lock);
      node* ptr = (node*) Sptr;

      while (IdxA < BDX_BATCH_SIZE && (ptr && (ptr = ptr->next))) {
        BdxBufferPtrA[IdxA++] = ptr;
      }

      Sptr = ptr;
      OriginalLoopCondition = (ptr == NULL);
      RELEASE(StageA_Lock);
      /* End Stage A */

      /* Start Stage B */
      ACQUIRE(StageB_Lock);
      while (IdxB < IdxA) {
        ptr = BdxBufferPtrA[IdxB++];
        printf("Node value => %d\n", ptr->val);
      }
      RELEASE(StageB_Lock);
      /* End Stage B */
    }

    bdxThreadStatus[MyThreadId] = WAIT;
  }

  return NULL;
}

/*
 * Main entry point
 */
int main() {
  int MyThreadId = 0, threadIdHolder = 1, OriginalLoopCondition = 0;
  node* root = (node*) malloc(sizeof(node));
  node* BdxBufferPtrA[BDX_BATCH_SIZE];

  /* Initialize loop list */
  createList(&root, NUM_NODES);

  /* Create other BDX threads */
  pthread_create(&bdx_threads[threadIdHolder], NULL, BDX_Thread_Source, &threadIdHolder);

  /* Set up input variables to the loop */
  bdxThreadStatus[threadIdHolder] = ITERATE;
  StageA_Lock = StageB_Lock = 0;
  Sptr = root;

  while (!OriginalLoopCondition) {
    int IdxA = 0, IdxB = 0;

    /* Start Stage A */
    ACQUIRE(StageA_Lock);
    node* ptr = (node*) Sptr;

    while (IdxA < BDX_BATCH_SIZE && (ptr && (ptr = ptr->next))) {
      BdxBufferPtrA[IdxA++] = ptr;
    }

    Sptr = ptr;
    OriginalLoopCondition = (ptr == NULL);
    RELEASE(StageA_Lock);
    /* End Stage A */

    /* Start Stage B */
    ACQUIRE(StageB_Lock);
    while (IdxB < IdxA) {
      ptr = BdxBufferPtrA[IdxB++];
      printf("Node value => %d\n", ptr->val);
    }
    RELEASE(StageB_Lock);
    /* End Stage B */
  }

  /* Wait for all other threads to confirm they have finished processing */
  while (bdxThreadStatus[threadIdHolder] != WAIT);

  /* Tell all other threads to finish execution */
  bdxThreadStatus[threadIdHolder] = FINISH;

  /* Wait for the other threads to finish execution */
  pthread_join(bdx_threads[threadIdHolder], NULL);

  return 0;
}
Appendix E
Speedup by Batch and Stage Size -
The Sources
#include <stdio.h>
#include "config.h"

int main() {
  register int ans = 0;

  for (int i = 0; i < ITERATIONS; i++) {
    /* ***** Begin Stage 1 ******** */
    for (int j = 0; j < LOAD; j++);
    /* ***** End of Stage 1 ****** */

    /* ***** Begin Stage 2 ******** */
    for (int j = 0; j < LOAD; j++);
    /* ***** End of Stage 2 ****** */
  }

  printf("ans = %d\n", ans);

  return 0;
}
Listing E.1: exp_batch_sizes-ser.cpp
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <ittnotify.h>
#include "config.h"

volatile int __bdx_flags[] = {0, 0, 0};

int main() {
  register int ans = 0;

  #pragma omp parallel for num_threads(2) schedule(static, 1)
  for (int i = 0; i < ITERATIONS; i += BATCH) {
    /* ***** Begin Stage 1 ******** */
    WAIT(0);
    for (int __bdx_i = i; (__bdx_i < i + BATCH) && __bdx_i < ITERATIONS; __bdx_i++) {
      for (int j = 0; j < LOAD; j++);
    }
    POST(0);
    /* ***** End of Stage 1 ****** */

    /* ***** Begin Stage 2 ******** */
    WAIT(1);
    for (int __bdx_i = i; (__bdx_i < i + BATCH) && __bdx_i < ITERATIONS; __bdx_i++) {
      for (int j = 0; j < LOAD; j++);
    }
    POST(1);
    /* ***** End of Stage 2 ****** */
  }

  printf("ans = %d\n", ans);

  return 0;
}
Listing E.2: exp_batch_sizes-par.cpp
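The batching in Listing E.2 can be read as splitting the iteration space into half-open ranges, one per outer-loop iteration. The following is a minimal sketch of that splitting; the helper name `batch_ranges` is hypothetical and is not part of the thesis sources.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical helper (not in the thesis sources): lists the half-open
// iteration ranges [begin, end) visited by the outer loop "i += BATCH".
std::vector<std::pair<int, int>> batch_ranges(int iterations, int batch) {
  assert(batch > 0);  // a non-positive batch would never advance the loop
  std::vector<std::pair<int, int>> ranges;
  for (int i = 0; i < iterations; i += batch) {
    // Same bound as "(__bdx_i < i + BATCH) && __bdx_i < ITERATIONS":
    // the last batch is clamped to the end of the iteration space.
    ranges.push_back({i, std::min(i + batch, iterations)});
  }
  return ranges;
}
```

For example, `batch_ranges(10, 4)` yields [0, 4), [4, 8) and [8, 10). Under `schedule(static, 1)` with two threads, even-numbered ranges are assigned to thread 0 and odd-numbered ranges to thread 1.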
#define ITERATIONS 1000000

#ifndef LOAD
#define LOAD 10
#endif

#ifndef BATCH
#define BATCH 1000
#endif

/* Spin until the flag of stage STG holds the id of the calling thread */
#define WAIT(STG) do { while (__bdx_flags[STG] != omp_get_thread_num()); } while (0)
/* Hand stage STG over to the other thread (ids alternate between 0 and 1) */
#define POST(STG) do { __bdx_flags[STG] = (__bdx_flags[STG] + 1) % 2; } while (0)
Listing E.3: config.h
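The WAIT and POST macros of Listing E.3 pass a per-stage token between the two threads. The sketch below is a single-threaded model of that hand-off, not a concurrent implementation: it only replays the flag updates to show the order in which batches enter one stage. The function name `stage_entry_order` is hypothetical.

```cpp
#include <cassert>
#include <vector>

// Hypothetical single-threaded model of one stage guarded by WAIT/POST.
// Returns the batch indices in the order they enter the stage.
std::vector<int> stage_entry_order(int num_batches) {
  assert(num_batches >= 0);
  const int num_threads = 2;
  int flag = 0;  // models __bdx_flags[STG]: id of the thread allowed to enter
  // schedule(static, 1): thread t runs batches t, t + 2, t + 4, ...
  int next_batch[num_threads] = {0, 1};
  std::vector<int> order;
  while (static_cast<int>(order.size()) < num_batches) {
    for (int t = 0; t < num_threads; ++t) {
      if (next_batch[t] >= num_batches) continue;  // thread has no work left
      if (flag != t) continue;                     // WAIT(STG) would keep spinning
      order.push_back(next_batch[t]);              // stage body runs for this batch
      next_batch[t] += num_threads;
      flag = (flag + 1) % 2;                       // POST(STG)
    }
  }
  return order;
}
```

In this model the batches enter the stage in their original order (0, 1, 2, ...), which is the property the flag protocol is meant to preserve for each stage.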
PROG_NAME=exp_batch_sizes

all: serial parallel

serial: $(PROG_NAME)-ser.cpp
	clang++ $(PROG_NAME)-ser.cpp -g -O0 -o $(PROG_NAME)-ser -DLOAD=$(LOAD) \
		-DBATCH=$(BATCH) \
		/opt/intel/vtune_amplifier_xe_2017.2.0.499904/lib64/libittnotify.a -ldl

parallel: $(PROG_NAME)-par.cpp
	clang++ $(PROG_NAME)-par.cpp -g -O0 -o $(PROG_NAME)-par -DLOAD=$(LOAD) \
		-DBATCH=$(BATCH) -fopenmp=libiomp5 \
		/opt/intel/vtune_amplifier_xe_2017.2.0.499904/lib64/libittnotify.a -ldl

clean:
	rm -rf $(PROG_NAME)-par $(PROG_NAME)-ser
Listing E.4: Makefile
#!/bin/bash

# Set precision and format of the Bash time command
export TIMEFORMAT='%3R';

# How many times the program will be executed to take the average
Reps=10;
Batches=(1 10 20 30 40 50 60 70 80 90 100 200 300 400 500 600 700 800 900 1000);
Loads=(1 2 3 4 5 6 7 8 9 10 20 30 40 50 60 70 80 90 100);

# Clean the log file
mv times.log times.log.bkp

printf " " >> times.log;

# Print the table's headers
for BSSize in ${Batches[@]}; do
  printf " %4d" $BSSize >> times.log;
done

printf "\n" >> times.log;

for STSize in ${Loads[@]}; do
  printf "%3d" $STSize >> times.log;
  for BSSize in ${Batches[@]}; do
    # Discard make's standard output; errors still reach the terminal
    make LOAD=$STSize BATCH=$BSSize 2>&1 >> /dev/null

    SerAvg=0;
    ParAvg=0;

    for Rep in $(seq $Reps); do
      # Execute the program and collect the execution time
      Ser=$(time ( ./exp_batch_sizes-ser 2>/dev/null 1>&2 ) 2>&1)
      Par=$(time ( ./exp_batch_sizes-par 2>/dev/null 1>&2 ) 2>&1)
      # Accumulate the total parallel and serial execution time
      SerAvg="$(echo "$SerAvg + $Ser" | bc)";
      ParAvg="$(echo "$ParAvg + $Par" | bc)";
    done

    # Calculate the average of the serial and parallel times.
    # scale=3 is set because bc division truncates to an integer by default.
    SerAvg="$(echo "scale=3; $SerAvg / $Reps" | bc)";
    ParAvg="$(echo "scale=3; $ParAvg / $Reps" | bc)";

    # Calculate speedup
    Speedup="$(echo "scale=2; $SerAvg / $ParAvg" | bc)";

    printf " %1.2f" $Speedup >> times.log;
  done

  printf "\n" >> times.log;
done
Listing E.5: run.sh
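The arithmetic performed by run.sh for each table cell is the mean serial time divided by the mean parallel time over the repetitions. A minimal sketch of that calculation follows; the function names `mean` and `speedup` are hypothetical. Floating-point arithmetic is used so that the fractional part of sub-second timings is kept.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Arithmetic mean of a non-empty list of timings (in seconds).
double mean(const std::vector<double>& samples) {
  assert(!samples.empty());
  return std::accumulate(samples.begin(), samples.end(), 0.0) / samples.size();
}

// Speedup as computed per table cell: mean serial time over mean parallel time.
double speedup(const std::vector<double>& serial_s,
               const std::vector<double>& parallel_s) {
  return mean(serial_s) / mean(parallel_s);
}
```

A value above 1 means the batched parallel version ran faster than the serial one for that combination of batch size and load.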
Appendix F
Effect of the Number of Loop-Independent Dependences - The Sources
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Names of the variables emitted into the generated .inc files */
char names[][3] = { "aa", "bb", "cc", "dd", "ee",
                    "ff", "gg", "hh", "ii", "jj",
                    "kk", "ll", "mm", "nn", "oo",
                    "pp", "qq", "rr", "ss", "tt" };

void serPrologue(int NVars, int NUsos) {
  FILE* fp = fopen("ser-prologue.inc", "w+");

  fprintf(fp, "register int ans = 0;\n");

  for (int varIdx = 0; varIdx < NVars; varIdx++)
    fprintf(fp, "register int %s = 0;\n", names[varIdx]);

  fclose(fp);
}

void serLoad(int NVars, int NUsos) {
  FILE* fp = fopen("ser-load.inc", "w+");

  for (int varIdx = 0; varIdx < NVars; varIdx++)
    fprintf(fp, "USE%02d(%s)\n", NUsos, names[varIdx]);

  fclose(fp);
}

void serEpilogue(int NVars, int NUsos) {
  FILE* fp = fopen("ser-epilogue.inc", "w+");

  fprintf(fp, "printf(\"ans = %%d\\n\", ans);\n");

  fclose(fp);
}

void parPrologue(int NVars, int NUsos, int BSize) {
  FILE* fp = fopen("par-prologue.inc", "w+");

  fprintf(fp, "register int ans = 0;\n");

  /* Each variable becomes a buffer with one slot per iteration of a batch */
  for (int varIdx = 0; varIdx < NVars; varIdx++)
    fprintf(fp, "int %s[%d];\n", names[varIdx], BSize);

  fclose(fp);
}

void parLoad(int NVars, int NUsos, int BSize) {
  FILE* fp = fopen("par-load.inc", "w+");

  for (int varIdx = 0; varIdx < NVars; varIdx++)
    fprintf(fp, "PUSE%02d(%s)\n", NUsos, names[varIdx]);

  fclose(fp);
}

void parEpilogue(int NVars, int NUsos, int BSize) {
  FILE* fp = fopen("par-epilogue.inc", "w+");

  fprintf(fp, "printf(\"ans = %%d\\n\", ans);\n");

  fclose(fp);
}

int main(int argc, char* argv[]) {
  if (argc < 5) {
    /* Argument order matches the parsing below and the calls in run.sh */
    fprintf(stderr, "Usage:\n\t%s Ser/Par NVars NUsos BSize\n", argv[0]);
    exit(1);
  }

  char* Type = argv[1];
  int NVars = atoi(argv[2]);
  int NUsos = atoi(argv[3]);
  int BSize = atoi(argv[4]);

  if (strcmp(Type, "Ser") == 0) {
    serPrologue(NVars, NUsos);
    serLoad(NVars, NUsos);
    serEpilogue(NVars, NUsos);
  }
  else if (strcmp(Type, "Par") == 0) {
    parPrologue(NVars, NUsos, BSize);
    parLoad(NVars, NUsos, BSize);
    parEpilogue(NVars, NUsos, BSize);
  }
  else {
    fprintf(stderr, "Type not recognized: %s\n", Type);
    exit(1);
  }

  return 0;
}
Listing F.1: exp_loop_indep_dependences-gen.cpp
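To make the generator's output concrete, the sketch below builds in memory the text that serLoad or parLoad would write to its .inc file. It is an illustration under stated assumptions: `load_lines` and `kNames` are hypothetical names, and only the first four variable names are included.

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// First few variable names only; the generator in Listing F.1 has twenty.
static const char* kNames[] = {"aa", "bb", "cc", "dd"};

// Hypothetical in-memory variant of serLoad/parLoad: one macro call per
// variable, e.g. "USE03(aa)" for three uses of variable aa.
std::string load_lines(const char* macro_prefix, int nvars, int nuses) {
  assert(nvars >= 0 && nvars <= 4);
  std::string out;
  char line[32];
  for (int v = 0; v < nvars; ++v) {
    std::snprintf(line, sizeof line, "%s%02d(%s)\n", macro_prefix, nuses, kNames[v]);
    out += line;
  }
  return out;
}
```

Calling `load_lines("USE", 2, 3)` produces the two lines `USE03(aa)` and `USE03(bb)`, which the serial program then pulls into each stage with `#include "ser-load.inc"`.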
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "config.h"

int main() {
  #include "ser-prologue.inc"

  for (int i = 0; i < ITERATIONS; i++) {
    /* ***** Begin Stage 1 ******** */
    for (int j = 0; j < LOAD; j++);
    #include "ser-load.inc"
    /* ***** End of Stage 1 ****** */

    /* ***** Begin Stage 2 ******** */
    for (int j = 0; j < LOAD; j++);
    #include "ser-load.inc"
    /* ***** End of Stage 2 ****** */
  }

  #include "ser-epilogue.inc"

  return 0;
}
Listing F.2: exp_loop_indep_dependences-ser.cpp
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include "config.h"

volatile int __bdx_flags[] = {0, 0, 0};

int main() {
  #include "par-prologue.inc"

  #pragma omp parallel for num_threads(2) schedule(static, 1)
  for (int i = 0; i < ITERATIONS; i += BATCH) {
    /* ***** Begin Stage 1 ******** */
    WAIT(0);
    /* __bdx_buf_idx is the position of the current iteration inside the batch buffers */
    for (int __bdx_i = i, __bdx_buf_idx = 0; (__bdx_i < i + BATCH) && __bdx_i < ITERATIONS; __bdx_i++, __bdx_buf_idx++) {
      for (int j = 0; j < LOAD; j++);
      #include "par-load.inc"
    }
    POST(0);
    /* ***** End of Stage 1 ****** */

    /* ***** Begin Stage 2 ******** */
    WAIT(1);
    for (int __bdx_i = i, __bdx_buf_idx = 0; (__bdx_i < i + BATCH) && __bdx_i < ITERATIONS; __bdx_i++, __bdx_buf_idx++) {
      for (int j = 0; j < LOAD; j++);
      #include "par-load.inc"
    }
    POST(1);
    /* ***** End of Stage 2 ****** */
  }

  #include "par-epilogue.inc"

  return 0;
}
Listing F.3: exp_loop_indep_dependences-par.cpp
#define ITERATIONS 1000000

#define LOAD 10
#define BATCH 40

#define WAIT(STG) do { while (__bdx_flags[STG] != omp_get_thread_num()); } while (0)
#define POST(STG) do { __bdx_flags[STG] = (__bdx_flags[STG] + 1) % 2; } while (0)

/* USEnn(var) expands to nn uses of a scalar variable */
#define USE00(var)
#define USE01(var) do { (ans) += (var); } while (0);
#define USE02(var) USE01(var) USE01(var)
#define USE03(var) USE02(var) USE01(var)
#define USE04(var) USE02(var) USE02(var)
#define USE05(var) USE04(var) USE01(var)
#define USE06(var) USE05(var) USE01(var)
#define USE07(var) USE05(var) USE02(var)
#define USE08(var) USE04(var) USE04(var)
#define USE09(var) USE08(var) USE01(var)
#define USE10(var) USE05(var) USE05(var)
#define USE11(var) USE10(var) USE01(var)
#define USE12(var) USE10(var) USE02(var)
#define USE13(var) USE10(var) USE03(var)
#define USE14(var) USE10(var) USE04(var)
#define USE15(var) USE10(var) USE05(var)
#define USE16(var) USE10(var) USE06(var)
#define USE17(var) USE10(var) USE07(var)
#define USE18(var) USE10(var) USE08(var)
#define USE19(var) USE10(var) USE09(var)
#define USE20(var) USE10(var) USE10(var)

/* PUSEnn(var) expands to nn uses of the batch buffer slot var[__bdx_buf_idx] */
#define PUSE00(var)
#define PUSE01(var) do { (ans) += (var[__bdx_buf_idx]); } while (0);
#define PUSE02(var) PUSE01(var) PUSE01(var)
#define PUSE03(var) PUSE02(var) PUSE01(var)
#define PUSE04(var) PUSE02(var) PUSE02(var)
#define PUSE05(var) PUSE04(var) PUSE01(var)
#define PUSE06(var) PUSE05(var) PUSE01(var)
#define PUSE07(var) PUSE05(var) PUSE02(var)
#define PUSE08(var) PUSE04(var) PUSE04(var)
#define PUSE09(var) PUSE08(var) PUSE01(var)
#define PUSE10(var) PUSE05(var) PUSE05(var)
#define PUSE11(var) PUSE10(var) PUSE01(var)
#define PUSE12(var) PUSE10(var) PUSE02(var)
#define PUSE13(var) PUSE10(var) PUSE03(var)
#define PUSE14(var) PUSE10(var) PUSE04(var)
#define PUSE15(var) PUSE10(var) PUSE05(var)
#define PUSE16(var) PUSE10(var) PUSE06(var)
#define PUSE17(var) PUSE10(var) PUSE07(var)
#define PUSE18(var) PUSE10(var) PUSE08(var)
#define PUSE19(var) PUSE10(var) PUSE09(var)
#define PUSE20(var) PUSE10(var) PUSE10(var)
Listing F.4: config.h
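The USEnn and PUSEnn macros of Listing F.4 expand to nn repetitions of an accumulation statement. The sketch below copies only the first few macro definitions and wraps them in two functions to show what the expansion computes; the function names `three_scalar_uses` and `two_buffered_uses` are hypothetical.

```cpp
#include <cassert>

// First few macros from config.h; the trailing ';' is part of the macro,
// so the generated calls are written without one.
#define USE01(var) do { (ans) += (var); } while (0);
#define USE02(var) USE01(var) USE01(var)
#define USE03(var) USE02(var) USE01(var)
#define PUSE01(var) do { (ans) += (var[__bdx_buf_idx]); } while (0);
#define PUSE02(var) PUSE01(var) PUSE01(var)

// Scalar (serial) form: USE03(aa) adds aa to ans three times.
int three_scalar_uses(int aa) {
  int ans = 0;
  USE03(aa)
  return ans;
}

// Buffered (parallel) form: each iteration of the batch reads its own slot.
int two_buffered_uses(const int* aa, int batch) {
  assert(batch >= 0);
  int ans = 0;
  for (int __bdx_buf_idx = 0; __bdx_buf_idx < batch; ++__bdx_buf_idx) {
    PUSE02(aa)
  }
  return ans;
}
```

The only difference between the two families is the operand: the serial macros read a scalar, while the parallel ones index the per-batch buffer with `__bdx_buf_idx`.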
PROG_NAME=exp_loop_indep_dependences

all: serial parallel

serial: $(PROG_NAME)-ser.cpp
	@clang++ $(PROG_NAME)-ser.cpp -g -O0 -E > $(PROG_NAME)-ser.prep
	clang++ $(PROG_NAME)-ser.cpp -g -O0 -o $(PROG_NAME)-ser

parallel: $(PROG_NAME)-par.cpp
	@clang++ $(PROG_NAME)-par.cpp -g -O0 -E > $(PROG_NAME)-par.prep
	clang++ $(PROG_NAME)-par.cpp -g -O0 -o $(PROG_NAME)-par -fopenmp

clean:
	rm -rf $(PROG_NAME)-par $(PROG_NAME)-ser
Listing F.5: Makefile
#!/bin/bash

# Set precision and format of the Bash time command
export TIMEFORMAT='%3R';

# How many times the program will be executed to take the average
Reps=10;

# Fixed batch size passed to the generator (used as the buffer size)
Batch=50;

# Number of variables and number of uses to simulate
NVars=(0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20);
NUses=(0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20);

# Clean the log file
mv times.log times.log.bkp

printf " " >> times.log;

# Print the table's headers
for NVar in ${NVars[@]}; do
  printf " %4d" $NVar >> times.log;
done

printf "\n" >> times.log;

for NUse in ${NUses[@]}; do
  printf "%3d" $NUse >> times.log;
  for NVar in ${NVars[@]}; do
    # Generate the .inc files for this combination
    ./exp_loop_indep_dependences-gen Ser $NVar $NUse 0
    ./exp_loop_indep_dependences-gen Par $NVar $NUse $Batch

    # Discard make's standard output; errors still reach the terminal
    make 2>&1 >> /dev/null

    SerAvg=0;
    ParAvg=0;

    for Rep in $(seq $Reps); do
      # Execute the program and collect the execution time
      Ser=1
      Par=1
      Ser=$(time ( ./exp_loop_indep_dependences-ser 2>/dev/null 1>&2 ) 2>&1)
      Par=$(time ( ./exp_loop_indep_dependences-par 2>/dev/null 1>&2 ) 2>&1)

      # Accumulate the total parallel and serial execution time
      SerAvg="$(echo "$SerAvg + $Ser" | bc)";
      ParAvg="$(echo "$ParAvg + $Par" | bc)";
    done

    # Calculate the average of the serial and parallel times.
    # scale=3 is set because bc division truncates to an integer by default.
    SerAvg="$(echo "scale=3; $SerAvg / $Reps" | bc)";
    ParAvg="$(echo "scale=3; $ParAvg / $Reps" | bc)";

    # Calculate speedup
    Speedup="$(echo "scale=2; $SerAvg / $ParAvg" | bc)";

    printf " %1.2f" $Speedup >> times.log;
  done

  printf "\n" >> times.log;
done
Listing F.6: run.sh
Appendix G
Effect of the Number of Loop-Dependent Dependences - The Sources
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Names of the variables emitted into the generated .inc files */
char names[][3] = { "aa", "bb", "cc", "dd", "ee",
                    "ff", "gg", "hh", "ii", "jj",
                    "kk", "ll", "mm", "nn", "oo",
                    "pp", "qq", "rr", "ss", "tt" };

void serPrologue(int NVars, int NUsos) {
  FILE* fp = fopen("ser-prologue.inc", "w+");

  fprintf(fp, "int ans = 0;\n");

  for (int varIdx = 0; varIdx < NVars; varIdx++)
    fprintf(fp, "int %s = 0;\n", names[varIdx]);

  fclose(fp);
}

/* Even-indexed variables go to stage 1, odd-indexed ones to stage 2.
   NVars is expected to be even (run.sh only passes even values). */
void serLoad(int NVars, int NUsos) {
  FILE* fp1 = fopen("ser-load1.inc", "w+");
  FILE* fp2 = fopen("ser-load2.inc", "w+");

  for (int varIdx = 0; varIdx < NVars; varIdx += 2) {
    fprintf(fp1, "USE%02d(%s)\n", NUsos, names[varIdx]);
    fprintf(fp2, "USE%02d(%s)\n", NUsos, names[varIdx + 1]);
  }

  fclose(fp1);
  fclose(fp2);
}

void serEpilogue(int NVars, int NUsos) {
  FILE* fp = fopen("ser-epilogue.inc", "w+");

  fprintf(fp, "printf(\"ans = %%d\\n\", ans);\n");

  fclose(fp);
}

/* Stage 1: copy each shared variable into a local before the batch */
void parFetch1(int NVars) {
  FILE* fp = fopen("par-fetch1.inc", "w+");

  for (int varIdx = 0; varIdx < NVars; varIdx += 2)
    fprintf(fp, "register int lc_%s = %s;\n", names[varIdx], names[varIdx]);

  fclose(fp);
}

/* Stage 2: same as parFetch1, for the odd-indexed variables */
void parFetch2(int NVars) {
  FILE* fp = fopen("par-fetch2.inc", "w+");

  for (int varIdx = 1; varIdx < NVars; varIdx += 2)
    fprintf(fp, "register int lc_%s = %s;\n", names[varIdx], names[varIdx]);

  fclose(fp);
}

/* Stage 1: write each local copy back to its shared variable after the batch */
void parStore1(int NVars) {
  FILE* fp = fopen("par-store1.inc", "w+");

  for (int varIdx = 0; varIdx < NVars; varIdx += 2)
    fprintf(fp, "%s = lc_%s;\n", names[varIdx], names[varIdx]);

  fclose(fp);
}

/* Stage 2: same as parStore1, for the odd-indexed variables */
void parStore2(int NVars) {
  FILE* fp = fopen("par-store2.inc", "w+");

  for (int varIdx = 1; varIdx < NVars; varIdx += 2)
    fprintf(fp, "%s = lc_%s;\n", names[varIdx], names[varIdx]);

  fclose(fp);
}

void parPrologue(int NVars, int NUsos, int BSize) {
  FILE* fp = fopen("par-prologue.inc", "w+");

  fprintf(fp, "volatile int ans = 0;\n");

  for (int varIdx = 0; varIdx < NVars; varIdx++)
    fprintf(fp, "volatile int %s;\n", names[varIdx]);

  fclose(fp);
}

void parLoad(int NVars, int NUsos, int BSize) {
  FILE* fp1 = fopen("par-load1.inc", "w+");
  FILE* fp2 = fopen("par-load2.inc", "w+");

  for (int varIdx = 0; varIdx < NVars; varIdx += 2) {
    fprintf(fp1, "PUSE%02d(lc_%s)\n", NUsos, names[varIdx]);
    fprintf(fp2, "PUSE%02d(lc_%s)\n", NUsos, names[varIdx + 1]);
  }

  fclose(fp1);
  fclose(fp2);
}

void parEpilogue(int NVars, int NUsos, int BSize) {
  FILE* fp = fopen("par-epilogue.inc", "w+");

  fprintf(fp, "printf(\"ans = %%d\\n\", ans);\n");

  fclose(fp);
}

int main(int argc, char* argv[]) {
  if (argc < 5) {
    /* Argument order matches the parsing below and the calls in run.sh */
    fprintf(stderr, "Usage:\n\t%s Ser/Par NVars NUsos BSize\n", argv[0]);
    exit(1);
  }

  char* Type = argv[1];
  int NVars = atoi(argv[2]);
  int NUsos = atoi(argv[3]);
  int BSize = atoi(argv[4]);

  if (strcmp(Type, "Ser") == 0) {
    serPrologue(NVars, NUsos);
    serLoad(NVars, NUsos);
    serEpilogue(NVars, NUsos);
  }
  else if (strcmp(Type, "Par") == 0) {
    parPrologue(NVars, NUsos, BSize);
    parLoad(NVars, NUsos, BSize);
    parEpilogue(NVars, NUsos, BSize);
    parFetch1(NVars);
    parFetch2(NVars);
    parStore1(NVars);
    parStore2(NVars);
  }
  else {
    fprintf(stderr, "Type not recognized: %s\n", Type);
    exit(1);
  }

  return 0;
}
Listing G.1: exp_loop_depend_dependences-gen.cpp
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include "config.h"

int main() {
  #include "ser-prologue.inc"

  for (int i = 0; i < ITERATIONS; i++) {
    /* ***** Begin Stage 1 ******** */
    for (int j = 0; j < LOAD; j++);
    #include "ser-load1.inc"
    /* ***** End of Stage 1 ****** */

    /* ***** Begin Stage 2 ******** */
    for (int j = 0; j < LOAD; j++);
    #include "ser-load2.inc"
    /* ***** End of Stage 2 ****** */
  }

  #include "ser-epilogue.inc"

  return 0;
}
Listing G.2: exp_loop_depend_dependences-ser.cpp
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include "config.h"

volatile int __bdx_flags[] = {0, 0, 0};

/* Shared (volatile) variables carried across iterations */
#include "par-prologue.inc"

int main() {

  #pragma omp parallel for num_threads(2) schedule(static, 1)
  for (int i = 0; i < ITERATIONS; i += BATCH) {
    /* ***** Begin Stage 1 ******** */
    WAIT(0);
    #include "par-fetch1.inc"
    for (int __bdx_i = i, __bdx_buf_idx = 0; (__bdx_i < i + BATCH) && __bdx_i < ITERATIONS; __bdx_i++, __bdx_buf_idx++) {
      for (int j = 0; j < LOAD; j++);
      #include "par-load1.inc"
    }
    #include "par-store1.inc"
    POST(0);
    /* ***** End of Stage 1 ****** */

    /* ***** Begin Stage 2 ******** */
    WAIT(1);
    #include "par-fetch2.inc"
    for (int __bdx_i = i, __bdx_buf_idx = 0; (__bdx_i < i + BATCH) && __bdx_i < ITERATIONS; __bdx_i++, __bdx_buf_idx++) {
      for (int j = 0; j < LOAD; j++);
      #include "par-load2.inc"
    }
    #include "par-store2.inc"
    POST(1);
    /* ***** End of Stage 2 ****** */
  }

  #include "par-epilogue.inc"

  return 0;
}
Listing G.3: exp_loop_depend_dependences-par.cpp
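Each stage in Listing G.3 follows a fetch, compute, store pattern: the shared variable is copied into a local once per batch, used throughout the batch, and written back before the stage is handed over. The sketch below shows that pattern for a single variable in a single thread. The names `shared_aa` and `run_stage_batch` are hypothetical, and `ans` is a local here for simplicity, whereas the generated code declares it as a volatile global.

```cpp
#include <cassert>

// Stands in for a generated "volatile int aa;" from par-prologue.inc.
volatile int shared_aa = 0;

// One stage of one batch: fetch once, use the local copy, store once.
int run_stage_batch(int batch, int uses_per_iter) {
  assert(batch >= 0 && uses_per_iter >= 0);
  int ans = 0;
  int lc_aa = shared_aa;                 // fetch, as in par-fetch1.inc
  for (int k = 0; k < batch; ++k) {
    for (int u = 0; u < uses_per_iter; ++u) {
      ans += lc_aa;                      // what PUSEnn(lc_aa) expands to
    }
  }
  shared_aa = lc_aa;                     // store, as in par-store1.inc
  return ans;
}
```

With this pattern the shared variable is read and written once per batch rather than once per iteration.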
#define ITERATIONS 1000000

#define LOAD 10
#define BATCH 40

#define WAIT(STG) do { while (__bdx_flags[STG] != omp_get_thread_num()); } while (0)
#define POST(STG) do { __bdx_flags[STG] = (__bdx_flags[STG] + 1) % 2; } while (0)

/* USEnn(var) expands to nn uses of a variable */
#define USE00(var)
#define USE01(var) do { (ans) += (var); } while (0);
#define USE02(var) USE01(var) USE01(var)
#define USE03(var) USE02(var) USE01(var)
#define USE04(var) USE02(var) USE02(var)
#define USE05(var) USE04(var) USE01(var)
#define USE06(var) USE05(var) USE01(var)
#define USE07(var) USE05(var) USE02(var)
#define USE08(var) USE04(var) USE04(var)
#define USE09(var) USE08(var) USE01(var)
#define USE10(var) USE05(var) USE05(var)
#define USE11(var) USE10(var) USE01(var)
#define USE12(var) USE10(var) USE02(var)
#define USE13(var) USE10(var) USE03(var)
#define USE14(var) USE10(var) USE04(var)
#define USE15(var) USE10(var) USE05(var)
#define USE16(var) USE10(var) USE06(var)
#define USE17(var) USE10(var) USE07(var)
#define USE18(var) USE10(var) USE08(var)
#define USE19(var) USE10(var) USE09(var)
#define USE20(var) USE10(var) USE10(var)

/* PUSEnn(var) expands to nn uses of the local copy passed as var */
#define PUSE00(var)
#define PUSE01(var) do { (ans) += (var); } while (0);
#define PUSE02(var) PUSE01(var) PUSE01(var)
#define PUSE03(var) PUSE02(var) PUSE01(var)
#define PUSE04(var) PUSE02(var) PUSE02(var)
#define PUSE05(var) PUSE04(var) PUSE01(var)
#define PUSE06(var) PUSE05(var) PUSE01(var)
#define PUSE07(var) PUSE05(var) PUSE02(var)
#define PUSE08(var) PUSE04(var) PUSE04(var)
#define PUSE09(var) PUSE08(var) PUSE01(var)
#define PUSE10(var) PUSE05(var) PUSE05(var)
#define PUSE11(var) PUSE10(var) PUSE01(var)
#define PUSE12(var) PUSE10(var) PUSE02(var)
#define PUSE13(var) PUSE10(var) PUSE03(var)
#define PUSE14(var) PUSE10(var) PUSE04(var)
#define PUSE15(var) PUSE10(var) PUSE05(var)
#define PUSE16(var) PUSE10(var) PUSE06(var)
#define PUSE17(var) PUSE10(var) PUSE07(var)
#define PUSE18(var) PUSE10(var) PUSE08(var)
#define PUSE19(var) PUSE10(var) PUSE09(var)
#define PUSE20(var) PUSE10(var) PUSE10(var)
Listing G.4: config.h
PROG_NAME=exp_loop_depend_dependences

all: serial parallel

serial: $(PROG_NAME)-ser.cpp
	@clang++ $(PROG_NAME)-ser.cpp -g -O0 -E > $(PROG_NAME)-ser.prep
	clang++ $(PROG_NAME)-ser.cpp -g -O0 -o $(PROG_NAME)-ser

parallel: $(PROG_NAME)-par.cpp
	@clang++ $(PROG_NAME)-par.cpp -g -O0 -E > $(PROG_NAME)-par.prep
	clang++ $(PROG_NAME)-par.cpp -g -O0 -o $(PROG_NAME)-par -fopenmp

clean:
	rm -rf $(PROG_NAME)-par $(PROG_NAME)-ser *.pdf times* *.inc *.prep
Listing G.5: Makefile
#!/bin/bash

# Set precision and format of the Bash time command
export TIMEFORMAT='%3R';

# How many times the program will be executed to take the average
Reps=10;

# Fixed batch size passed to the generator
Batch=50;

# Number of variables and number of uses to simulate
NVars=(2 4 6 8 10 12 14 16 18 20);
NUses=(1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20);

# Clean the log file
mv times.log times.log.bkp

printf " " >> times.log;

# Print the table's headers
for NVar in ${NVars[@]}; do
  printf " %4d" $NVar >> times.log;
done

printf "\n" >> times.log;

for NUse in ${NUses[@]}; do
  printf "%3d" $NUse >> times.log;
  for NVar in ${NVars[@]}; do
    # Generate the .inc files for this combination
    ./exp_loop_depend_dependences-gen Ser $NVar $NUse 0
    ./exp_loop_depend_dependences-gen Par $NVar $NUse $Batch

    # Discard make's standard output; errors still reach the terminal
    make 2>&1 >> /dev/null

    SerAvg=0;
    ParAvg=0;

    for Rep in $(seq $Reps); do
      # Execute the program and collect the execution time
      Ser=1
      Par=1
      Ser=$(time ( ./exp_loop_depend_dependences-ser 2>/dev/null 1>&2 ) 2>&1)
      Par=$(time ( ./exp_loop_depend_dependences-par 2>/dev/null 1>&2 ) 2>&1)

      # Accumulate the total parallel and serial execution time
      SerAvg="$(echo "$SerAvg + $Ser" | bc)";
      ParAvg="$(echo "$ParAvg + $Par" | bc)";

    done

    # Calculate the average of the serial and parallel times.
    # scale=3 is set because bc division truncates to an integer by default.
    SerAvg="$(echo "scale=3; $SerAvg / $Reps" | bc)";
    ParAvg="$(echo "scale=3; $ParAvg / $Reps" | bc)";

    # Calculate speedup
    Speedup="$(echo "scale=2; $SerAvg / $ParAvg" | bc)";

    printf " %1.2f" $Speedup >> times.log;
  done

  printf "\n" >> times.log;
done
Listing G.6: run.sh
