Using dynamic information to find vector parallelism by Evans, Graham C.
© 2016 by Graham Carl Evans
USING DYNAMIC INFORMATION TO FIND VECTOR PARALLELISM
BY
GRAHAM CARL EVANS
DISSERTATION
Submitted in partial fulfillment of the requirements
for the degree of Doctor of Philosophy in Computer Science
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2016
Urbana, Illinois
Doctoral Committee:
Professor David Padua, Chair, Director of Research
Professor Bill Gropp
Professor Wen-Mei Hwu
Doctor Simon David Hammond, Sandia National Laboratories
Abstract
Vectorization is key to performance on modern hardware. Almost all architectures include
some form of vector instructions and the size of the instructions has been growing with newer
designs. To take advantage of the performance that these systems offer, it is imperative that
programs use vector instructions, and yet they do not always do so. Taking advantage of
vector hardware requires special instructions, and since compilers generate them automatically
only in simple cases, programmers must do extra work to use them. This requires programmer
time, and the result is often not portable. We believe that tools are needed to help guide even expert
programmers.
In this work we present the development of Vector Seeker, a tool to investigate vector
parallelism. Our approach is to optimistically speculate on the parallel potential of codes
by instrumenting original code and using that to find independent instances of the same
instruction during the execution. We describe the preliminary work in which we developed
a tool called MemVec, and how the limitations in that approach led to the development of
Vector Seeker. We then describe Vector Seeker and verification testing of the tool on several
benchmarks. Finally, we extend Vector Seeker to handle more production scale codes and
describe our experiences with a large CFD code, PlasComCM.
For Jen and Nora.
Acknowledgments
I would like to start by thanking David Padua, my advisor, and all of my committee
members for their time and support. Though he was not there at the end, I would also like to
thank Seth Abraham, who helped to start this process. Without the revitalization I received
during my internship at Sandia this would have been a much more difficult process, so I want
to thank both the lab and the many friends I made while there.
Finally, I want to thank all of my extended family and friends who both put up with my
travels and time away from home and encouraged me to finish. Their support and faith were
needed for my success.
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
Chapter 2 Related
2.1 Measuring Parallel Potential
2.2 Vector Potential
Chapter 3 Preliminary Model
Chapter 4 Vector Seeker
4.1 Design
4.2 Implementation
4.2.1 Code Instrumentation
4.3 Limitations
Chapter 5 Using Vector Seeker
5.1 Running Vector Seeker
5.2 Interpreting Vector Seeker Results
5.2.1 File Format
5.2.2 Simple Loop
5.2.3 Dependent Loop
5.2.4 Missed Loop
5.2.5 Indirect Loop
5.2.6 Reduction Loop
Chapter 6 Vector Seeker Extensions
6.1 Automation
6.2 Capability Extensions
6.2.1 Block Linear Memory
6.2.2 Dynamic Block Tracing
6.2.3 Threading Support
Chapter 7 Experiments
7.1 PACT and Media Bench II Manual Testing
7.2 Automated
7.2.1 TSVC Loops
7.2.2 Numerical Recipes
7.3 Performance Experiments
7.3.1 Memory and Speed
7.3.2 Threading
7.4 PlasComCM Experiments
7.4.1 Verification of Initial Optimizations
7.4.2 NS BS
Chapter 8 Conclusions
Bibliography
List of Figures
3.1 Memory Model Vector Speedup
3.2 s1113
3.3 Simple Loop
3.4 Simple Loop Speedup Base Model
3.5 Tiled Memory Access
3.6 Simple Loop Speedup Tile Model
3.7 Blackscholes Main Work Loop
4.1 Simple Loop
4.2 Simple Loop Dynamic Dependence Graph
4.3 Simple Loop Dynamic Dependence Graph After Pruning
4.4 Greedy Instrument
4.5 Simple Loop Assembly
4.6 Instrumentation Functions
4.7 Simple Scope Function Example
4.8 Simple Scope Chain Example
5.1 Log File Line Section Format
5.2 Disassembly of Simple Loop
5.3 Mintest Simple Vector Log Section
5.4 Disassembly of Dependent Loop
5.5 Dependent Loop Log Section
5.6 Disassembly of Missed Loop
5.7 Mintest Loop Variable in Vector Memory Log Section
5.8 Disassembly of Indirect Loop
5.9 Indirect Loop Loop Log Section
5.10 Unroll of Indirect Loop Code
5.11 Disassembly of Reduction Loop
5.12 Reduction Loop Log Section
6.1 Switch Statement Body
6.2 Dynamic Basic Block Calls vs Standard Calls
7.1 Acronyms used in Table 7.1
7.2 TSVC Loops
7.3 s122
7.4 Numerical Recipes Loops
7.5 Memory Overhead with Different Vector Seeker Versions
7.6 Time Overhead with Different Vector Seeker Versions
7.7 Time Overhead from Threading Support
List of Tables
3.1 Results of MemVec on TSVC Loops
3.2 Results of MemVec (Write Only) on TSVC Loops
3.3 Results of MemVec on PARSEC
7.1 Results for PACT and Media Bench II Applications
7.2 Block Results on Mantevo Mini-apps
7.3 Vectorization Status of PlasComCM
7.4 Vector Seeker Summary
Chapter 1
Introduction
Microprocessor vector extensions are now key components in most designs, and utilizing them
effectively is key to performance. Vector operations replace several instances of an operation
and execute them in parallel as a single operation. The architecture and the size of the
vector registers determine what operations are supported and how many instructions can be
executed simultaneously. Recently, these registers have been growing in size, reaching 256
bits with the Intel AVX and 512 bits in the Xeon Phi. Unlike most of the other improvements
in processor design, taking advantage of vector extensions requires programs to be changed.
There are three common methods for incorporating vector instructions into programs:
writing in assembly language, using intrinsic functions, and using a vectorizing compiler.
In the case of assembly, the code must be rewritten to take advantage of changes in the
architecture. Programs written using intrinsics are more portable but usually are still tied
to particular architectural parameters, and porting these programs requires significant pro-
grammer time. Unfortunately, vectorizing compilers fall short of the completely automatic
ideal, and often much programmer assistance is needed to get reasonable performance. With
finite budgets, this leaves the question of where to apply programmer time. It is our belief
that, with the best information on where to apply their effort, programmers can in most
cases transform the code with minimal annotations such that vectorizing compilers can do
the vectorization without the need for either intrinsics or assembly. The advantage of this
approach is that it provides the most portability across architectures going forward.
In this thesis we present the development of a tool to help find the vector potential of
existing code. The primary purpose of this tool is to identify opportunities for vectorization
in executed code in order to guide manual optimization efforts.
This tool can also be used by compiler writers in improving auto-vectorization passes.
The intention here is that using this tool they can find potential parallelism that is not
currently exploited and use this information to guide improvements in the auto-vectorizers.
We believe programmers need a tool that has an architecture-independent view of vector-
ization. The choice to build the tool with an architecture-independent structure decouples
the question of vector potential from specific architectural features. This will allow the tool
to be useful even as vector units evolve. The main result of this choice is that the tool cannot
make assumptions as to what instructions can be vectorized, nor can it make assumptions
about what access patterns are supported by the architecture. Thus we propose measuring
the performance of existing programs on an ideal vector machine rather than any particular
machine. This approach was largely inspired by the work of Kumar [17], who in the late
1980s developed a tool to measure the amount of parallelism implicit in scientific Fortran
programs. The key difference between this work and what we are doing is that Kumar was
interested in general parallelism, not SIMD parallelism. This makes the task of approximating
the maximum simpler, since there is no need to align the instructions such that
only multiple instances of the same instruction run at a time.
The ideal vector machine is not one that can be implemented in any hardware but rather
one with an unlimited number of registers, unbounded memory, and unlimited-width vector
units that handle all instructions other than control flow. This machine would be able to
execute any vector operation coming from a conventional loop in one unit of time regardless
of the memory access pattern. We use this ideal machine to measure vector potential and
thereby approximate the maximum potential of the program being analyzed. Constraining the
machine to some fixed architectural choices would reduce vectorization potential and, while
the ideal machine is never reachable, it will allow a comparison between current hardware
and the best possible hardware.
Computing an approximation to the best performance on the ideal vector machine gives us
an approximation to the upper bound of the vector parallelism available to any conventional
program. To do this, we have several choices. First, we could use compiler technology
to translate the program into vector form, but this would have the same limitations that
vectorizing compilers already have. Our choice instead is to use program traces, as Kumar
did, to find general types of parallelism, and seek vector operations in those traces.
The core problem of using traces is that the complete trace of a program is typically
too large to store and too time consuming to generate in full. Thus, rather than saving the trace
in a file and analyzing it, we stream the trace to code that does the analysis on the fly.
This way, we never need to produce the actual trace. In the case of finding the maximum
parallelism, the on-the-fly analysis is straightforward, as discussed by Kumar. The case of
SIMD parallelism is not so easy to analyze since we can only execute operations at the same
time if they are of the same type. Thus we cannot assume that operations are executed at
the earliest possible time as was done in Kumar’s COMET system.
The result of this work is Vector Seeker, a tool for measuring vector potential in conventional
codes. This tool uses an architecture-agnostic approach to model vector potential
and does not require access to source code to run. When the source code is available, Vector
Seeker provides guidance to programmers to optimize existing codes. We describe the techniques
used to implement Vector Seeker and how we differentiate potential vector operations
from loop record keeping operations.
During the work for this thesis there were several stages of tool development. The initial
stage was to experiment with using only memory accesses to model vector parallelism. This
initial tool can provide some insights into the behavior of programs but does not perform well
when confronted by complex loops with function calls. This led to the baseline Vector Seeker,
which was then extended to be able to run on production code. These codes provided challenges
in both runtime and complexity that needed to be addressed to allow Vector Seeker to run on them.
In the end we present the results of Vector Seeker being used on a production-scale code to
help expert programmers vectorize the code.
The thesis proceeds as follows. First, in chapter 2, there is a survey of related work. Then
we cover the development of Vector Seeker, with preliminary models described in chapter 3
and the initial version described in chapter 4. There is then a discussion of how to use Vector
Seeker in practice in chapter 5. This is followed in chapter 6 by the extensions to Vector
Seeker for both performance and ease of use. Finally, chapter 7 presents the experiments
with the different versions of the tool, and chapter 8 offers concluding thoughts. In this way
we cover the whole of Vector Seeker as it developed into a tool to help expert programmers
deal with the challenges of vectorization.
Chapter 2
Related
While there is extensive work covering auto-vectorization and finding vector performance in
compilers, we will not cover it here, since our interest is in finding potential rather than
directly exploiting it. To that end, we will discuss two works in the area of measuring parallel
potential in programs. These works are direct predecessors of our work in that they try to
measure the parallel potential rather than auto-parallelize, and we have drawn heavily on
the techniques presented in them. Finally we will discuss a work that, like ours, attempts
to measure vector potential in programs using tracing.
2.1 Measuring Parallel Potential
In the area of measuring parallel performance, the key work we drew on is that of Kumar [17].
In this work Kumar describes the COMET system, which measures the parallelism found in
Fortran programs as executed on an idealized parallel machine. The key idea in the COMET
system is to measure the amount of parallelism that could be extracted from a given program
with no transformations on this idealized parallel system by monitoring the execution of the
untransformed program. This idea is quite similar to the Vector Seeker system, but we are
searching on a transformed program to find vector parallelism.
The ideal parallel system that COMET uses is quite similar to the ideal vector system
that we imagine. It has an unbounded number of functional units, much like we have an
unbounded vector width. It assumes, as we do, that all changes are seen globally, and thus
there is no need for added synchronization. It assumes that once a value is computed it is
available from that point on for any instruction that will consume it. It uses an execution
model in which all instructions that have their dependencies satisfied will execute in one
clock step. This is similar to our model, in which we assume all vector instructions that
have their dependencies satisfied would be executed within one time step. There is a slight
difference here in that we assume that all instructions that do not depend on a vector location
are executed prior to the start of vector execution. This difference is where we change
from executing on an untransformed program to executing on a program that has been
transformed. From this execution, COMET produces a histogram of the parallel execution
that is similar to the results produced by Vector Seeker.
The COMET system is built as a source-to-source translation on Fortran that produces
an instrumented program from the program that is to be examined. This translation will
then track, for each instruction, when in the idealized execution the instruction can be
executed. This is done using a shadow memory location for every memory location in the
original program. The shadow memory location contains the time, as defined by number of
predecessor instructions, at which the location was computed on the idealized machine. Since
the original sequential program was correct, there does not need to be more than one shadow
location for each memory location, even though on the ideal machine, values are available
for the whole of the execution. The system also maintains a set of CVARs, control variables,
that store when control flow statements are known. With the combination of these two sets
of variables, the time that a statement can be executed on the ideal machine is calculated
by simply taking the max of all variables that the statement depends on. Finally when a
statement executes, the count of statements executed at that time is incremented by one in
the histogram. At the end of execution, the histogram is given as the measurement of the
parallel execution potential of the program.
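The shadow-memory bookkeeping described above can be illustrated with a minimal C sketch. It is not COMET's implementation: the Op record, the fixed-size arrays, and the function names are hypothetical, control variables (CVARs) are omitted, and every statement is assumed to take one time step.

```c
#include <assert.h>
#include <string.h>

#define NUM_LOCS 16   /* hypothetical size of the traced memory */
#define MAX_TIME 64   /* hypothetical bound on the ideal schedule length */

/* One executed statement: dst = f(src1, src2); -1 means "no operand". */
typedef struct { int dst, src1, src2; } Op;

int shadow[NUM_LOCS]; /* step at which each location's value is ready */
int hist[MAX_TIME];   /* hist[t] = statements executable at step t */

void reset(void) {
    memset(shadow, 0, sizeof shadow);
    memset(hist, 0, sizeof hist);
}

/* Feed one statement of the sequential trace; returns its ideal time step. */
int record(Op op) {
    int ready = 0;
    assert(op.dst >= 0 && op.dst < NUM_LOCS);
    assert(op.src1 < NUM_LOCS && op.src2 < NUM_LOCS);
    /* The statement can run one step after its latest operand is ready. */
    if (op.src1 >= 0 && shadow[op.src1] > ready) ready = shadow[op.src1];
    if (op.src2 >= 0 && shadow[op.src2] > ready) ready = shadow[op.src2];
    assert(ready + 1 < MAX_TIME);
    shadow[op.dst] = ready + 1;   /* one shadow slot per location suffices */
    hist[ready + 1]++;
    return ready + 1;
}

/* Trace of: for i in 0..3: a[i] = b[i] + c[i];  then  s = s + a[i].
   Locations: b = 0..3, c = 4..7, a = 8..11, s = 12. */
int run_example(void) {
    int last = 0;
    reset();
    for (int i = 0; i < 4; i++) record((Op){8 + i, i, 4 + i});
    for (int i = 0; i < 4; i++) last = record((Op){12, 12, 8 + i});
    return last;  /* step of the final statement in the ideal schedule */
}
```

For this example trace, the four independent additions all land in step 1, while the reduction into s forms a chain occupying steps 2 through 5, so run_example() returns 5 and hist[1] is 4.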
From this work we take the ideas of using the sequential execution to measure the parallel
execution on an idealized machine, as well as the specific technique of using shadow memory
to track the dependence depth at which a value is computed. We diverge from this model in
that we imagine a transformed program with the serial portion executed before the vector
portion. Additionally, rather than considering general parallelism, we restrict our work to
parallelism on the same instruction.
The MaxPar system by Chen [13] is another system that tries to measure parallel poten-
tial. The MaxPar system is based on the work by Kumar and is similar in that it is built
using shadow memory and tracing of a sequential program to measure the parallelism that is
available. The key difference between COMET and MaxPar is that rather than measuring on
a fully idealized system, MaxPar attempts to measure the parallel performance on realizable
systems.
This difference in approach is accomplished by adding systems that can constrain the
execution from the idealized system. There are three key areas in which MaxPar can be
configured to constrain the execution so as to represent a realistic system. These are memory,
processors, and synchronization. In the case of memory, while the default is for the system to
assume that once a value is computed it is available for the rest of the program, the system
can be made to assume that the location cannot be reused until all the instructions that
depend on it have completed. In the case of processors, the system allows for scheduling
such that there is a limit to the number of parallel instructions that can be executed at
once. This is implemented with various scheduling systems to allow comparison of the
benefits of different schedules. Finally, the system allows for modeling the overhead from
synchronizations that may be needed in the parallel execution.
With these extensions, MaxPar provides an interesting system for studying how realistic
systems can directly exploit the parallelism in current programs. It also allows for studying
the advantages of different optimizations. In our work we have held closer to Kumar, though
the idea of restricting the parallel execution is clearly shared between us and MaxPar.
2.2 Vector Potential
Holewinski et al. [12] introduced a trace-based analysis that characterizes the vector potential
of codes. This work is similar to ours, but there are some key differences. In their work,
much like with MaxPar, they focus on finding potential that is exploitable on real hardware.
This differs from our more optimistic approach.
The system that they built is structured around a source-to-source system that will in-
strument a program to produce a trace that will later be analyzed. They use various front
ends to LLVM [19] to produce LLVM IR that they then instrument. With this instru-
mentation, they produce a trace from which they can build what they call the Dynamic
Dependence Graph, DDG. This graph will contain a node for each dynamic instance of an
instruction and an edge between each pair of nodes that has a flow dependence. They also
extract from the trace analysis the memory access patterns such that they can tell if accesses
are contiguous or of some fixed stride.
With the results of the trace, they then consider each instruction that operates on floating-
point data and check if there is a dependence loop such that the instruction cannot be
vectorized. This limitation of only examining floating-point instructions is imposed because
they believe that on current systems this is where vectorization will be profitable. This
is then augmented with the data on memory access patterns and if the pattern meets the
criteria required, contiguous or fixed stride, they report that the instruction is vectorizable.
This system is quite different from the parallel systems described above since, rather than
summarizing the results during the tracing step, they export the full trace of the region of
interest. This appears to have been chosen in part to deal with a problem they describe:
if there is a chain of dependencies such that every location in an array is at a different
depth, subsequent accesses to that array will be tainted by those dependencies despite
the later accesses being independent. Since they were trying to produce a conservative
estimate of vector parallelism, this could be a limitation on what they could find. In our
experience we did not find this pattern to be a significant problem.
The differences in our two approaches can be grouped into two categories: engineering
and philosophy. In terms of engineering, the key difference is that we do not generate a trace
at all. This reduces the total overhead and allows us to generate profile information at
the same time as we complete the vector analysis. The second engineering difference is that
since we instrument the binary directly we can measure the vector potential in programs for
which we do not have all the code, such as programs with calls to libraries.
The philosophical differences start in the way that the two approaches think about candidate
instructions. We partition instructions into candidates and non-candidates based on
the memory locations they depend on, while Holewinski et al. consider only the floating-point
operations. This allows us to think about vectorizing loops that are not floating point, as is
the case in most of the Media Bench II loops. The other key philosophical difference is that
we do not restrict vectorization to regular memory access patterns as they do. This allows
us to look for vector potential on architectures with scatter and gather instructions.
Chapter 3
Preliminary Model
Our first exploration into vectorization was an attempt to find a loose upper bound on vector
potential. The goal was not to find locations where vectorization was possible but to find in
a broad sense the maximum that could be achieved on any system. This would allow further
research on explicit vector opportunities to focus on cases where the potential was high.
Consider that a program can be seen as a sequence of reads and writes to memory
locations. To preserve the behavior of the original program, it is reasonable to expect that
the order of reads and writes to any single location must be preserved. Using this limit, we
postulated that even if all memory operations that are not to the same location can be freely
reordered, the number of steps required is at least the maximum number of operations done to
any single location. Given this idea, we proposed the upper bound on speedup from vectorization
shown in Figure 3.1, where $M_{serial}$ is the total number of memory operations and $M_{vector}$
is the maximum number of operations on any single location. This supposes that the ideal vector
machine would take at least as many steps as it takes to cover the location with the most
memory accesses and that the base program would take at least as much time as the total number
of memory accesses in the original program.
\[ S_{vec} = \frac{M_{serial}}{M_{vector}} \]
Figure 3.1: Memory Model Vector Speedup
The idea was that, as a model for a single program, this would overestimate the achievable
vector performance, but that as a tool for comparing two programs it would provide a useful
metric of relative potential performance.
To test this idea, a simple tool, MemVec, was constructed using Intel PIN [20]. It
instruments all the memory accesses in the program and keeps, in a map keyed by memory location,
a count of the number of reads and writes to that location. MemVec can then be run on
any given program to produce our vector speedup metric for that program on a given input.
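As a rough illustration of the metric in Figure 3.1, the following C sketch counts accesses per location and divides the total by the busiest location's count. It is a standalone simulation, not the PIN-based MemVec tool: the MemCounts type, the fixed ADDR_SPACE, and the back-to-back array layout are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define ADDR_SPACE 1024  /* hypothetical: addresses are small element indices */

typedef struct {
    unsigned long count[ADDR_SPACE]; /* accesses seen per location */
    unsigned long total;             /* M_serial: all memory operations */
} MemCounts;

/* Record one read or write to the given location. */
void mem_access(MemCounts *m, size_t addr) {
    assert(addr < ADDR_SPACE);
    m->count[addr]++;
    m->total++;
}

/* S_vec = M_serial / M_vector, where M_vector is the busiest location. */
double mem_speedup(const MemCounts *m) {
    unsigned long busiest = 0;
    for (size_t a = 0; a < ADDR_SPACE; a++)
        if (m->count[a] > busiest) busiest = m->count[a];
    return busiest ? (double)m->total / (double)busiest : 0.0;
}

/* Accesses made by: for (i = 0; i < n; i++) A[i] = B[i] + C[i];
   with A at 0..n-1, B at n..2n-1, and C at 2n..3n-1. */
double simple_loop_speedup(size_t n) {
    MemCounts m = {{0}, 0};
    assert(3 * n <= ADDR_SPACE);
    for (size_t i = 0; i < n; i++) {
        mem_access(&m, n + i);      /* read B[i]  */
        mem_access(&m, 2 * n + i);  /* read C[i]  */
        mem_access(&m, i);          /* write A[i] */
    }
    return mem_speedup(&m);
}
```

With every element touched exactly once, simple_loop_speedup(100) evaluates to 300, that is, 3n operations over a busiest-location count of 1.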
MemVec was run on several small loops from the Test Suite for Vectorizing Compilers
(TSVC) assembled by Callahan, Dongarra and Levine [7]. These loops were collected to test
the effectiveness of automatic vectorizing compilers, with each loop targeting specific
features of those compilers, and were originally written in Fortran. They
have since been ported to other languages, and in our case we used the C versions of these
loops. These C versions were then extracted from the full benchmark such that each run
consisted of only a single loop with a single iteration. This was done so that the results of
each loop could be examined with as little noise as possible.
The results of running MemVec on some of the TSVC loops are shown in Table 3.1 as
the basic MemVec speedup. After running the loops with the default one-dimensional array
size, they were run again with arrays that were ten times larger than the default. This does
not apply in the case of loop s1232, since it operates on a matrix rather than a vector and
we did not change the matrix size of the benchmark. The results of this test show that, as
expected, these loops have significant vector potential.
There are two issues that showed up in these tests. First, in the case of loop s1113 we
see very little vector potential when in fact there is quite a bit. Second, there are issues with
comparing the results depending on array size.
for (int i = 0; i < LEN; i++) {
    a[i] = a[LEN / 2] + b[i];
}
Figure 3.2: s1113
Consider the code for loop s1113 in Figure 3.2. This loop is vectorizable by splitting the
loop into two loops, one with the iterations before LEN/2 and another with the iterations
following. In the case of the basic MemVec speedup, the issue is that every iteration of the
loop reads a[LEN/2], which makes the number of accesses to this location equal to the
number of iterations in the loop and thus equal to the length of the array, LEN. This
prevents the simple heuristic from predicting the expected speedup in this case. In
fact, the speedup found is entirely from the initialization loop.
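The split can be sketched in C as follows. Placing iteration LEN/2 itself in the first loop is a choice made here (it reads the old value of a[LEN/2] before overwriting it); the small LEN and the helper s1113_versions_agree are hypothetical and exist only to check that both versions compute the same result.

```c
#include <assert.h>

#define LEN 8  /* hypothetical small length for illustration */

/* Original s1113: a[LEN/2] is overwritten when i reaches LEN/2. */
void s1113_original(float *a, const float *b) {
    for (int i = 0; i < LEN; i++)
        a[i] = a[LEN / 2] + b[i];
}

/* Split version: each loop adds a loop-invariant scalar, so neither
   loop carries a dependence between its iterations. */
void s1113_split(float *a, const float *b) {
    float before = a[LEN / 2];          /* value seen by iterations 0..LEN/2 */
    for (int i = 0; i <= LEN / 2; i++)
        a[i] = before + b[i];
    float after = a[LEN / 2];           /* updated value seen by the rest */
    for (int i = LEN / 2 + 1; i < LEN; i++)
        a[i] = after + b[i];
}

/* Returns 1 if both versions produce identical arrays on a sample input. */
int s1113_versions_agree(void) {
    float a1[LEN], a2[LEN], b[LEN];
    for (int i = 0; i < LEN; i++) {
        a1[i] = a2[i] = (float)i;
        b[i] = (float)(10 * i);
    }
    s1113_original(a1, b);
    s1113_split(a2, b);
    for (int i = 0; i < LEN; i++)
        if (a1[i] != a2[i]) return 0;
    return 1;
}
```

Once the read of a[LEN/2] is hoisted into a scalar, each of the two loops has the same shape as a plain element-wise addition.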
This problem can be addressed by relaxing the requirement that all reads to a single
location must happen in order. This is a reasonable change: while the original
program may read the location several times, when the value does not change it
could be reused rather than read from memory again. To model this change, we remove the
count of reads from our instrumentation and simply count the number of writes in all cases.
The results of these tests can be found in Table 3.2. This produces much larger numbers than
the original MemVec in all cases but, more importantly, removes the anomaly seen in loop
s1113.
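The effect of dropping reads can be checked on the access pattern of s1113 with a small C sketch. This is an illustration rather than MemVec itself: the Counts type and function names are hypothetical, the initialization loops are ignored, and it assumes that in the write-only variant both the total and the per-location maximum count writes only.

```c
#include <assert.h>

#define LEN 64  /* hypothetical array length */

/* Per-location counters for a[0..LEN-1] followed by b[0..LEN-1]. */
typedef struct {
    unsigned reads[2 * LEN];
    unsigned writes[2 * LEN];
} Counts;

/* Replay the memory accesses of: a[i] = a[LEN/2] + b[i]; */
void trace_s1113(Counts *c) {
    for (int i = 0; i < LEN; i++) {
        c->reads[LEN / 2]++;   /* read a[LEN/2] */
        c->reads[LEN + i]++;   /* read b[i]     */
        c->writes[i]++;        /* write a[i]    */
    }
}

/* Speedup metric: total counted operations / count at the busiest location.
   count_reads = 1 gives the basic model, 0 the write-only model. */
double speedup(const Counts *c, int count_reads) {
    unsigned total = 0, busiest = 0;
    for (int loc = 0; loc < 2 * LEN; loc++) {
        unsigned n = c->writes[loc] + (count_reads ? c->reads[loc] : 0);
        total += n;
        if (n > busiest) busiest = n;
    }
    return busiest ? (double)total / busiest : 0.0;
}

/* Convenience wrapper: trace s1113 once and evaluate the chosen model. */
double s1113_speedup(int count_reads) {
    Counts c = {{0}, {0}};
    trace_s1113(&c);
    return speedup(&c, count_reads);
}
```

For LEN = 64 the basic model gives 192/65, just under 3, because a[LEN/2] is read in every iteration, while the write-only model gives 64/1 = 64.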
The second issue has to do with the impact of array size on speedup. Consider the very
simple loop in Figure 3.3. This loop is clearly vectorizable and, when run through MemVec,
will show a speedup. Assuming that i is a register variable and that this is a simple
load-store architecture, this speedup is given by the formula in Figure 3.4. Since N in the
formula is the value of N from the code, the speedup is directly proportional to the size of
the arrays. This follows the model correctly but makes it hard to compare the results of one
loop or program to another, since in many cases the difference in reported speedup will be
based more on input size than on underlying vector potential.
for (i = 0; i < N; i++)
    A[i] = B[i] + C[i];
Figure 3.3: Simple Loop
\[ S_{simple} = \frac{3N}{1} \]
Figure 3.4: Simple Loop Speedup Base Model
Figure 3.5: Tiled Memory Access
If you consider the basic memory model as simply measuring the size of the bounding
rectangle produced by the execution, the tiled memory model measures the minimum number
of tiles needed to cover the memory accesses. This idea can be seen visually in Figure 3.5.
Here we pick some maximum vector width, which is 5 in the visual example, and tile the
execution with the minimum number of tiles; that number is Mvector for this version of the
memory model. To give a concrete example, consider again the simple code in Figure 3.3.
Assuming that each array is aligned with the tiling, we would compute speedup as shown in
Figure 3.6. Here N is again the loop bound and k is the number of array accesses that fit in
one tile width of memory. This case is straightforward since the height of each tile is one.
If there were multiple accesses
                Standard Vector               Large Vector
Loop ID     Basic MemVec  Tiled MemVec  Basic MemVec  Tiled MemVec
s1113               6.55          4.51          5.16          4.46
s171              102.06         13.53        803.14         27.43
s211              211.11         19.27       1893.69         29.89
s1213             179.95         18.03       1582.10         29.50
s1232             136.02         15.11             -             -
s243              211.11         19.27       1893.69         29.89
s2251             195.54         18.68       1737.90         29.71
s112              102.06         13.53        803.14         27.43
Table 3.1: Results of MemVec on TSVC Loops
to a single location in a particular loop, the height of the stack of tiles covering that
location would be used as the contribution of that stack to Mvector. Here you can see that
we are controlling for vector size: rather than growing with the size of the array, the
speedup will grow with the size of the tile until it reaches the size of the array. This
model can then be run with different tile sizes to measure the potential performance from
different vector widths.
\[ S_{\mathrm{tile}} = \frac{3N}{3(N/k)} \]
Figure 3.6: Simple Loop Speedup Tile Model
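The two formulas can be checked with a small sketch. It assumes that speedup is the total number of counted accesses divided by Mvector, where Mvector is the tallest per-location count in the basic model and the sum of the tile stack heights in the tiled model, as Figures 3.4 and 3.6 suggest. The names AccessCounts, simple_loop_counts, basic_speedup, and tiled_speedup are hypothetical, and addresses are measured in array elements.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>

// Element address -> number of counted accesses.
using AccessCounts = std::map<std::size_t, std::size_t>;

// Access counts for the loop in Figure 3.3, with A, B, and C laid out back to back.
AccessCounts simple_loop_counts(std::size_t n) {
    AccessCounts hits;
    for (std::size_t i = 0; i < n; ++i) {
        ++hits[0 * n + i];  // store to A[i]
        ++hits[1 * n + i];  // load of B[i]
        ++hits[2 * n + i];  // load of C[i]
    }
    return hits;
}

// Basic model: total accesses divided by the tallest per-location count.
double basic_speedup(const AccessCounts& hits) {
    assert(!hits.empty());
    std::size_t total = 0, tallest = 0;
    for (const auto& entry : hits) {
        total += entry.second;
        tallest = std::max(tallest, entry.second);
    }
    return static_cast<double>(total) / static_cast<double>(tallest);
}

// Tiled model: each tile column of k elements contributes the height of its
// tallest location to Mvector.
double tiled_speedup(const AccessCounts& hits, std::size_t k) {
    assert(k > 0 && !hits.empty());
    std::map<std::size_t, std::size_t> stack_height;  // tile column -> height
    std::size_t total = 0;
    for (const auto& entry : hits) {
        total += entry.second;
        std::size_t& height = stack_height[entry.first / k];
        height = std::max(height, entry.second);
    }
    std::size_t m_vector = 0;
    for (const auto& column : stack_height) m_vector += column.second;
    return static_cast<double>(total) / static_cast<double>(m_vector);
}
```

For N = 1024 and k = 32 the basic model reports 3N = 3072 while the tiled model reports k = 32, independent of N as long as the arrays are aligned with the tiling.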
MemVec was modified to handle this tiled version and tests were again run against the
TSVC loops. The tile size chosen here was 256 bytes. The results of these tests are shown
in Table 3.1 as Tiled MemVec. These results show more stability with the change in vector
size. The difference that persists here is due to the changing fraction of the program that is
actually in the benchmark loop rather than in various startup code.
To further test MemVec we moved to the larger PARSEC [4] benchmarks. This benchmark
suite was designed for modern multiprocessors and is known to have opportunities for
both general and vector parallelism. We ran four of the benchmarks: blackscholes, bodytrack,
freqmine, and swaptions. All four of these applications have a data parallel structure and
                Standard Vector               Large Vector
Loop ID     Basic MemVec  Tiled MemVec  Basic MemVec  Tiled MemVec
s1113             578.97         20.68       5078.97         30.12
s171              578.97         20.68       5078.97         30.12
s211             1245.64         25.51      11745.63         31.16
s1213            1078.96         24.73      10078.96         31.02
s1232             932.97         23.16             -             -
s243             1245.64         25.51      11745.64         31.16
s2251            1245.65         25.51      11745.65         31.16
s112              578.96         20.68       5078.96         30.12
Table 3.2: Results of MemVec (Write Only) on TSVC Loops
Program         Basic MemVec  Tiled MemVec
blackscholes            9.48          3.35
bodytrack               9.39          4.62
freqmine               59.51          3.96
swaptions              20.42          4.17
Table 3.3: Results of MemVec on PARSEC
have some level of parallelism. The results of these tests are in Table 3.3. The results from
bodytrack, freqmine, and swaptions are reasonable based on the structure of the programs.
In the case of the blackscholes benchmark the results were nothing like what was expected.
MemVec greatly underestimated the amount of vector parallelism. This result was a large
surprise and demonstrated a significant shortcoming in this memory-only approach, since
the blackscholes benchmark is known to have very good vector potential even if it is not
automatically vectorized in most cases.
To explain these results we need to look at what happens in the blackscholes benchmark.
Consider the main work loop from the blackscholes benchmark in Figure 3.7. This loop calls
BlkSchlsEqEuroNoDiv for each element of the input. The function BlkSchlsEqEuroNoDiv
is pure, with no side effects, and if it is inlined this loop is clearly vectorizable. However,
due to the size of the loop, autovectorizers may not try even if the function is inlined. The
problem for our technique is that there are enough arguments that they are not all passed in registers
and thus the whole body of the function works on locations on the stack. Since the function
is called repeatedly from a simple loop call site, these stack locations are reused on each call.
This produces locations with at least as many accesses as there are iterations in the work
loop and thus MemVec predicts no speedup for that loop. In this case, since the accesses
are writes, removing the reads does not resolve the problem.
for (i=start; i<end; i++) {
    price = BlkSchlsEqEuroNoDiv( sptprice[i], strike[i],
                                 rate[i], volatility[i], otime[i],
                                 otype[i], 0);
    prices[i] = price;
}
Figure 3.7: Blackscholes Main Work Loop
All versions of MemVec provide some guidance to vector potential from a particular
execution, but can miss potential in some key ways. First, the use and reuse of memory
locations as temporary storage introduces dependencies that are not part of the abstract
program but simply part of the implementation. This shows up clearly in the case of the
blackscholes benchmark but could just as easily show up if a loop index were stored on the
stack rather than in a register. This problem could be addressed by limiting the locations
in memory that are considered to only those that are part of the core program rather
than indices and register spills. The second problem with the memory-only approach is that
it fails to restrict parallelism in cases where there are dependencies beyond the scope of the
order of operations on a single memory location. This causes the upper bound expressed to be
overly optimistic. The final issue is that a model based only on memory accesses does not
differentiate between two loops with the same number of memory references but different
numbers of operations. This will miss the advantage of loops with a larger work-to-memory
ratio, which tend to benefit more from vectorization since they are more likely to be
computation bound.
While MemVec is useful for understanding some elements of program performance, it clearly
is not enough to make specific predictions on vectorization. On the other hand, the technique
of using tracing still has potential and is a key component of the follow-on tools developed
to explore vectorization.
Chapter 4
Vector Seeker
Having learned about finding whole program vectorization estimates from our experiences
with MemVec, we set out to build a tool that would take those lessons and apply them to
finding specific locations that could be vectorized. Even with this new goal, we still focus
on finding upper bounds of vector potential rather than explicit transformations. The key
change from before is that rather than provide an abstract estimate for the whole program,
we want a tool that finds the specific operations or program locations that could be
vectorized. Besides that change in focus, the new tool needs to avoid the problems encountered
in MemVec, which missed vector opportunities due to functions masking loops and reads
enforcing unneeded orderings. To do this we focus on dependence rather than simply counting
memory accesses to determine suitability for vectorization. This chapter is divided into
three sections: first a high level description of Vector Seeker, followed by a description of
the concrete implementation of Vector Seeker and code instrumentation designed to work with
it, and finally a description of some key limitations of the tool.
for(i = 0; i < N; i++)
    A[i] = B[i] + C[i];
Figure 4.1: Simple Loop
4.1 Design
Consider that two instructions can be executed at the same time whenever they are not
related by a dependence. We say that there is a dependence from instruction A to instruction
B if A writes a value to a memory location that is later read by B. In the literature, this
type of dependence belongs to a class known as flow dependence [16] or true dependence [14].
There are other classes of dependences known as memory-related dependences, such as
antidependence (write after read), output dependence (write after write), and input dependence
(read after read). However, we ignore these to approximate the upper bound of what is
vectorizable. This is reasonable since memory-related dependences can often be removed by
program transformations. To determine the absence of dependences, Vector Seeker analyzes the
dynamic dependence graph, which is a graph with instruction executions as nodes and edges
pointing from the instruction that writes a value to a memory location to every instruction
that reads that value. As described above, every instruction at the same depth in this graph
can be executed at the same time. Following Kumar [17], described above, one can associate a
depth of zero with each node with no incoming dependences. For all other nodes the depth is
the maximum length over all paths originating at nodes with depth zero. In Kumar’s work, two
instructions are assumed to be executable in parallel with each other if they are at the same
depth. In the case of vector parallelism there is another constraint that must be met:
each instruction must be of the same type. This extra constraint would drastically limit the
parallelism found if the ideas above were implemented with no changes.
Consider the very simple example code fragment in Figure 4.1. This code is obviously
vectorizable since each iteration of the loop is independent of the previous iterations. The
dynamic dependence graph of the first two iterations of the loop is shown in Figure 4.2. This
graph shows that there is no dependence chain connecting the two instances of the addition.
Yet this does not work in our case, even for this simple example, since the first instance of
the addition takes place at depth two and the second instance takes place at depth three.
Thus the unmodified depth model is insufficient to find all vector parallelism.
To resolve this problem, Vector Seeker does not work on the full graph but rather on
a graph pruned by removing some nodes from the graph. The nodes that are removed are
conceptually all moved to before the start of the program — that is, moved to depth of
negative one. This represents the moving of index calculations from once each iteration to
once each vector width. This change is accomplished by partitioning memory into locations
that are assumed to contain “vectors” and locations that are not. Then, every node without
a predecessor that is a load from a location within a vector is removed from the graph.
If in the previous example we assume that i and N will not be used as vector locations,
this produces the graph in Figure 4.3. Now the depth in the graph and the identity of the
operands can be used to determine that the loads from B and C, the store to A, and the
additions can be executed as vector instructions.
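The effect of pruning on the depth of the additions can be sketched as follows. This is only a model of the two graphs for the loop in Figure 4.1, not of Vector Seeker itself; the function name add_depths is hypothetical, and the depths assume that each load depends on the index and each addition depends on its loads.

```cpp
#include <cassert>
#include <vector>

// Depth of the addition B[i] + C[i] in each iteration of the simple loop.
// Without pruning, the index chain i = 0, i++, i++, ... pushes every
// iteration one level deeper. With pruning, index nodes sit at depth -1.
std::vector<int> add_depths(int iterations, bool prune_index_nodes) {
    assert(iterations >= 0);
    std::vector<int> depths;
    int index_depth = prune_index_nodes ? -1 : 0;  // depth of the node producing i
    for (int it = 0; it < iterations; ++it) {
        int load_depth = index_depth + 1;  // LD B[i] and LD C[i] depend only on i
        int add_depth = load_depth + 1;    // the addition depends on both loads
        depths.push_back(add_depth);
        if (!prune_index_nodes) {
            index_depth += 1;              // i++ extends the index chain
        }
    }
    return depths;
}
```

Unpruned, the additions of the first three iterations land at depths 2, 3, and 4; pruned, they all land at depth 1 and can be grouped into one vector operation.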
The question remains of how to determine what is a vector location. We considered
using programmer annotations and array detection but settled on treating only dynamic
allocations as vector allocations. The biggest advantage of this method is that it requires
the least work from the user of the tool. We also allow the user, if needed, to annotate the
code and explicitly select what memory to consider. This can be used in cases where the base
system performs poorly. The primary idea is that scalar variables that we want to prune
out from the dynamic dependence graph will typically be on the stack and not allocated
explicitly in the program. We also find that the most important memory is dynamically
sized and therefore dynamically allocated.
There are two key problems with using dynamic allocations for marking vector memory.
[Figure: graph nodes, first iteration: i = 0; LD B[i]; LD C[i]; B[i] + C[i]; ST A[i]; i++. Second iteration: LD B[i]; LD C[i]; B[i] + C[i]; ST A[i]; ...]
Figure 4.2: Simple Loop Dynamic Dependence Graph
[Figure: graph nodes, first iteration: LD B[i]; LD C[i]; B[i] + C[i]; ST A[i]. Second iteration: LD B[i]; LD C[i]; B[i] + C[i]; ST A[i]; ...]
Figure 4.3: Simple Loop Dynamic Dependence Graph After Pruning
The first is that in some cases this choice will miss opportunities for vectorization. This can
happen with static buffers, such as when a buffer is processed as it is read in from a file. This
can easily be worked around by annotating the buffers. The second and more problematic
issue is when indices, or at least offsets, are stored in dynamic memory and used to initialize
loop index variables. In some cases this issue can be avoided by simply limiting the scope of
what is traced, but in the worst cases the automatic marking of dynamic memory must be
disabled and the user must annotate all memory of interest.
Since storing even this pruned graph would be impractical, Vector Seeker uses the number
of times each static instruction could be executed at each depth in the graph to identify vector
operations. For the simple example in Figure 4.3 we would find that the LD, +, and ST nodes
could all be executed N times at depths 0, 1, and 2 respectively.
To compute the depth of a node corresponding to an instruction I in the DAG, Vector
Seeker only needs the depth of the instructions that computed the source operands. To
accomplish this, Vector Seeker maintains a Shadow Memory or SM as a global map from
memory addresses, including register ids, to depths. This map is initialized to ⊥. When
Vector Seeker sees a vector allocation, by hooking operating system calls, it updates the
shadow memory values of all allocated locations to the depth of the maximum of all of the
arguments to the allocation operation. This value tells us when the vector was allocated and
also, since it no longer has the value ⊥, that this location belongs to a vector. In addition,
every time a memory location or a register is assigned a value, the depth of the instruction
making the assignment is stored in the SM entry of the memory location or register. To compute
the depth of an instruction, all that is needed is to find the depth of the instructions that
computed the input operands. This value is in the SM.
The final result of Vector Seeker will be a list of instructions that might be able to be
vectorized. For each of these instructions, which are identified by their instruction pointer,
there will be a list of depths at which the instruction was seen as well as the number
T ← max(shadow values of the instruction’s input operands)
I ← instruction address
if I is a vector allocation then
    for all addresses A in allocation do
        SM[A] ← T + 1
    end for
else if I is a vector deallocation then
    for all addresses A in vector deallocation do
        SM[A] ← ⊥
    end for
else if I is a simple load or store with address A then
    SM[A] ← T
else if T ≠ ⊥ then {instruction is part of a vector}
    A ← destination address
    SM[A] ← T + 1
    RV[I][T + 1]++
end if
Figure 4.4: Greedy Instrument
of times the instruction was seen at each of those depths. This is the raw result from
running Vector Seeker and will be stored in the Result Vector or RV. The Result Vector
is itself a vector, indexed by the address of the static instruction. Each element is a vector
indexed by depth, so that RV[i][T] is the number of instances of static instruction i that
were executed at depth T. This is stored using a map of maps to take advantage of
the sparsity of both the instructions of interest and the depths of execution for any given
instruction.
The core of the Vector Seeker algorithm is in Figure 4.4. The code takes three inputs:
the address of the instruction being instrumented, the Shadow Memory map SM, and a
Result Vector RV. This code is executed each time an instruction in the program
executes, which requires that both SM and RV be globally available.
The code in Figure 4.4 proceeds as follows. First it computes the start depth of an
instruction by taking the maximum of the depth where all of its operands were computed.
This is the earliest time the instruction can be started. If the instruction is a vector allocation,
it updates the shadow region of all allocated memory to the current time plus one. In the
case of a deallocation, it resets the shadow of all deallocated memory to ⊥. This way, it
maintains the partition of the dynamic dependence graph into vector and nonvector sources.
If the instruction is simply a load, store, or combination of the two, it copies time T to
the shadow of the destination. If T is ⊥ (that is, if the source is a scalar), this value
is copied to the destination to identify it also as a scalar location. The copy is assumed to
take no time so that moves that are not algorithmic, such as register spills, will not introduce
delays. Finally, if the computed time is not ⊥ (that is, when at least one of the operands is
a vector), the shadow of the destination will be assigned T + 1 and the Result Vector will
have the value indexed by the instruction address and time T + 1 incremented to represent
the growing vector.
When the instrumented program completes, Vector Seeker will have a Result Vector that
contains the size and number of distinct vector operations that can be executed at each
static instruction. This result represents the number of vectors that can be constructed for
the program. In the ideal case a static instruction will have a single vector at some time T
with a count equal to the number of times the instruction was executed in the program.
To illustrate this procedure, consider the assembly code in Figure 4.5 which corresponds
to the code in Figure 4.1. Assume that the vectors A, B, and C have been allocated by an
operation at depth 1 so that in the initial state of SM all array locations have a value of
1. All other locations are assumed to be on the stack and not allocated, and hence have a
value of ⊥. RV is currently empty with no entries.
When the first instruction executes, Vector Seeker will look up the max of all arguments
to the LD instruction. Since the only argument is the constant value of 0, the value of T will
be ⊥. The value of I is 1 for this example. Given this, SM[R0] will be updated with ⊥, which
is the value that it already had, and nothing will be done with RV. Line 2 is similar but
in this case, the value of SM [N] will be checked as the only argument and it will be found
to contain ⊥. Lines 3-7 will proceed similarly. The first instruction that updates anything
differently is at line 8. In this case the input operands for the instruction are R0 and R3 and
the memory location whose address is stored in R3 with the offset in R0. In this case the
values of both R0 and R3 will be found to be ⊥ when looked up in SM , but the value stored
at the address found from R3 with the offset in R0 will be one since it is the first element in
the B array. This will proceed through Greedy Instrument and update SM [R5] to contain 1.
It will not update RV since this instruction was a simple load or store. Line 9 will proceed
similarly.
Finally, in line 10 RV will be updated. This will happen as follows. First, T will be set
to 1, since the two operands read by this instruction are R5 and R6 and SM
contains 1 for both of these registers. Then, since this instruction is not a simple
load or store and its depth is not ⊥, it will take the final branch in Greedy
Instrument. Thus SM[R5] will be updated with 1 + 1, or 2, since the result of the add is
stored in R5. Finally I, which is the address of the instruction being executed, will be looked
up in RV. Since RV is empty, a new entry will be inserted into RV at 10. In this new map
T + 1, or 2, will be looked up, and since this is a new map it will also be empty and thus
the value will be zero. Finally this entry in the map will be incremented and will end up
containing the value of 1.
The store at line 11 will update the destination with the value of 2 found in R5, but since
it is a simple load or store it will not change RV. Line 12 will have ⊥ as the only input
operand and thus SM will continue to have ⊥ at address R0.
In the next iteration everything will proceed the same until line 10, where we will again
find T to be 1 and I to be 10 and thus will again look up 10 in RV. This time, however, we
will find that there is an entry here. When we look up 2 in that entry we will find that there
is already a value of 1 there, and this time it will be incremented to 2. The iterations will
 1 LD R0 0
 2 LD R1 N
 3 LD R2 A
 4 LD R3 B
 5 LD R4 C
 6 JGE R0 R1 loopout
 7 loopin:
 8 LD R5 R3[R0]
 9 LD R6 R4[R0]
10 ADD R5 R6
11 ST R2[R0] R5
12 INC R0
13 JLN R0 R1 loopin
14 loopout:
Figure 4.5: Simple Loop Assembly
continue in this fashion until the loop ends. The final value of RV will be a single entry at
RV [10] that contains a single entry at 2 with a value equal to the value of N since the loop
will execute N times before it completes.
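The walk-through above can be replayed with a small sketch of the non-allocation branches of Greedy Instrument. This is an illustration under stated assumptions, not the tool's implementation: locations are keyed by name rather than by address, ⊥ is represented by the constant BOTTOM, and the names Tracer, step, and replay_simple_loop are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

constexpr int BOTTOM = -1;  // stands in for ⊥ (scalar, not part of a vector)

struct Tracer {
    std::map<std::string, int> sm;          // Shadow Memory: location -> depth
    std::map<int, std::map<int, long>> rv;  // Result Vector: instruction -> depth -> count

    int shadow(const std::string& loc) const {
        auto it = sm.find(loc);
        return it == sm.end() ? BOTTOM : it->second;
    }

    // One Greedy Instrument step for a load/store (simple_move) or any other instruction.
    void step(int ip, bool simple_move,
              const std::vector<std::string>& inputs, const std::string& dest) {
        int t = BOTTOM;
        for (const auto& loc : inputs) t = std::max(t, shadow(loc));
        if (simple_move) {
            sm[dest] = t;          // copies are assumed to take no time
        } else if (t != BOTTOM) {
            sm[dest] = t + 1;      // result is available one level later
            ++rv[ip][t + 1];       // one more instance of ip at depth t + 1
        }
    }
};

// Replays the assembly of Figure 4.5 for n iterations and returns RV.
std::map<int, std::map<int, long>> replay_simple_loop(int n) {
    assert(n > 0);
    Tracer tr;
    auto elem = [](const char* arr, int i) {
        return std::string(arr) + "[" + std::to_string(i) + "]";
    };
    for (int i = 0; i < n; ++i)
        for (const char* arr : {"A", "B", "C"})
            tr.sm[elem(arr, i)] = 1;              // arrays allocated at depth 1

    tr.step(1, true, {}, "R0");                   // LD R0 0
    tr.step(2, true, {"N"}, "R1");                // LD R1 N
    tr.step(3, true, {}, "R2");                   // LD R2 A
    tr.step(4, true, {}, "R3");                   // LD R3 B
    tr.step(5, true, {}, "R4");                   // LD R4 C
    tr.step(6, false, {"R0", "R1"}, "");          // JGE R0 R1 loopout
    for (int i = 0; i < n; ++i) {
        tr.step(8, true, {"R0", "R3", elem("B", i)}, "R5");   // LD R5 R3[R0]
        tr.step(9, true, {"R0", "R4", elem("C", i)}, "R6");   // LD R6 R4[R0]
        tr.step(10, false, {"R5", "R6"}, "R5");               // ADD R5 R6
        tr.step(11, true, {"R0", "R2", "R5"}, elem("A", i));  // ST R2[R0] R5
        tr.step(12, false, {"R0"}, "R0");                     // INC R0
        tr.step(13, false, {"R0", "R1"}, "");                 // JLN R0 R1 loopin
    }
    return tr.rv;
}
```

Calling replay_simple_loop(8) yields a single entry for instruction 10 with a single depth, 2, whose count equals the number of iterations.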
This approach to measuring vector potential misses one well known class of vectorizable
loops: reductions. Since reductions have by their nature a true dependence from one iteration
to the next, Vector Seeker in its current form will not recognize them as having vector
potential. It will find other instructions in the loop to be vectorizable, so it will report
the loop as partially vectorizable.
4.2 Implementation
Vector Seeker is implemented using Intel PIN [20], which is a dynamic binary instrumentation
tool. This allows Vector Seeker to be implemented as the instrumentation of the machine
instructions and thus allows Vector Seeker to instrument any program that will run on an
Intel processor no matter the source language. When coupled with standard debugging
data these instructions can be tied back to the source code to improve the value to the
programmer. The final tool used to build Vector Seeker is XED, the X86 Encoder Decoder,
which allows Vector Seeker to easily decode the instructions in the program being instrumented
so as to get more data than is readily available directly from PIN.
Since Vector Seeker is built using PIN, it is implemented as a shared library that is loaded
by the PIN runtime, which manages the instrumentation tasks as well as the running of the
target program. The Vector Seeker shared library consists of three main sections:
initialization, the main algorithm, and finalization.
The initialization sets up the Shadow Memory (SM) and Result Vector (RV) as empty
global structures that will be used during the running of the code. In the baseline these are
implemented as simple STL maps. Then PIN is instructed to instrument three categories of
instructions. First, calls to malloc and free are instrumented to update SM when memory is
allocated or deallocated. Second, PIN is instructed to instrument all instructions with the
code that implements Greedy Instrument such that each time any instruction is executed
this code will be called. Finally, the start function is instrumented. Until this start function
is seen Greedy Instrument will be bypassed, and when this function is left the use of Greedy
Instrument will be disabled. If this function is called recursively, all calls will need to finish
before tracing will be disabled.
The main algorithm implements Greedy Instrument, updating the SM and RV as appropriate.
It uses data provided by PIN to locate the addresses and all memory updates during
the execution of the instruction, and uses XED to find other information on the instruction
class and operands as well as to decode the instruction for later reporting.
The finalization code is in two parts. The first, which is called any time instrumentation is
turned off, will log the contents of RV to the output log. The second, which will be called
at the end of the program, will log the contents of RV if there was tracing still going on and
will also perform any cleanup needed, such as closing log files.
void _tracer_array_memory(void *start, size_t length);
void _tracer_array_memory_clear(void *start);

void _tracer_traceon();
void _tracer_traceoff();

void _tracer_loopstart(long long id);
void _tracer_loopend(long long id);
Figure 4.6: Instrumentation Functions
4.2.1 Code Instrumentation
While the raw mode of Vector Seeker has real value, there are several extensions that require
access to data beyond the binary. When debugging information is available, Vector Seeker
can associate the vector accesses with the line in the source program. This greatly improves
the value of the results that are provided by the tool.
When the source is available, several instrumentation functions, listed in Figure 4.6, can
be invoked. These functions belong to three categories: memory partition control, tracing
control, and loop information.
The first category of functions allows users to control memory partition by marking
regions of memory as vector memory. This is useful in cases where there are buffers that
are stack allocated that may be good candidates for vectorization. The pair of functions
_tracer_array_memory and _tracer_array_memory_clear allow the user to specify to Vector
Seeker when to mark a region as vector allocated by setting its Shadow Memory to 1 and
when to clear the Shadow Memory back to ⊥.
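A hedged sketch of how a stack buffer might be marked is shown below. The two declarations follow Figure 4.6; the empty bodies are stand-ins so that the sketch compiles on its own, and passing the length in bytes is an assumption, since the unit of the length argument is not stated here. The function sum_of_squares is a hypothetical example.

```cpp
#include <cassert>
#include <cstddef>

// Signatures from Figure 4.6. The empty bodies are placeholders for this sketch only.
void _tracer_array_memory(void* start, std::size_t length) { (void)start; (void)length; }
void _tracer_array_memory_clear(void* start) { (void)start; }

int sum_of_squares(int n) {
    assert(n >= 0 && n <= 64);
    int squares[64];  // stack buffer: not treated as vector memory by default

    // Mark the buffer as vector memory (length in bytes is an assumption).
    _tracer_array_memory(squares, sizeof squares);

    for (int i = 0; i < n; ++i) {
        squares[i] = i * i;       // candidate vector loop over the marked buffer
    }
    int total = 0;
    for (int i = 0; i < n; ++i) {
        total += squares[i];      // reduction over the marked buffer
    }

    // Reset the shadow of the buffer before it goes out of scope.
    _tracer_array_memory_clear(squares);
    return total;
}
```

Clearing the region before the function returns keeps later stack frames that reuse the same addresses from being treated as vector memory.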
The _tracer_traceon and _tracer_traceoff functions provide for granular tracing regions. By
default Vector Seeker starts tracing when entering main and ends when it exits. These
functions allow for tracing more limited regions (e.g., functions or loops), narrowing the
range in which potential vector operations are identified so that the analysis can be done
more quickly when the user is only interested in the vector potential of a limited region.
Tracing is managed like a stack so that when _tracer_traceon is invoked multiple times,
tracing will be enabled as long as the number of _tracer_traceon invocations exceeds the
number of _tracer_traceoff invocations.
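The nesting rule can be modeled with a simple counter. This is a minimal sketch of the rule as described, with the hypothetical name TraceGate; it does not show how Vector Seeker stores this state internally.

```cpp
#include <cassert>

// Tracing is enabled while trace-on calls outnumber trace-off calls.
struct TraceGate {
    int depth = 0;

    void trace_on() { ++depth; }

    void trace_off() {
        assert(depth > 0);  // a trace-off without a matching trace-on is an error here
        --depth;
    }

    bool enabled() const { return depth > 0; }
};
```

With two nested trace_on calls, the first trace_off leaves tracing enabled and only the second disables it.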
The scope limiting features are very valuable in Vector Seeker to deal with two different
cases we encountered. The first is where the vector potential would require very extensive
restructuring. Imagine a decoding program such as the simple one in Figure 4.7. In this
program, the outer loop reads in A and then calls the decode function on it. The dependence
graph for the loop within “decode” forms a chain and therefore, if nothing else is done,
Vector Seeker would show no vector parallelism. However, since the outer loop will read a
fresh A each time, Vector Seeker will allow vectorization of the code on the outer loop. Since
this example is so simple, vectorizing the outer loop might be a reasonable approach, but
in more complex situations it might not be easy to do this outer loop vectorization. In this
case, limiting the scope to the decode function would show no vector potential but limiting
it to the outer loop would. This kind of decision must be made by the programmer, and the
scope limiting functionality allows the programmer to explore different choices.
The second case is where there is some initialization code containing dependences which
cause misalignments of the shadow values of an array before entering a vector loop. This case
was one of the motivations in the design choices of the Holewinski et al. [12] work described
in Chapter 2. Imagine a simple code such as in Figure 4.8. The first loop computes the
prefix sum of A. This loop has a dependence chain, so it will set the shadow values of A to
separate increasing values. The second loop is trivially vectorizable but is missed due to the
increasing values in the shadow of A. By restricting the scope to the second loop, Vector
Seeker can determine that it corresponds to a vector operation. To handle situations like
this, we implemented a post processor that will take the output of the whole program trace
and rerun each function scope that had instructions for which the RV indicates multiple
vector operations (i.e., RV[i][t] > 1 for different values of t). The results from this are then
reported for the selected function scopes.
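The misalignment can be reproduced with a small model of the shadow values for the two loops of Figure 4.8. The sketch assumes all arrays were allocated at depth 1 and applies the T + 1 rule of Greedy Instrument to each update; the function name second_loop_add_depths is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <vector>

// Histogram (depth -> count) of the additions in the second loop, C[i] = A[i] + B[i].
// If trace_first_loop is true, the prefix sum A[i] += A[i-1] is traced first and
// staggers the shadow values of A.
std::map<int, int> second_loop_add_depths(int n, bool trace_first_loop) {
    assert(n > 0);
    std::vector<int> sm_a(n, 1), sm_b(n, 1);  // shadows after allocation at depth 1
    if (trace_first_loop) {
        for (int i = 1; i < n; ++i) {
            // The add reads A[i] and A[i-1]; its result lands one level deeper.
            sm_a[i] = std::max(sm_a[i], sm_a[i - 1]) + 1;
        }
    }
    std::map<int, int> histogram;
    for (int i = 0; i < n; ++i) {
        ++histogram[std::max(sm_a[i], sm_b[i]) + 1];
    }
    return histogram;
}
```

When the first loop is traced, each addition of the second loop lands at its own depth; when the scope is restricted to the second loop, all n additions land at depth 2.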
for(i = 0; i < N; i++)
    read(A[M])
    B[i] = decode(A[M])

decode(A[M])
    total = 0
    for(i = 0; i < M; i++)
        total += A[i]
    return total
Figure 4.7: Simple Scope Function Example
for(i = 1; i < N; i++)
    A[i] += A[i-1]

for(i = 0; i < N; i++)
    C[i] = A[i] + B[i]
Figure 4.8: Simple Scope Chain Example
The final category of instrumentation functions is used to simplify reporting down
to loops rather than instructions or source lines. The functions _tracer_loopstart and
_tracer_loopend allow Vector Seeker to relate instructions to loop bodies as follows.
When Vector Seeker encounters _tracer_loopstart, it pushes the loop id argument onto a
stack of loop ids. Similarly, when it encounters _tracer_loopend, it checks that the id matches
the top of the stack and pops the id from the stack. Then, when it reports the results, rather
than indexing them only by instruction address or source line number, they are also indexed
by the loop id that was at the top of the stack when the instruction was encountered. The
key idea here is that with these functions, the final reporting can group instruction results
into loops rather than lines, which helps when comparing to the reports from compilers on
automatic vectorization.
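The loop id stack can be sketched as below. The structure LoopScopedCounts and the constant kNoLoop are hypothetical, and counting executions per (loop id, instruction) pair is a simplification of the reporting described above, which also tracks depths.

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

const long long kNoLoop = -1;  // key used when no loop is active (assumption)

struct LoopScopedCounts {
    std::vector<long long> loop_stack;                 // active loop ids, innermost last
    std::map<std::pair<long long, int>, long> counts;  // (loop id, instruction) -> executions

    void loop_start(long long id) { loop_stack.push_back(id); }

    void loop_end(long long id) {
        // The id must match the innermost active loop.
        assert(!loop_stack.empty() && loop_stack.back() == id);
        loop_stack.pop_back();
    }

    // Attribute one execution of an instruction to the innermost active loop.
    void record(int instruction) {
        long long current = loop_stack.empty() ? kNoLoop : loop_stack.back();
        ++counts[std::make_pair(current, instruction)];
    }
};
```

An instruction executed inside a nested loop is attributed to the inner loop id, so the same instruction address can appear under several loop ids in the report.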
4.3 Limitations
There are several key limits to Vector Seeker. First, there are the limitations that come
from the optimistic nature of trace-based analysis, whose results will always be limited by
the input data used for the program run being traced. The others can be grouped into two key
categories: performance and missed opportunity.
In the case of performance there are two separate areas to be considered: memory and
runtime. In the case of runtime, instrumenting every instruction with a complex algorithm
will always introduce a large overhead but this limitation can often be avoided by tracing
runs with small input and thus short initial runtime. In many cases this is sufficient; however,
there will always be cases that are interesting but too large to be handled by Vector Seeker.
This runtime issue is exacerbated in the case where there is little potential for vectorization
since in this case the memory will be stressed as RV grows, with many small vector starts
that each require a separate entry. Later improvements described in Chapter 6 address these
problems.
The limitations that lead to missed opportunities come from two sources: the sequential
requirement of the base Vector Seeker, and vector patterns that are not found. The first is
again addressed in Chapter 6. The second is more complex, but the two key parts are missed
transformations and hidden opportunities.
In the case of missed transformations, the simplest example is the case of reductions. It
is well known that reductions can be vectorized even though they have a true dependence
chain in their execution. This is possible when the reduction operation is associative and
works by transforming the execution to reorder the operations such that several can be
executed in parallel. This cannot be found directly by Vector Seeker since it respects all true
dependencies that it traces. In some cases the user, when examining the output, can infer
the presence of a reduction but we provide no automatic solution.
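The reordering that makes a reduction vectorizable can be sketched with independent partial sums. Integer addition is used here so that reordering gives exactly the same result; for floating point the reordered sum may differ in rounding. The function names sequential_sum and lane_sum are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference reduction: one dependence chain through total.
std::int64_t sequential_sum(const std::vector<std::int64_t>& v) {
    std::int64_t total = 0;
    for (std::int64_t x : v) total += x;
    return total;
}

// Reordered reduction: `lanes` independent chains that could each occupy one
// vector lane, combined at the end. Valid because the operation is associative.
std::int64_t lane_sum(const std::vector<std::int64_t>& v, std::size_t lanes) {
    assert(lanes > 0);
    std::vector<std::int64_t> partial(lanes, 0);
    std::size_t i = 0;
    for (; i + lanes <= v.size(); i += lanes) {
        for (std::size_t l = 0; l < lanes; ++l) {
            partial[l] += v[i + l];  // no dependence between different lanes
        }
    }
    std::int64_t total = 0;
    for (std::size_t l = 0; l < lanes; ++l) total += partial[l];
    for (; i < v.size(); ++i) total += v[i];  // leftover elements
    return total;
}
```

Each partial sum still forms a true dependence chain, but the chains are independent of one another, which is the parallelism a trace of the original sequential loop does not expose.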
The second case of hidden opportunities comes from the choice to use the instruction
address as the grouping to ensure that all elements of a vector are executing the same
instruction. When a loop is unrolled by the compiler, there may be several instances of the
same instruction working on the same arrays but with separate instruction addresses. Again
we do not have an automatic solution to this problem, but often the user can work around
this either by disabling unrolling or by examining the final results to find this pattern.
Chapter 5
Using Vector Seeker
In this chapter we present some guidance on running and interpreting the results from Vector
Seeker. Vector Seeker is not a simple program and the output it produces from being run
on real programs is often quite large. In section 5.1 we discuss how to run Vector Seeker
and how to choose inputs. Then in section 5.2 we present several short examples of output
along with interpretations of the results.
5.1 Running Vector Seeker
Vector Seeker is implemented as a shared library that is used with Intel PIN. In this discussion
we will presume that the reader is familiar with the general use of PIN and instead we will
focus on issues that are specific to Vector Seeker. The key considerations in using Vector
Seeker are the choice of application configuration and the input set to use when instrumenting
the system.
Vector Seeker, much like a debugger, depends on debug information provided by the
compiler to associate the results of the trace with the source. Therefore it is important to
have debug information enabled when the application to be examined is built. In fact, by
default the tool will not report results on instructions that are not associated with any source
though this behavior can be changed if needed. The next most important choice after debug
information is optimization level. In general we recommend the use of -O1 with Vector Seeker.
This seems to provide the best compromise between performance and associating instructions
with the source. The use of higher levels of optimization may improve performance but it
often will cause problems with the debug information and make the results harder to interpret
due to the transformations of the code.
With the application built, the next problem is what inputs to use when running the
tool. While this choice is going to depend greatly on the program that is being examined,
there are some guiding principles that can be followed. First, it is of paramount importance
that the input used fully exercises the area of code that is of interest. This is needed both
so that there is sufficient data to allow Vector Seeker to find potential in the code and
to allow Vector Seeker to see all dependencies that occur in real execution, since Vector
Seeker will only track dependencies in the execution that it traces. Second, the input needs
to be small enough that execution will complete given the time and memory constraints of
the user. The selection here is more limited by memory since the results that will need to
be maintained in memory can grow, in the worst case when there is no vector possibility in
the code, proportionally to the number of instructions executed in the run of the program.
Finally the input needs to be large enough that the loops of interest can be separated from
the rest of the data. In general this requires at least hundreds of iterations when the loops
are being traced in the context of full applications.
The final major choice is what area of the code to trace. By default Vector Seeker traces
from the start of main until main is exited. This will cover the whole program but in many
cases this will add more noise than value. Limiting the scope that is being traced has several
advantages. The first is that execution time is reduced by limiting the time that tracing is
enabled. The second is that limiting the scope will limit the amount of execution between two
separate instruction invocations that are grouped into a single vector execution. Finally,
limiting the scope can remove uneven dependencies that come from code before the region
that is of interest. This is one of the ways to get around the problem described in Holewinski
et al. [12].
Vector Seeker provides two ways to limit the scope of the traced region. First the function
that starts tracing can be changed from main. This is done with the -f option. It is
important to remember that this option takes the function name as named by the calling
convention, not the name in the original source program. For example, in the case of C++ this will require
the mangled name rather than the name from the program source. This method has the
advantage of simply requiring a command line flag. The limitation is that Vector Seeker
only supports a single function to start tracing with rather than a set, though it will trace
all functions that are called by the starting function during the execution of the code.
The second method is more flexible but requires modifying the code and linking with a
shared library. This works by adding function calls to the original code that will be seen by
Vector Seeker when the program is executing. The tracer_traceon call will either start
tracing if it is not on or increment the tracing depth if it is on. The tracer_traceoff call
will stop tracing if the depth is one; otherwise it will decrement the depth. With this approach
there can be multiple traced regions for a single execution of the program rather than the
single region allowed with the command line technique. These calls give the programmer
maximum control over where to trace but they have the drawback that they change the code
of the program that is being examined.
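The nesting rule for these two calls can be modeled with a simple counter. The sketch below is a toy model only: the class and method names are hypothetical, it is not taken from the Vector Seeker library, and the choice to ignore an unmatched trace-off call is an assumption.

```cpp
#include <cassert>

// Toy model of the trace-on/trace-off nesting rule (hypothetical names).
class TraceScope {
public:
    // Models the trace-on call: start tracing, or nest one level deeper.
    void trace_on() {
        if (depth_ == 0) ++regions_;  // a new traced region begins
        ++depth_;
    }
    // Models the trace-off call: tracing stops when the depth was one.
    // Assumption: a call made while tracing is already off is ignored.
    void trace_off() {
        if (depth_ > 0) --depth_;
        assert(depth_ >= 0);
    }
    bool tracing() const { return depth_ > 0; }
    int depth() const { return depth_; }
    int regions() const { return regions_; }

private:
    int depth_ = 0;    // current nesting depth of trace-on calls
    int regions_ = 0;  // number of separate traced regions so far
};
```

Each time the depth returns to zero a traced region ends, so a run with several such regions would produce several tracing and reporting pairs in the log.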
5.2 Interpreting Vector Seeker Results
In this section we provide a short description of Vector Seeker log files. This is followed
by a discussion of how to interpret sections of the log produced by the mintest program, which is
included with the Vector Seeker source. The full source listing for this program is available at
the Vector Seeker Git repository [8]. We will start with a simple loop and a dependent loop
to show the most straightforward loops. Then we present a loop with indirection. Finally, we
conclude with two cases where the loops are vectorizable but Vector Seeker does not directly
find the vector potential along with ways to mitigate these problems.
5.2.1 File Format
The standard log files generated by Vector Seeker consist of pairs of tracing and reporting
sections. There will be as many pairs as there are separate tracing sections during the run.
If no sections are traced the log file may be empty. By default there is only
a single pair.
The tracing section can contain large amounts of data when debugging options are turned
on but we will not be discussing them here, since they are for debugging Vector Seeker and
should not be of interest to end users. This section may also contain information on when
sections of memory are marked and released as vector memory. By default this is not enabled
but can be turned on with the -log-malloc command line option. Finally, each tracing
section can contain a pair of lines: tracing turned on marking the start and tracing
turned off marking the end. If there is such a pair then the tracing section will be followed
by a reporting section.
The reporting section starts with the line #start instruction log and ends with the
line #end instruction log. Between these two lines will be the results from the results
vector created during the preceding tracing section. This report is made of Line Sections
sorted first by maximum execution count, then by source file name and finally by line number.
If reporting on instructions with no debug information is enabled using the -show-all flag
they will follow in a single section after all instructions with source lines identified by the
debugging information.
The log file Line Section format is shown in Figure 5.1. Each Line Section starts with
the format seen in line 1. The source file will be the full path of the source file location as
provided by the debug information. If the source file is not available then this will instead
contain the string “NO FILE INFORMATION”. The line number is also based on the debug
information. Finally, the execution count is the largest count of any instruction traced by
Vector Seeker for that source line.
1 /path/to/source,line number:execution count
2 instruction pointer:instruction_type:disassembled instruction
3 mangled_function_name
4 <depth,length><depth,length>...
5 instruction pointer:instruction_type:disassembled instruction
6 mangled_function_name
7 <depth,length>...
8 .
9 .
10 .
Figure 5.1: Log File Line Section Format
The source line will be followed by one or more instruction result entries. Each instruction
result entry consists of three lines. The first is the instruction information line as seen in
Figure 5.1 line 2. The address is given in hexadecimal followed by the instruction class as
given by XED, and finally the disassembly of the instruction in Intel syntax. The next line
shown in line 3 contains the function name as seen by the runtime of the function that was
executing the first time that instruction was seen. For example in the case of C++ this
name will be mangled and thus different from the function as seen in the code. In the case
of inlining it also may not be the expected function. Finally, in very unusual cases there
may be multiple function calls that can reach the same instruction and this will be the first
function that reached that instruction. The final line in each section will be a series of angle
bracket enclosed pairs. Each pair represents a dynamic vector seen during the execution.
The first number is the depth at which the vector was seen, and the second number is the
number of executions of the instruction that were found at that depth. This final line is
often very long; in fact, in the case of a completely unvectorizable instruction there will be
as many pairs as there were dynamic instances of the instruction during the traced region.
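For readers who want to post-process the final line of an instruction result entry, a minimal parsing sketch follows. It assumes only the angle-bracket pair syntax described above. The function names are hypothetical and are not part of Vector Seeker, and malformed numbers will cause std::stoul to throw.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using VectorPair = std::pair<unsigned long, unsigned long>;  // (depth, length)

// Parse a line such as "<2,1><3,1>" into (depth, length) pairs.
// Spaces around the numbers are tolerated; parsing stops at the
// first pair that is missing its comma or closing bracket.
std::vector<VectorPair> parse_vector_pairs(const std::string& line) {
    std::vector<VectorPair> pairs;
    std::size_t pos = 0;
    while ((pos = line.find('<', pos)) != std::string::npos) {
        const std::size_t comma = line.find(',', pos);
        const std::size_t close = line.find('>', pos);
        if (comma == std::string::npos || close == std::string::npos || comma > close) break;
        const unsigned long depth = std::stoul(line.substr(pos + 1, comma - pos - 1));
        const unsigned long length = std::stoul(line.substr(comma + 1, close - comma - 1));
        pairs.emplace_back(depth, length);
        pos = close + 1;
    }
    return pairs;
}

// Total dynamic executions of the instruction: the sum of all lengths.
unsigned long total_executions(const std::vector<VectorPair>& pairs) {
    unsigned long total = 0;
    for (const VectorPair& p : pairs) {
        assert(p.second > 0);  // every reported vector holds at least one execution
        total += p.second;
    }
    return total;
}
```

Dividing the total by the number of pairs gives the average vector length for the instruction: a single pair means one vector covers every execution, while as many pairs as executions means no vector potential was found.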
5.2.2 Simple Loop
To show the simplest case we start with the loop found in Listing 5.1. This is extracted
from the full listing at the end of the chapter. The disassembly can be found in Figure 5.2.
The relevant section of the log is found in Figure 5.3. The line executed was in mintest.cpp
at line 47 and was executed 40 times. The line only has one traced instruction: it was at
address 0x40077c and was an SSE add instruction adding the value in the address pointed
to into xmm0. This line was first encountered in the function _Z5basicv and consists of a
single vector at depth 2 with a width of 40. In this case the interpretation of the results
is trivial. This is a very vectorizable section of the code. Every instance of the instruction
in the traced region could be executed as a single vector instruction with a width of 40
operands. This is the only instruction reported for this whole loop since this add is the only
non-move operation on vector memory.
45 for(int i = 0; i < COUNT*4; i++)
46 {
47 A[i] = B[i] + 1;
48 }
Listing 5.1: Simple Loop Code mintest.cpp
1 400764: mov eax,0x0
2 400769: movsd xmm1,QWORD PTR [rip+0x4b7]
3 400771: mov rdx,QWORD PTR [rip+0x201918] # 602090 <B>
4 400778: movapd xmm0,xmm1
5 40077c: addsd xmm0,QWORD PTR [rdx+rax*1]
6 400781: mov rdx,QWORD PTR [rip+0x201910] # 602098 <A>
7 400788: movsd QWORD PTR [rdx+rax*1],xmm0
8 40078d: add rax,0x8
9 400791: cmp rax,0x140
10 400797: jne 400771 <_Z5basicv+0xd>
Figure 5.2: Disassembly of Simple Loop
1 /path/to/mintest.cpp,47:40
2 0x40077c:SSE:addsd xmm0, qword ptr [rdx+rax*1]
3 _Z5basicv
4 <2,40>
Figure 5.3: Mintest Simple Vector Log Section
5.2.3 Dependent Loop
Now we present a simple case that has a loop-carried dependence and cannot be directly
vectorized. The code extract is found in Listing 5.2, the disassembly in Figure 5.4, and
finally the relevant log section is in Figure 5.5. In this case the last line in the log is
wrapped so as to fit on the page. Once again there is a single instruction of interest, since
it is the only non-move operation on vector memory. The difference is that this time every
execution happens at a different depth and thus there are 29 separate vectors listed each of
length one. This indicates that there might be a loop-carried dependence. Looking at the
loop it is quite clear that there is a loop-carried dependence on C and thus this loop cannot
be directly vectorized.
63 for(int i = 1; i < COUNT*3; i++)
64 {
65 C[i] = C[i-1] + 1;
66 }
Listing 5.2: Dependent Loop Code mintest.cpp
1 4007d0: mov eax,0x0
2 4007d5: movsd xmm1,QWORD PTR [rip+0x44b]
3 4007dd: mov rdx,QWORD PTR [rip+0x2018a4] # 602088 <C>
4 4007e4: lea rcx,[rax+0x8]
5 4007e8: movapd xmm0,xmm1
6 4007ec: addsd xmm0,QWORD PTR [rdx+rax*1]
7 4007f1: movsd QWORD PTR [rdx+rax*1+0x8],xmm0
8 4007f7: cmp rcx,0xe8
9 4007fe: je 400805 <_Z9dependantv+0x35>
10 400800: mov rax,rcx
11 400803: jmp 4007dd <_Z9dependantv+0xd>
12 400805: repz ret
Figure 5.4: Disassembly of Dependent Loop
1 /path/to/mintest.cpp,65:29
2 0x4007ec:SSE:addsd xmm0, qword ptr [rdx+rax*1]
3 _Z9dependantv
4 <2,1><3,1><4,1><5,1><6,1><7,1><8,1><9,1><10,1><11,1><12,1><13,1><14,1><15,1><16,1>
5 <17,1><18,1><19,1><20,1><21,1><22,1><23,1><24,1><25,1><26,1><27,1><28,1><29,1><30,1>
Figure 5.5: Dependent Loop Log Section
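The contrast between the logs of the simple loop and the dependent loop can be reproduced with a small toy model. The rule used here is an assumption chosen because it matches the depths shown in Figures 5.3 and 5.5: untouched vector memory has depth 1, constants and registers have depth 0, and a result is one level deeper than its deepest input. It is a sketch of the idea, not the Vector Seeker implementation, and all names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

using DepthHistogram = std::map<int, int>;  // depth -> executions at that depth

constexpr int kVectorMemoryDepth = 1;  // assumed depth of untouched vector memory
constexpr int kOtherDepth = 0;         // assumed depth of constants and registers

// Assumed rule: a result is one level deeper than its deepest input.
int result_depth(int lhs, int rhs) { return std::max(lhs, rhs) + 1; }

// Models "A[i] = B[i] + 1" for n iterations (Listing 5.1 with n = 40).
DepthHistogram simple_loop(std::size_t n) {
    std::vector<int> depth_a(n, kVectorMemoryDepth);
    const std::vector<int> depth_b(n, kVectorMemoryDepth);
    DepthHistogram hist;
    for (std::size_t i = 0; i < n; ++i) {
        depth_a[i] = result_depth(depth_b[i], kOtherDepth);
        ++hist[depth_a[i]];
    }
    return hist;
}

// Models "C[i] = C[i-1] + 1" for i = 1 .. n-1 (Listing 5.2 with n = 30).
DepthHistogram dependent_loop(std::size_t n) {
    assert(n >= 1);
    std::vector<int> depth_c(n, kVectorMemoryDepth);
    DepthHistogram hist;
    for (std::size_t i = 1; i < n; ++i) {
        depth_c[i] = result_depth(depth_c[i - 1], kOtherDepth);
        ++hist[depth_c[i]];
    }
    return hist;
}
```

Under these assumptions simple_loop(40) yields one entry, 40 executions at depth 2, while dependent_loop(30) yields 29 entries of one execution each at depths 2 through 30, mirroring the two log sections.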
5.2.4 Missed Loop
Now we present a loop that illustrates the problems that can arise when the memory selected
as vector memory is inappropriate for a particular use. The relevant code can be found in
Listing 5.3 with the disassembly in Figure 5.6, and finally Figure 5.7 contains the relevant
section of the log. For this run we have used the default behavior of using heap allocations
to select vector memory.
79 void loop_bounds_dynamic(size_t start, size_t end, double *data, double *data2)
80 {
81 for(size_t i = start; i <= end; i++)
82 {
83 data[i] = data[i]+data2[i]*2;
84 }
85 }
Listing 5.3: Missed Loop Code mintest.cpp
1 400838: cmp rdi,rsi
2 40083b: ja 400859 <_Z19loop_bounds_dynamicmmPdS_+0x21>
3 40083d: movsd xmm0,QWORD PTR [rcx+rdi*8]
4 400842: addsd xmm0,xmm0
5 400846: addsd xmm0,QWORD PTR [rdx+rdi*8]
6 40084b: movsd QWORD PTR [rdx+rdi*8],xmm0
7 400850: add rdi,0x1
8 400854: cmp rsi,rdi
9 400857: jae 40083d <_Z19loop_bounds_dynamicmmPdS_+0x5>
10 400859: repz ret
Figure 5.6: Disassembly of Missed Loop
1 /path/to/mintest.cpp,81:11
2 0x400850:BINARY:add rdi, 0x1
3 _Z19loop_bounds_dynamicmmPdS_
4 <2,1><3,1><4,1><5,1><6,1><7,1><8,1><9,1><10,1><11,1><12,1>
5 0x400854:BINARY:cmp rsi, rdi
6 _Z19loop_bounds_dynamicmmPdS_
7 <3,1><4,1><5,1><6,1><7,1><8,1><9,1><10,1><11,1><12,1><13,1>
8 /path/to/mintest.cpp,83:11
9 0x400842:SSE:addsd xmm0, xmm0
10 _Z19loop_bounds_dynamicmmPdS_
11 <2,1><3,1><4,1><5,1><6,1><7,1><8,1><9,1><10,1><11,1><12,1>
12 0x400846:SSE:addsd xmm0, qword ptr [rdx+rdi*8]
13 _Z19loop_bounds_dynamicmmPdS_
14 <3,1><4,1><5,1><6,1><7,1><8,1><9,1><10,1><11,1><12,1><13,1>
Figure 5.7: Mintest Loop Variable in Vector Memory Log Section
When the code for this loop is examined it would appear that there should not be a problem
vectorizing the loop. There are two operations that look to be vectorizable on line 83: the
add and the multiply (the multiply by 2 appears in the disassembly as addsd xmm0,xmm0).
Yet when the log is examined there are four operations rather than two
that appear and none look to be vectorizable. The key to understanding what is happening
in this function is found at the call site which can be found in Listing 5.4. Here we can
see that all four arguments come from chuncks, which is a heap allocated structure. Thus
Vector Seeker will trace the operations not just on data and data2, but also on start and end.
Thus we get entries for both the cmp and the add on line 81. The key here is the add on line 81
since this add produces a real loop-carried dependence.
114 loop_bounds_dynamic(chuncks[0].start, chuncks[0].end, chuncks[0].data, chuncks[1].data);
Listing 5.4: Missed Loop Call Site mintest.cpp
This problem, which prevents Vector Seeker from reporting the real vector potential of this
loop, can be addressed in two different ways. First, the programmer could limit the scope
that is traced by starting the trace at the function loop_bounds_dynamic using
the -f flag and giving the mangled name _Z19loop_bounds_dynamicmmPdS_. This would
resolve the problem since dependences are not tracked outside the traced area: inside the
function start and end are not themselves heap allocated, so without tracing the dependence
into the function they will not be counted as vector memory. The second method would be
for the programmer to explicitly annotate the program to specify what memory should be
treated as vector memory and disable the use of the heap to mark memory. This would
allow the full program to be traced and still get the desired results but would be much more
intrusive and time consuming.
5.2.5 Indirect Loop
Because Vector Seeker performs a dynamic analysis of the program, it can take into account
runtime information in its analysis. This can be both good and bad. On the one hand, it
can spot data-dependent vector potential, which is something that compilers are not able to
do. On the other hand, if this information is taken out of context, we risk misinterpreting
Vector Seeker's output, which could lead to errors. To illustrate this we look at the loop in
Listing 5.5 with the disassembly in Figure 5.8. The relevant log segment can be seen in
Figure 5.9.
37 for(int i = 0; i < COUNT; i++)
38 {
39 A[I[i]] = B[i] + 1;
40 }
Listing 5.5: Indirect Loop Code mintest.cpp
1 400724: mov eax,0x0
2 400729: movsd xmm1,QWORD PTR [rip+0x4f7]
3 400731: mov rdx,QWORD PTR [rip+0x201940] # 602078 <I>
4 400738: mov rcx,QWORD PTR [rdx+rax*1]
5 40073c: mov rdx,QWORD PTR [rip+0x20194d] # 602090 <B>
6 400743: movapd xmm0,xmm1
7 400747: addsd xmm0,QWORD PTR [rdx+rax*1]
8 40074c: mov rdx,QWORD PTR [rip+0x201945] # 602098 <A>
9 400753: movsd QWORD PTR [rdx+rcx*8],xmm0
10 400758: add rax,0x8
11 40075c: cmp rax,0x50
12 400760: jne 400731 <_Z8indirectv+0xd>
Figure 5.8: Disassembly of Indirect Loop
The output of Vector Seeker suggests that this loop is vectorizable since the add instruction
is executed 10 times at the same dynamic depth. A quick glance at the source code
makes it clear why the compiler does not vectorize the loop: it would be unable to prove
that the values in I[] used for the indirect indexing of A[] never create
dependencies from one iteration to the next.
1 /path/to/mintest.cpp,39:10
2 0x400747:SSE:addsd xmm0, qword ptr [rdx+rax*1]
3 _Z8indirectv
4 <2,10>
Figure 5.9: Indirect Loop Log Section
However, if we look at the initialization for I[] in Listing 5.6 we see that each I[i] is
distinct. If we unroll the loop, it becomes the code in Figure 5.10. This is clearly vectorizable.
It is important to note that this is only true for this particular initialization of I[]. It may
not hold if I[] is initialized with different values. This highlights the fact that Vector Seeker
is merely a guide which serves to identify “loops of interest” to be examined more closely.
It should not be treated as a silver bullet that correctly identifies all vectorizable loops in
the program.
29 for(int i = 0; i < COUNT; i++)
30 {
31 I[i] = i;
32 }
Listing 5.6: Initialization Code for I mintest.cpp
1 A[0] = B[0] + 1
2 A[1] = B[1] + 1
3 A[2] = B[2] + 1
4 ...
5 A[COUNT-1] = B[COUNT-1] + 1
Figure 5.10: Unroll of Indirect Loop Code
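Whether the indirect store is safe to vectorize for a given input comes down to whether the index array repeats a value, since repeated indices would make the order of the stores into A[] matter. A hedged sketch of such a check is given below. The helper names are hypothetical and this check is not something Vector Seeker performs; it only illustrates the property the user would need to establish for every input.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Mirrors the initialization in Listing 5.6: I[i] = i.
std::vector<std::size_t> identity_indices(std::size_t count) {
    std::vector<std::size_t> idx(count);
    for (std::size_t i = 0; i < count; ++i) idx[i] = i;
    assert(idx.size() == count);
    return idx;
}

// Returns true when no index value occurs twice, in which case the
// stores through the index array touch distinct elements and cannot
// conflict with one another.
bool indices_are_distinct(const std::vector<std::size_t>& idx) {
    std::unordered_set<std::size_t> seen;
    for (std::size_t v : idx) {
        if (!seen.insert(v).second) return false;  // value seen before
    }
    return true;
}
```

For the identity initialization the check succeeds, which agrees with the single vector reported in Figure 5.9. An index array such as {0, 1, 1} fails the check, and a trace taken with that input would not support the same conclusion.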
5.2.6 Reduction Loop
The final case to be examined here is a loop that contains a reduction. The code for this
loop can be found in Listing 5.7 with the disassembly following in Figure 5.11. Finally,
Figure 5.12 contains the relevant log section. This loop is clearly a simple reduction and it is
well known how to vectorize such a loop. The problem for Vector Seeker is that such loops
contain true dependencies and thus will never be reported to be fully vectorizable.
71 double c = 0;
72 for(int i = 0; i < COUNT; i++)
73 {
74 c += (A[i] * B[i]);
75 }
Listing 5.7: Reduction Loop Code mintest.cpp
1 400807: mov rcx,QWORD PTR [rip+0x20188a] # 602098 <A>
2 40080e: mov rdx,QWORD PTR [rip+0x20187b] # 602090 <B>
3 400815: mov eax,0x0
4 40081a: xorpd xmm0,xmm0
5 40081e: movsd xmm1,QWORD PTR [rcx+rax*1]
6 400823: mulsd xmm1,QWORD PTR [rdx+rax*1]
7 400828: addsd xmm0,xmm1
8 40082c: add rax,0x8
9 400830: cmp rax,0x50
10 400834: jne 40081e <_Z9reductionv+0x17>
Figure 5.11: Disassembly of Reduction Loop
1 /path/to/mintest.cpp,74:10
2 0x400823:SSE:mulsd xmm1, qword ptr [rdx+rax*1]
3 _Z9reductionv
4 <2,10>
5 0x400828:SSE:addsd xmm0, xmm1
6 _Z9reductionv
7 <3,1><4,1><5,1><6,1><7,1><8,1><9,1><10,1><11,1><12,1>
Figure 5.12: Reduction Loop Log Section
The question is, what can Vector Seeker tell the user about such loops that can help to
guide vectorization? Even without recognizing the reduction idiom there can be potential to
partially vectorize the loop. In this case, the first instruction found on the line (the multiply)
is fully vectorizable. The addition found on the same line, however, is fully dependent and
not directly vectorizable. One way of vectorizing this loop would be to perform scalar
expansion on the multiplication in a separate loop and perform the addition reduction in a
separate loop. However, a better approach would be to recognize the idiom as a reduction
and transform it accordingly. Vector Seeker cannot directly recognize the reduction, since in
the end the extra information provided by the tracing methodology does not give the needed
information. It is often possible for a human to look at the output of Vector Seeker and see
that there is a reduction, but automating this proved to be too imprecise.
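The partial vectorization by scalar expansion mentioned above can be sketched as follows. The function names are hypothetical. The split version performs the same multiplies and adds in the same order as the fused loop; it only separates the independent multiplies from the dependent adds.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fused form, as in Listing 5.7: multiply and accumulate in one loop.
double dot_fused(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double c = 0;
    for (std::size_t i = 0; i < a.size(); ++i) c += a[i] * b[i];
    return c;
}

// Split form: the product is expanded into a temporary array so the
// first loop has no loop-carried dependence and can be vectorized;
// the second loop holds only the dependent chain of adds.
double dot_split(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    std::vector<double> prod(a.size());  // scalar expansion of the product
    for (std::size_t i = 0; i < a.size(); ++i) prod[i] = a[i] * b[i];
    double c = 0;
    for (std::size_t i = 0; i < prod.size(); ++i) c += prod[i];
    return c;
}
```

The cost of this form is the temporary array, which is one reason recognizing the reduction idiom directly is usually the better transformation.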
This chapter has provided an overview of how to interpret the logs that Vector Seeker
generates. These examples are very simple, and more complex code will have more
distractors that may require traces at multiple scopes to interpret. Nevertheless, we believe
the basic strategy holds and that this outline provides users with the key approaches.
Chapter 6
Vector Seeker Extensions
With the baseline model of Vector Seeker established, there are several extensions that
expand the value of the tool. These can be grouped into two categories: first, extensions to
better correlate the results of Vector Seeker with the original program; and second,
extensions to increase the range of programs that can be usefully instrumented. In this work, we
limit the first category to some simple automation described in section 6.1. The
second category, capability extensions, is described in section 6.2.
6.1 Automation
The core Vector Seeker tool is extended with some automation to support several
instrumentation and analysis tasks. These tools can be divided into preprocessing and post processing.
They were implemented in Python using PLY [3].
The preprocessing tool takes standard C or C++ source and inserts the
tracer_loopstart and tracer_loopend functions with ids that match the line number
in the original source code. This tool only handles well nested for and while loops. It can
work either on the raw source or on preprocessed code, depending on whether there is code
in the headers that should be instrumented.
There are two post processing tools. The first automatically executes Vector Seeker on
each of the function scopes that are of interest. This works by looking at the log file of a
whole run and selecting all function contexts that access potential vectors of a size greater
than one. The tool then runs the program for each such function producing a set of logs,
each named with the function that was traced.
The second kind of post processing tool takes logs from Vector Seeker and produces loop
summary information. For each loop that was examined by Vector Seeker, this reports the
maximum average vector size for any instruction in the loop, the minimum average vector size
for any instruction in the loop, the maximum number of distinct vectors for any instruction
in the loop, the minimum vector size for any instruction in the loop, and the maximum
vector size for any instruction in the loop. This information allows for quick interpretation
of the vector potential of the examined programs and is used in Chapter 7 to examine the
TSVC loops and Numerical Recipes.
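The five summary values listed above can be computed from the vector lengths reported for each instruction in a loop. The sketch below is written in C++ to match the other examples in this text and is not the Python tool described here. The struct and function names are hypothetical, and the input is assumed to hold, for each instruction, the length of every vector found for it.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <limits>
#include <numeric>
#include <vector>

struct LoopSummary {
    double max_avg_size;           // largest average vector size of any instruction
    double min_avg_size;           // smallest average vector size of any instruction
    std::size_t max_vector_count;  // most distinct vectors for any instruction
    unsigned long min_size;        // smallest vector size seen
    unsigned long max_size;        // largest vector size seen
};

using VectorLengths = std::vector<unsigned long>;  // one entry per vector

LoopSummary summarize_loop(const std::vector<VectorLengths>& instructions) {
    assert(!instructions.empty());
    LoopSummary s{0.0, std::numeric_limits<double>::infinity(), 0,
                  std::numeric_limits<unsigned long>::max(), 0};
    for (const VectorLengths& lengths : instructions) {
        assert(!lengths.empty());
        const unsigned long total = std::accumulate(lengths.begin(), lengths.end(), 0UL);
        const double avg = static_cast<double>(total) / static_cast<double>(lengths.size());
        s.max_avg_size = std::max(s.max_avg_size, avg);
        s.min_avg_size = std::min(s.min_avg_size, avg);
        s.max_vector_count = std::max(s.max_vector_count, lengths.size());
        s.min_size = std::min(s.min_size, *std::min_element(lengths.begin(), lengths.end()));
        s.max_size = std::max(s.max_size, *std::max_element(lengths.begin(), lengths.end()));
    }
    return s;
}
```

Applied to the reduction loop of Figure 5.12 (one vector of length 10 for the multiply and ten vectors of length 1 for the add), this gives a maximum average of 10, a minimum average of 1, at most 10 distinct vectors, and sizes ranging from 1 to 10. The gap between the two averages is a quick hint that the loop is only partially vectorizable.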
6.2 Capability Extensions
While the base Vector Seeker is a useful tool there are several ways to improve it. In this
work we present two extensions to address the performance and scalability of Vector Seeker:
Block Linear Memory and Dynamic Block Tracing. Finally we present extensions to support
multithreaded programs.
6.2.1 Block Linear Memory
One of the significant limitations on running Vector Seeker is the memory required by the
tool. When tracing a production size program the amount of memory used can be large and
thus require a large Shadow Memory. In the baseline implementation, Shadow Memory was
implemented with an STL map. This implementation was simple and reliable but has roughly
O(n) overhead. With an entry for every byte accessed in the original program this overhead
is significant. In addition, the access time grows as O(log n), which is significant when this
memory is large.
To improve this situation, the first step attempted was to replace the STL map with an
unordered map. This improves access time from O(log n) to O(1), because the map is based
on a red-black tree and the unordered map is a hash table. The savings in memory overhead
were less significant since both structures have an overhead of O(n). However, there were
some savings since the overhead in the map is in pointers, which on 64-bit systems take
8 bytes and the data being stored for each location is only 4 bytes. We also found that
the performance improvement was smaller than expected. This was
partially due to the loss of the spatial locality that was present in the program that is being
traced. This locality is at least partially present in the memory accesses of the red-black tree
implementation of STL map and completely missing in the hash implementation backing the
unordered map. To resolve this we decided to develop a new system we call Block Linear
Memory.
In implementing Block Linear Memory we noticed key features that we wanted to take
advantage of. First, while the programs that are being traced with Vector Seeker tend to
have good spatial locality, that spatial locality was completely lost when the program was
being traced. In part this happens simply due to the larger amount of memory that is
accessed with the tracing code that is added to the execution but it is exacerbated by the
implementation of Shadow Memory. Consider that for each access to memory in the original
program there will be a matching access to Shadow Memory by Vector Seeker and while
similar addresses in the original code will be close to each other in physical memory that
is not the case with accesses to a hash table. Recovering some spatial locality would both
improve the performance of the instrumentation code and reduce the added memory pressure
on the program being traced. The second feature is that while there are locations in memory
that have very high dependence depths and can require the full 4 bytes to store their depths
there are also several locations that never see dependence depths that are larger than can
be stored in a single byte.
Block Linear Memory is designed to take advantage of both these features. In implementing
Block Linear Memory we replace the implementation of Shadow Memory, which mapped
addresses directly to depths, with an implementation that maps addresses to Shadow Cache
Lines. These Shadow Cache Lines contain an array of depths that can be directly indexed
into using an offset from the start. This change directly recovers the locality in the original
program by storing whole cache lines in the hash table rather than individual bytes. This will allow the
tracing code to take advantage of the spatial locality in the traced program. It also reduces
the overhead by reducing the number of entries in the hash table from the number of bytes
touched by the program to the number of cache lines touched. Finally this provides a place
to control the space allocated for each byte in Shadow Memory.
Block Linear Memory is implemented as a hash table that maps an address, with its least
significant bits masked off, to a Shadow Cache Line structure. The number of bits masked off
is the base-2 logarithm of the number of elements in a cache line. This allows for O(1) access
to memory locations since the cache line can be looked up directly and the offset into it can
be calculated directly from the masked bits. This also allows for the values to
be stored densely in an array.
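A minimal sketch of this address-to-line mapping is shown below. It is an illustration under stated assumptions rather than the Vector Seeker implementation: the class name, the line size of 64 entries, and the fixed 4-byte depth per entry are all choices made for the example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

constexpr unsigned kLineBits = 6;  // assumed: 2^6 = 64 entries per shadow line
constexpr std::size_t kLineEntries = std::size_t{1} << kLineBits;
constexpr std::uintptr_t kOffsetMask = kLineEntries - 1;

class BlockShadowMemory {
public:
    // The hash key is the address with its low bits masked off; the
    // masked bits select the entry inside the line's dense array.
    void set_depth(std::uintptr_t addr, std::uint32_t depth) {
        const std::size_t offset = addr & kOffsetMask;
        assert(offset < kLineEntries);
        lines_[addr & ~kOffsetMask][offset] = depth;  // new lines start zeroed
    }

    // Locations that were never written report depth 0.
    std::uint32_t depth(std::uintptr_t addr) const {
        const auto it = lines_.find(addr & ~kOffsetMask);
        return it == lines_.end() ? 0u : it->second[addr & kOffsetMask];
    }

    std::size_t line_count() const { return lines_.size(); }

private:
    using Line = std::array<std::uint32_t, kLineEntries>;
    std::unordered_map<std::uintptr_t, Line> lines_;
};
```

Writing depths for 64 consecutive, aligned addresses creates a single hash entry instead of 64, and consecutive accesses land next to each other in one array, which is the locality the design is meant to recover.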
Consider the loops in the previous chapter. None of these would require more than a
single byte to store their depths; however, if the loop count of all the loops were increased by
a factor of 100 the dependent loops such as the reduction loop seen in Listing 5.7 would have
locations requiring depths greater than can be stored in a single byte. Yet much of memory
during this execution – for example all of the stack memory – would still only require a
single byte for each location. With the structure of Shadow Cache Lines we can control the
amount of memory allocated for storing the depths of each cache line.
This is implemented as follows. The storage for the depths array is dynamically allocated
and increased when there would be overflow. When a Shadow Cache Line is first accessed
the array that is allocated to store the depths is an array of bytes. This will handle many
locations including all locations that never depend on an array location. Then when a depth
that exceeds the amount that can be stored in a single byte is to be stored the Shadow Cache
Line allocates a new array with twice the space per location and copies the old data over
before finally storing the new depth that previously would not have fit in the array. This
takes time but with the doubling of memory the number of times this needs to happen is
limited.
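The widening scheme can be sketched as follows. This is a hedged illustration, not the Vector Seeker code: the class name, the 64 entries per line, and the byte-wise encoding of each entry are assumptions made so that the example is self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

class ShadowCacheLine {
public:
    static constexpr std::size_t kEntries = 64;  // assumed entries per line

    ShadowCacheLine() : width_(1), data_(kEntries, 0) {}  // starts as bytes

    std::uint32_t get(std::size_t offset) const {
        assert(offset < kEntries);
        return load(data_, width_, offset);
    }

    void set(std::size_t offset, std::uint32_t depth) {
        assert(offset < kEntries);
        while (depth > max_value()) widen();  // 1 -> 2 -> 4 bytes per entry
        store(data_, width_, offset, depth);
    }

    std::size_t bytes_per_entry() const { return width_; }

private:
    // Entries are stored low byte first, w bytes per entry.
    static std::uint32_t load(const std::vector<std::uint8_t>& d, std::size_t w, std::size_t i) {
        std::uint32_t v = 0;
        for (std::size_t b = 0; b < w; ++b)
            v |= static_cast<std::uint32_t>(d[i * w + b]) << (8 * b);
        return v;
    }
    static void store(std::vector<std::uint8_t>& d, std::size_t w, std::size_t i, std::uint32_t v) {
        for (std::size_t b = 0; b < w; ++b)
            d[i * w + b] = static_cast<std::uint8_t>(v >> (8 * b));
    }
    std::uint32_t max_value() const {
        return width_ >= 4 ? 0xFFFFFFFFu : (std::uint32_t{1} << (8 * width_)) - 1;
    }
    // Allocate an array with twice the space per entry and copy the old depths.
    void widen() {
        const std::size_t new_width = width_ * 2;
        std::vector<std::uint8_t> wider(kEntries * new_width, 0);
        for (std::size_t i = 0; i < kEntries; ++i)
            store(wider, new_width, i, load(data_, width_, i));
        data_.swap(wider);
        width_ = new_width;
    }

    std::size_t width_;               // bytes per entry: 1, 2, or 4
    std::vector<std::uint8_t> data_;  // kEntries * width_ bytes
};
```

Because the width doubles, a line is copied at most twice before it reaches the full 4 bytes per entry, which bounds the cost of growing.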
The combination of these two techniques significantly reduces the memory overhead re-
quired to maintain the Shadow Memory. The exploitation of the locality present in the
program being traced also significantly improves the performance of Vector Seeker by im-
proving the cache performance.
6.2.2 Dynamic Block Tracing
In the original Vector Seeker the tracing code was called on every instruction in the execution.
As a result, there was a minimum of one function call for every machine instruction in the
uninstrumented program. To reduce the number of calls, we would like to instrument several
instructions after they have executed. This can be done as long as it holds that if the first
instruction in the sequence is executed then the whole sequence will be executed. If that
is the case, when the sequence is done executing we can record the results of that sequence
on both Shadow Memory and the Result Vector. This way we can reduce the total number
of calls that are required to instrument the code using a technique we call Dynamic Block
Tracing.
The choice for this type of grouping is the Basic Block [1], which can be defined as a
sequence of consecutive statements in which the flow of control enters at the beginning and
leaves at the end. To implement this we use PIN's version of the basic block, the BBL,
which differs from the traditional basic block in a few key respects. First, traditionally the
last instruction in a basic block will be some form of control flow, such as a branch, that
ends the guarantee that the whole block will execute. While BBLs will end with a branch,
they can also end early for several reasons, including encountering cpuid and popf instructions.
They can also be terminated if they exceed the decode buffer in length. These changes, while
unexpected, cause no problems for our use of BBLs.
1 switch(i)
2 {
3 case 1: a += 1;
4 case 2: a += 1;
5 default: break;
6 }
7 return a;
Listing 6.1: Fall Through Switch Code
1 case1:
2 ADD R1 1
3 case2:
4 ADD R1 1
Figure 6.1: Switch Statement Body
There are two more interesting cases where BBLs differ from traditional basic blocks.
The first is the case of REP prefixed instructions: the REP prefix can be given to some
string instructions such that they repeat a specified number of times. These instructions will
appear at the end of the block in which they occur, as well as in a block on their own for
each execution beyond the first. The second case is an extension of the first in that in the
case of BBLs a particular instance of an instruction can occur in more than one BBL. This
differs from traditional basic blocks. Consider the assembly in Figure 6.1 which is the body
from the Listing 6.1. This is a standard switch statement with a fall through from case 1
52
to case 2 and a break. The body of this traditionally would produce two basic blocks – one
containing the ADD on line 2 and the other the ADD on line 4. In the case of PIN BBLs we
also get two blocks; first, a block that contains the ADDs from both line 2 and line 4 then a
second BBL that contains only the ADD from line 4. The key difference is that rather than
only allowing BBLs to have a single entry point there will be a separate BBL generated for
each entry point which may share instructions with other BBLs.
While these differences are important, and failing to understand them can cause problems,
they do not matter for our usage. Our only real requirement is that when a block completes,
every instruction in that block has been executed, and PIN BBLs meet that requirement.
Ideally, when using Dynamic Block Tracing there would be a single call to the instrumentation
code either just before or just after a BBL is executed. However, that is not the case in
practice. In implementing Vector Seeker we have avoided the need to read machine state and
do address calculation, and we continue to do so with Dynamic Block Tracing. Instead
we instrument every instruction that accesses memory, as well as the block itself, and push
the accessed addresses onto a queue. Then at the end of the block we have the gathered set of
addresses to use for shadow memory updates, and we execute the block from an instrumentation
perspective. This still eliminates all the calls for instructions that do not load or store.
For example, consider Figure 6.2, in which there are calls to Vector Seeker in both modes
of tracing. The red calls are for Dynamic Block Tracing and the blue calls are for
Instruction Tracing. The reduction in calls may not seem significant at first, but in the case
of Dynamic Block Tracing all calls other than the last one are small enough that PIN should
be able to inline them. Thus they incur less overhead than the standard instrumentation
call.
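The queueing scheme described above can be sketched in C as follows. This is only an
illustration under stated assumptions, not the Vector Seeker implementation: the names
on_memory_access, on_block_end, QUEUE_CAP and SHADOW_SLOTS are hypothetical, and Shadow Memory
is reduced to a small table of access counts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QUEUE_CAP    64   /* hypothetical bound on memory accesses per block */
#define SHADOW_SLOTS 16   /* hypothetical, tiny stand-in for Shadow Memory */

uintptr_t addr_queue[QUEUE_CAP];          /* addresses seen in the current block */
size_t queue_len = 0;
unsigned long shadow_hits[SHADOW_SLOTS];  /* per-slot access counts */
unsigned long blocks_processed = 0;

/* Small call attached to each load or store: it only records the address. */
void on_memory_access(uintptr_t addr)
{
    assert(queue_len < QUEUE_CAP);
    addr_queue[queue_len++] = addr;
}

/* Single larger call at the end of the block: replay the queued addresses
 * against the stand-in shadow memory, then reset the queue. */
void on_block_end(void)
{
    for (size_t i = 0; i < queue_len; i++)
        shadow_hits[(addr_queue[i] >> 3) % SHADOW_SLOTS]++;
    blocks_processed++;
    queue_len = 0;
}
```

The per-access routine does nothing but append to the queue, which is the property that makes
it a candidate for inlining; all of the analysis work is deferred to the one call at the end
of the block.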
Figure 6.2: Dynamic Basic Block Calls vs Standard Calls
Finally, there is one last advantage to the Dynamic Basic Block calling style. There
will be significant locality advantages both for the instrumentation code and the program
being instrumented. This can be seen by the reduced disruptions to the flow of code in both
cases. In the end this system provides a reasonable speedup without greatly increasing the
complexity of the instrumentation code.
6.2.3 Threading Support
Many modern programs in high performance computing are built so that they require
multithreading. This can be to support a communication thread or simply by design. The base
version of Vector Seeker cannot evaluate multithreaded programs. To be able to find vector
performance in these programs, Vector Seeker was extended to run on them as well.
While the base design of how Vector Seeker executes is fundamentally sequential, this
is not the problem it might seem at first. The key insight is that even in the case of a
multithreaded program each thread has a sequential execution. Thus we must build the
system to trace each thread individually. While this does solve the core problem the same
way that we solve the problem in an MPI environment, it has issues, since in the threaded
case there are often more fine grained communications that may carry dependencies that
would impede vectorization. While this is potentially true in the case of MPI as well, the
overheads involved do not lend themselves to these types of communications.
There is also an issue with the growth in memory that is required. If each thread has a
full instance of Vector Seeker running on it, there will be a huge amount of memory overhead
due to duplicate copies of Shadow Memory. This overhead is not needed, since in the
original program there is clearly only a single memory.
Design
These key problems lead to the following design for supporting multithreading in Vector
Seeker. There is also a tension between duplication and communication. If each thread has
a unique copy of all data structures there would be no need for synchronization and we could
run as in the MPI case. This would be the fastest solution but would have all the issues
described above. To find the correct balance point we need to use some shared data as well
as some private data. The major structures can be grouped into the following categories:
running state, instruction data, and results data.
The results data, described as the Results Vector in the Vector Seeker algorithm
description, stores the results that will be reported at the end. This data can be quite
large, and so from a memory saving point of view it would be nice to be able to share this
data. This is offset by the fact that this data is accessed frequently and during execution
most accesses are writes. Finally, since in many cases threads will be executing in the same
region, several threads would be writing to the same area in the Results Vector. Due to this
we have chosen to have separate Results Vectors for each thread.
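The trade-off above can be illustrated with a small C sketch. It is not the Vector Seeker
data structure: the layout, the names record_result and merge_results, and the limits
MAX_THREADS and NUM_INSTR are assumptions made for the example, and each result is reduced to
a simple counter. Each thread writes only to its own row, so the frequent writes need no lock,
and the rows are combined once when the report is produced.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_THREADS 4   /* hypothetical thread limit */
#define NUM_INSTR   8   /* hypothetical number of traced instructions */

/* One private results vector per thread: writes never contend, so no lock. */
unsigned long results[MAX_THREADS][NUM_INSTR];

/* Called by thread `tid` when it records a result for instruction `instr`. */
void record_result(size_t tid, size_t instr)
{
    assert(tid < MAX_THREADS && instr < NUM_INSTR);
    results[tid][instr]++;
}

/* Called once, after all threads have finished, to build the final report. */
void merge_results(unsigned long merged[NUM_INSTR])
{
    for (size_t i = 0; i < NUM_INSTR; i++) {
        merged[i] = 0;
        for (size_t t = 0; t < MAX_THREADS; t++)
            merged[i] += results[t][i];
    }
}
```

The cost of this choice is memory: the results storage grows with the number of threads,
which is the price paid for avoiding synchronization on the most frequently written structure.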
Chapter 7
Experiments
To evaluate Vector Seeker, during development we performed a large number of experiments
covering the different versions of the tool. Here, we present the key findings from the various
stages of development as follows. First, we present the baseline results on the tool to establish
the usefulness of Vector Seeker. These experiments are followed by tests using the automated
loop marking system. Next, performance results on the improved tool are presented; these
results include tests of threading support. Finally, to show the viability of the tool on
production sized codes, we present the results of using Vector Seeker with the PlasComCM code.
7.1 PACT and Media Bench II Manual Testing
We ran two sets of experiments to explore the effectiveness of Vector Seeker. The first
used two applications from Petascale Application Collaboration Teams (PACT) and eight
applications from Media Bench II [10]. For this evaluation, we focused on verifying the
success of our method as compared to the results of manual analysis. In the second set
of tests, we used the automated facilities to compare the result of autovectorizing with
ICC (version 13.1.3) to the vector potential found by Vector Seeker. These automated tests
were run against code from Numerical Recipes and code from the TSVC loops as modified
by Maleki et al. [21].
In all of these experiments we ran Vector Seeker on the whole program and then, using
the automated post-processing facilities, scoped execution to each function in which we saw
vector activity. In all cases, the code was compiled with ICC (version 13.1.3). To get better
debugging information when running Vector Seeker, the following flags were used for the
code that was executed by Vector Seeker: -inline-debug-info -g. This yielded the best
performance of Vector Seeker.
To verify the results of Vector Seeker, we compared its results with earlier results on
manual vectorizing and autovectorizing by Maleki et al [21]. To this end, we ran Vector
Seeker against the two applications from Petascale Application Collaboration Teams (PACT)
and eight applications from Media Bench II that had been used in Maleki et al [21]. In this
case, we did not run an exhaustive search across all statements in the program seeking to
identify vector potential; we wanted to compare not with compilers but with hand coding
and not all loops were hand coded in [21]. Therefore, we instead examined only the most
executed statements that showed activity on vector locations. To this end, we recorded all
statements that were executed as frequently as any instruction within the loops studied in
[21].
We then analyzed the results from these executions by hand. This manual analysis took
about five minutes per loop. We examined each instruction that was processed by the tool
for each loop that had instruction counts equal to the smallest loop worked on in [21]. We
considered a loop vectorizable in two cases. First, a loop was considered vectorizable if all
of the instructions in the loop were either a single vector or contained large (at least eight
element) vectors. Second, we considered a loop that had a single arithmetic instruction that
was not vectorizable at the end of the loop to be vectorizable as a reduction.
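Although this classification was done by hand, the two criteria can be written down as a
short C sketch. The structure and field names below are illustrative assumptions, as is the
reading that an instruction "contains large vectors" when its shortest vector has at least
eight elements.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define LARGE_VECTOR 8   /* "large" means at least eight elements */

/* Summary of one instruction in a loop; the field names are illustrative. */
struct instr_summary {
    unsigned long num_vectors;     /* vectors the instruction's executions formed */
    unsigned long min_vector_len;  /* length of the shortest of those vectors */
    bool is_arithmetic;
};

/* An instruction passes if it ran as one vector or only as large vectors. */
bool instr_vectorizable(const struct instr_summary *s)
{
    return s->num_vectors == 1 || s->min_vector_len >= LARGE_VECTOR;
}

/* Case 1: every instruction passes.
 * Case 2: only the last instruction fails and it is arithmetic (a reduction). */
bool loop_vectorizable(const struct instr_summary *loop, size_t n)
{
    assert(n > 0);
    for (size_t i = 0; i + 1 < n; i++)
        if (!instr_vectorizable(&loop[i]))
            return false;
    const struct instr_summary *last = &loop[n - 1];
    return instr_vectorizable(last) || last->is_arithmetic;
}
```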
In Table 7.1 we present our results. The Application column is the application that was
studied. The Function column lists the function that had a loop that was vectorized followed
by the number of the loop in that function. The Perc column is the percentage of execution
of the loop over the whole program as reported in [21]. Loops not considered in [21] are
marked NC in this column. Where one percentage is given for two loops of the same function,
it is shown on the first of the two rows. The next column summarizes the results from [21].
In the case of loops that were not previously considered we write our interpretation. Finally,
the two Vector Seeker columns report in which case Vector Seeker identified the loop as
vectorizable: when run on the whole program, Global; and on the function scope, Local.

Application    Function                Perc    Maleki et al.      Vector Seeker  Vector Seeker
                                                                  Global         Local
DNS            multadd 1               26.5%   Manual             Yes            Yes
               outerproduct3 1         16.7%   Automatic          Yes            No-Inline
               axpy 1                  15.1%   Automatic          Yes            No-Inline
               axpy2 1                 20.3%   Automatic          Yes            No-Inline
               vorticity_x 1           7.4%    Partial Automatic  Yes            No-Inline
               vorticity_y 1           7.4%    Partial Automatic  Yes            No-Inline
               vorticity_z 1           6.5%    Partial Automatic  Yes            No-Inline
MILC           mult_su3_nn 1           26.6%   Manual             Yes            Yes
               add_lathwvec_proj 1     18.2%   Manual             Yes            Yes
               mult_su3_na 1           29.9%   Manual             Yes            Yes
               fieldlink_lathwvec 1    4.0%    Manual             Yes            Yes
               sitelink_lathwvec 1     4.1%    Manual             Yes            Yes
               mult_su3_an 1           2.1%    Manual             Yes            Yes
               mult_add_su3_matrix 1   NC      Similar Manual     Yes            Yes
               mult_add_su3_vector 1   NC      Similar Manual     Yes            Yes
               add_su3_matrix 1        NC      Similar Manual     Yes            Yes
               mat_vec_sum_4dir 1      NC      Similar Manual     Yes            Yes
               mult_add_lathwvec 1     NC      Similar Manual     Yes            Yes
JPEG Encoder   forward_DCT 1           38.5%   Manual             Yes            Yes
               forward_DCT 3                   Partial Automatic  Yes            Yes
               jpeg_fdct_islow 1       30.8%   Automatic          Yes            No-Unrolled
               jpeg_fdct_islow 2               Manual             Yes            No-Unrolled
               grayscale_convert 1     2.9%    Partial Manual     Yes            Yes
               rgb_ycc_convert 1       NC      Scatter/Gather     Yes            Yes
               h2v2_downsamplet 1      NC      Scatter/Gather     Yes            Yes
JPEG Decoder   jpeg_idct_islow 1       62.1%   Manual             Yes            Yes
               jpeg_idct_islow 2               Manual             Yes            Yes
               ycc_rgb_convert 1       NC      Scatter/Gather     Yes            Yes
H263 Encoder   SAD_Macroblock 1        86.5%   Manual             Yes            Yes
               idctref 1               NC      Automatic          Yes            Yes
H263 Decoder   conv420to422 1          44.4%   Automatic          Yes            Yes
               conv422to444 1          44.4%   Automatic          Yes            Yes
               conv422to444 1          NC      Scatter/Gather     Yes            Yes
MPEG2 Encoder  dist1 1                 77.3%   Manual             Yes            Yes
               fdct 1                  NC      Automatic          Yes            Yes
MPEG2 Decoder  conv422to444 1          17.61%  Manual             Yes            Yes
               conv420to422 1          14.81%  Manual             Yes            Yes
               Saturate 1              9.84%   Manual             No-Global      No-Global
               idctcol 1               9.30%   Manual             No-Global      No-Global
               store_ppm_tga 1         NC      Scatter/Gather     Yes            Yes
MPEG4 Encoder  pix_abs16_c 1           34.7%   Manual             Yes            Yes
               pix_abs16_xy2_c 1       7.4%    Manual             Yes            Yes
               pix_abs16_y2_c 1        3.0%    Manual             Yes            Yes
               pix_abs16_x2_c 1        2.6%    Manual             Yes            Yes
MPEG4 Decoder  v_resample 1            19.3%   Automatic          Yes            Yes
Table 7.1: Results for PACT and Media Bench II Applications

NC              Loop was not considered in the work by Maleki
Manual          Loop was manually vectorized by Maleki
Automatic       Loop was automatically vectorized by icc
Partial Manual  Loop was partially auto vectorized by icc
Similar Manual  Loop was similar to loop manually vectorized by Maleki
Scatter/Gather  Loop requires scatter gather so not considered profitable
Yes             Loop was vectorized automatically and not considered
No-Global       Loop works on global memory and was not automatically found by the tool
No-Inline       Function was inlined so no function scope
No-Unrolled     Function was wholly unrolled so no loop
Figure 7.1: Acronyms used in Table 7.1
The results in Table 7.1 show that in most cases, the tool correctly finds the vectorization
potential in the codes examined. There are three types of cases where the tool fails to find
the potential. First, when loops are wholly inlined, the tool cannot find vector parallelism
in the function local context. This happens because the enclosing function is inlined, and
this is the case in several of the loops of the DNS when examined at the function level.
The next case is where the loop that should be vectorized is completely unrolled. This
happens in the JPEG Encoder. In this case, global scope found that there was potential to
vectorize but in the function scope, there was no vector potential found.
The final case where Vector Seeker fails is where the vector potential is on memory
locations that are not tracked, since Vector Seeker only considers operations that come from
memory locations that are allocated on the stack. In the case of the MPEG2 Decoder, the
code works on global arrays for the loops found in Saturate and idctcol. This can be fixed
by marking the global arrays using tracer_array_memory. With this change, these loops are
also found.
7.2 Automated
In this section we present tests using the loop marking system to compare the results pro-
duced by Vector Seeker to the results reported by the compilers. First we present results on
the TSVC loops then on code from Numerical Recipes.
7.2.1 TSVC Loops
To examine the performance of Vector Seeker on a larger set of codes, we first took the
TSVC loops and let ICC try to vectorize the whole benchmark. This was done with the
-O3 -vec-report1 -vec-threshold0 flags. This gave a baseline of 128 loops that were
vectorized. This does not match the number reported in [21] because the count of 128
loops includes initialization and verification loops. It also includes loops that ICC reports as
vectorized but that were not reported as vectorized in [21] where a loop was only considered
vectorized if it achieved speedup over the loop with no vector instructions. That is, in [21]
the compiler reporting that the loop was vectorized was not sufficient for it to be reported
as vectorized, but in this experiment it was.
The code containing all TSVC loops was modified to run each loop a single time, since
there was no need to repeat loops for timing. The code was then compiled with the -O0
-inline-debug-info -g flags and run with Vector Seeker. The results were post-processed
to summarize the loop information. These results are graphed in Figure 7.2.

Figure 7.2: TSVC Loops (x-axis: Vectorsize, 1 to 256; y-axis: Loops Vectorized, 100 to 170; series: Compiler, Some Instructions, All Instructions)

The plot has the vector size along the x-axis and the number of loops that meet that threshold
on the y-axis. The Compiler line is the number of loops that ICC reported as vectorized.
The results from Vector Seeker are the Some Instructions and All Instructions lines. The first
case, Some Instructions, means that some of the instructions in the loop that were examined
had an average vector size equal to or larger than the vector size on the x-axis. The average
vector size is computed as the total number of dynamic executions of the instruction divided
by the number of distinct vectors for that instruction. This represents the case where some of
the instructions in the loop operating on vector variables can be vectorized but not all. The
All Instructions case is the same except that all instructions must have an average vector
size at least as large as the required vector size. In this case, all of the instructions in the
loop operating on vector variables can be vectorized with that vector size.
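The two classifications can be expressed as a short C sketch. The structure and function
names are hypothetical and chosen only for this example; the arithmetic follows the definition
given above (dynamic executions divided by distinct vectors).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Per-instruction counts; the field names are illustrative. */
struct instr_stats {
    unsigned long executions;  /* dynamic executions of the instruction */
    unsigned long vectors;     /* distinct vectors those executions formed */
};

double average_vector_size(const struct instr_stats *s)
{
    assert(s->vectors > 0);
    return (double)s->executions / (double)s->vectors;
}

/* "Some Instructions": at least one instruction reaches the given width. */
bool some_instructions(const struct instr_stats *loop, size_t n, double width)
{
    for (size_t i = 0; i < n; i++)
        if (average_vector_size(&loop[i]) >= width)
            return true;
    return false;
}

/* "All Instructions": every examined instruction reaches the given width. */
bool all_instructions(const struct instr_stats *loop, size_t n, double width)
{
    for (size_t i = 0; i < n; i++)
        if (average_vector_size(&loop[i]) < width)
            return false;
    return n > 0;
}
```

For instance, a loop with one instruction that executes 256 times as a single vector and one
that executes 256 times as 256 separate vectors counts toward both lines at a width of one, but
only toward Some Instructions at a width of two.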
There are a few things to note. First, the top point of the graph is 159. This is the
total number of loops considered by Vector Seeker, since any loop that is executed may be
vectorized with a vector width of one element. Second, the gap between Some Instructions
and All Instructions that appears when the vector size moves from one to two represents the
loops that have a recurrence that occurs for some of the data in the loop but not all. This
is very often the case when the loop has reductions. Finally, the relatively flat behavior at
the end is due to the fact that in the benchmark the loops have very uniform trip counts.
The key idea demonstrated in this experiment is that Vector Seeker can locate the vector
parallelism that is found in the TSVC loops even in cases where a compiler has problems.
One such example can be seen in Figure 7.3. This code has had the outer timing and
repetition loops removed from the benchmark for clarity but is otherwise unchanged. The
loop is vectorizable in a relatively straightforward manner, but vector potential is missed by
ICC because ICC cannot resolve the constant propagation of variables n1 and n3 to allow
it to vectorize this loop. This type of analysis is always difficult for compilers since the
amount of code that would need to be analyzed, especially in a real code, is potentially
huge. However, Vector Seeker can find it.
/* LEN and the arrays a and b are globals defined by the TSVC benchmark. */
int s122(int n1, int n3);

int main(void)
{
    int n1 = 1;
    int n3 = 1;

    s122(n1, n3);
    return 0;
}

int s122(int n1, int n3)
{
    int j, k;
    j = 1;
    k = 0;
    for (int i = n1 - 1; i < LEN; i += n3) {
        k += j;
        a[i] += b[LEN - k];
    }
    return 0;
}
Figure 7.3: s122
Not all loops that are found by Vector Seeker are vectorizable with loop transformations
given current hardware since some, by design, require scatter and gather instructions. One
of the strengths of Vector Seeker is that it will find these loops since, while current hardware
and compilers cannot exploit the potential, it is possible that the programmer can do so by
changing the underlying data structures.
7.2.2 Numerical Recipes
To examine the performance of Vector Seeker on a larger set of codes that is not examined
extensively, we chose Numerical Recipes [22]. Their main characteristic is that they are writ-
ten cleanly without the typical complications resulting from the performance tuning found
in many benchmarks. It was hoped that this code would therefore be more representative of
the code that an average programmer would produce.
We encountered one unexpected issue with the tool and Numerical Recipes. Vector
Seeker, by design, does not support the x87 instruction set due to the technical challenges
such support presents, and the belief that such code should be rare under 64-bit code gener-
ation. We found that many of the 289 programs in Numerical Recipes had x87 instructions
when built on our test systems. We were able to avoid most of the failures by manually
altering the random number generator. This did change the random distributions but for
our purposes did not impact the dataflow in the code. Finally, on any program for which we
could not easily remove the x87 code, we traced every function in the program separately
and reported results on all functions that had no x87 code. Using these techniques we were
able to report results on the following contexts:
• Run on 289 programs
• Results on 149 whole programs with no x87
• 916 loops executed in context of the 149 whole programs
• 521 more loops traced in functions from the 289 programs
• 1413 total loops examined in at least one context
We then analyzed the results from Vector Seeker using the techniques from above. This
produced the results seen in Figure 7.4. These results are quite similar to the results from
the TSVC loops. Again the Compiler line is the number of loops that ICC reports as
vectorized using -O3 -vec-report1 -vec-threshold0, and the Some Instructions and All
Instructions lines are based on the maximum and minimum average vector lengths, respectively,
for the instructions in the loop. The key difference is that the vector potential continues
to fall rather than flatline. Since this code is really a regression test rather than a
benchmark, the size of the input data produces very small trip counts. It is also the case
that in this code there are many utility loops that occur in real code but do not occur in
the TSVC loops, since the latter are just loop benchmarks rather than actual codes. These
short loops cause the continued falloff of vector potential. In a practical application,
where performance optimization was the goal, this would not be an issue since the loops with
little to no execution could be safely excluded.
Figure 7.4: Numerical Recipes Loops
7.3 Performance Experiments
To test the performance improvements of Vector Seeker we used a subset of the Mantevo
Mini-apps [11]. This suite of mini-apps was developed at Sandia National Laboratories to
help with application performance design. These applications make for a reasonable proxy
for large scale scientific applications yet are small enough to allow for experimentation during
the course of development of the Vector Seeker extensions. Rather than using all the
mini-apps we selected four: CloverLeaf, miniAMR, Epetra Benchmark, and HPCCG.
• CloverLeaf: Solves compressible Euler equations on a Cartesian grid, using an explicit,
second-order accurate method.
• miniAMR: Computes a 3D stencil calculation with Adaptive Mesh Refinement.
• Epetra Benchmark: Executes Epetra kernels for sparse matrix-vector, sparse
matrix-multivector, and dense kernels.
• HPCCG: An approximation to an unstructured implicit finite element or finite volume
application.
To examine the vector potential in these applications we look at the block level results
on these mini-apps. Rather than using loops as before, which requires code annotations, we
here look at the results on the PIN basic blocks. Here we take the most conservative possible
model and consider a block to be vectorized if there are instructions in the block that are
examined by Vector Seeker and all of them are a single vector. The results of this analysis are
in Table 7.2.

                  Static Blocks                      Dynamic Blocks
Application       Total   Vector       Non-Vector    Vector             Total
CloverLeaf        12545   648 (8.42%)  7694          290369 (9.40%)     3089025
miniAMR           4055    39 (2.26%)   1724          9624933 (3.79%)    254131398
Epetra Benchmark  8900    6 (0.17%)    3589          1118861 (1.29%)    86887414
HPCCG             7322    12 (0.34%)   3488          78645772 (5.87%)   1339131230
Table 7.2: Block Results on Mantevo Mini-apps

This table is grouped into two categories, Static Blocks and Dynamic Blocks.
Static Blocks are the set of blocks that were instrumented by Vector Seeker. The Static Blocks
Total column is the total number of blocks that are instrumented by the tool even if they
are not executed during the run. The next column is the number of different blocks that are
marked as vectorizable, with a percentage; the percentages reported correspond to the ratio of
the Vector count to the Non-Vector count (for example, 648/7694 = 8.42% for CloverLeaf).
The Non-Vector column is the number of different blocks that were executed during the run
but not marked as vectorizable. In the Dynamic Blocks section we report the number of
times blocks were executed, with the Vector column being the number of executions of blocks
that were vectorizable and the percentage being the percentage of the total number of block
executions, which is listed in the Dynamic Blocks Total column.
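The conservative block model and the dynamic percentage can be sketched in C as follows. The
function names and the array-of-counts representation are assumptions made for illustration
and are not taken from Vector Seeker itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* vectors_per_instr[i] is the number of distinct vectors formed by the i-th
 * examined instruction of the block; n is the number of examined instructions. */
bool block_is_vector(const unsigned long *vectors_per_instr, size_t n)
{
    if (n == 0)
        return false;   /* nothing examined: the block is not counted as vector */
    for (size_t i = 0; i < n; i++)
        if (vectors_per_instr[i] != 1)
            return false;
    return true;
}

/* Share of dynamic block executions that belong to vector blocks, in percent. */
double dynamic_vector_percent(unsigned long vector_execs, unsigned long total_execs)
{
    assert(total_execs > 0);
    return 100.0 * (double)vector_execs / (double)total_execs;
}
```

Applied to the CloverLeaf row of Table 7.2, dynamic_vector_percent(290369, 3089025) gives
approximately 9.40, matching the reported value.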
Examining these numbers, the results are decidedly mixed. It is not clear that the number
of blocks that are vectorizable is a good match for the true potential of these codes. This
is largely due to problems we had not encountered before wherein the initial conditions of
most loops come from dynamically allocated–thus, vector–memory. This caused there to
be loop carried dependencies on the loop variables in some cases. Since our interest in this
case is just to study the performance of the tool and resolving these problems would require
extensive instrumentation code to be added to the applications we did not pursue this any
further. Finally, it is interesting to note how little of the actual code is covered by the test
inputs. For example, in the HPCCG case, fewer than half the blocks instrumented were ever
executed.
7.3.1 Memory and Speed
To test Block Linear Memory and Dynamic Block Tracing we ran the four codes on a desktop
system with an Intel® Xeon® E5-1607 and 16 GB of RAM. We ran three different versions of
the tool: first the baseline with no extensions enabled, then a version with Block Linear
Memory enabled, and finally a version with both Block Linear Memory and Dynamic Block
Tracing. We did not run a version with only Dynamic Block Tracing since the design of this
feature was based on the assumption that Block Linear Memory would be present.

Figure 7.5: Memory Overhead with Different Vector Seeker Versions (y-axis: overhead, 0% to 45000%; applications: CloverLeaf-1.1, MiniAMR, EpetraBenchmarkTest, HPCCG-1.0; series: Baseline, Memory Optimization, Full Optimization)

In Figure 7.5 we present the memory overhead of each version of Vector Seeker on
the four mini-apps. The usage is so small in the case of CloverLeaf that the savings are
not very significant, but in the other cases the savings in memory are huge. This is most
significant in the case of MiniAMR where there are significant numbers of locations that
have shallow dependence depth and thus get the most benefit from the compression scheme.
Finally in all cases a small amount of additional memory is consumed with the basic block
optimization but in every case this is insignificant compared to the total memory usage.

Figure 7.6: Time Overhead with Different Vector Seeker Versions (y-axis: overhead, 0% to 140000%; applications: CloverLeaf-1.1, MiniAMR, EpetraBenchmarkTest, HPCCG-1.0; series: Baseline, Memory Optimization, Full Optimization)

Turning to the performance provided by these optimizations, in Figure 7.6 we have the
time overhead of each version of Vector Seeker on the four mini-apps. In this case we see
that the largest performance increase comes from the memory changes. This is not surprising
given how large the savings were and the recovery of locality that was caused by the change.
There was still some performance to be gained from the basic block extensions.
7.3.2 Threading
Testing of threading was composed of two parts. First, tests were run to verify the stability
of Vector Seeker with threading extensions in a multi-threaded environment. Then in a
single-threaded environment, tests were run to measure the overhead that was incurred by
the threading extensions.
To test the stability of Vector Seeker, tests were run using the EPCC OpenMP
micro-benchmark suite [6]. These tests were chosen since they would stress the amount of
threading in the programs being traced. These tests turned up problems with the initial design
of simply statically decoding the instructions. In these benchmarks there are several cases
where there are code paths that jump past the atomic prefix of instructions. These code
paths are not found by the static decoding.
To resolve this limitation in the static decoding we had to modify the Vector Seeker
decode to have a multiple reader/single writer lock on both the basic block data structure
and the instruction data structure. With this change, Vector Seeker ran cleanly on the
micro-benchmark suite.
Given the nature of these benchmarks, there are no real interesting results found by
Vector Seeker so we will not discuss them here.
To measure the overhead imposed directly by the threading extensions we once again ran
Vector Seeker on a subset of the Mantevo benchmarks. We ran on the same four mini-apps
we used for the standard performance tests. These tests were run on an Intel® Xeon® CPU
E3-1220 v3 at 3.10 GHz with 16 GB of memory. These tests were all run with a
single thread so as to measure the overhead that is incurred over the baseline version.
The results of these tests can be seen in Figure 7.7. In this chart we present the time
overhead of the best threaded version when compared to the best version with no threading
support. Here we see that as the application grows, the overhead that is incurred grows. This
is due to the larger memory usage increasing the amount of locking that is incurred. The
end result of these tests shows that when possible threading should not be used with Vector
Seeker, but when needed it is possible to trace programs that require threading support.

Figure 7.7: Time Overhead from Threading Support (y-axis: overhead, 0% to 180%; applications: CloverLeaf, miniAMR, EpetraBenchmarkTest, HPCCG)

7.4 PlasComCM Experiments
In order to test Vector Seeker in a more realistic case than the benchmarks used
previously, we tested with PlasComCM. PlasComCM is a multi-physics solver that is being
extended by XPACC [2]. This code has been developed to allow investigation of compressible
viscous gases with a focus on turbulence. It has been used extensively on large scale
systems to investigate problems such as noise control of high-speed turbulent jets [15],
indirect combustion noise in a turbine stage [5], and high speed fluid-structure
interactions [23].
PlasComCM is a standard MPI-based parallel program that is driven largely by the input
file describing the problem to be solved. In our investigations we have used an input that
is designed to run on a single processor since in the standard MPI model each processor
will have its own process. This simplifies the task of interpreting the results. The input
we selected was based on discussions with members of the XPACC center who were already
engaged in performance analysis of PlasComCM.
We will describe our experiments with the PlasComCM code in two sections. First we
will describe our work to replicate the vector results already found by the center. Then we
will describe our predictive results on the NS_BC subroutine.
7.4.1 Verification of Initial Optimizations
Prior to our involvement with this code, work had been done to tune the application for single
threaded performance. This work, which included improving the vector utilization of the code
as well as general optimization, is described by Zhang [24]. We started our involvement in the
code by trying to predict the vector potential on the sections that were optimized by this
work. Here we proceeded much the way that we had in Section 7.1.
Once again we started by running against the whole program and then running again
on the functions of interest. Since we had the profiling information from the previous
performance tuning, we limited our investigation at this stage to functions that were studied
by the previous performance tuning. The initial state of the vector code as studied by the
previous work can be seen in Table 7.3. In this table we adopt the same naming convention
for Fortran subroutines as Zhang did in the work above. This table reports the vectorization
status of the inner loops in all of the top ten time consuming subroutines of PlasComCM
with two exceptions: NS_BC, which was not studied by the previous work due to the
complexity of the code; and NS_ALLOCATE_MEMORY, which as an allocation routine was not seen
to be profitable to work on directly. The Loops column reports the number of inner loops
found in the subroutine, the Vectorized Initially column reports the number of inner loops
reported vectorized by the auto-vectorizer in the compiler used, and finally the Vectorized
After Tuning column reports the number of inner loops that were vectorized after tuning.

Subroutine           Loops  Vectorized Initially  Vectorized After Tuning
VISCOUSTERMS_STRONG  17     2                     17
SPARSE_NEW           1      0                     1
CID                  3      3                     3
ARK2                 10     0                     10
NS_RHS_STRNRT        5      2                     5
INTERNAL_DERIV       3      1                     3
NS_RHS_EULER         9      2                     9
COMPUTEDV_IDEALGAS   2      0                     2
Total                50     10                    50
Table 7.3: Vectorization Status of PlasComCM

In the performance tuning work that was done, they were able to achieve significant
speedups and to vectorize all 50 loops reported. There were several issues that caused the
auto-vectorizer to fail to vectorize the loops, though the most frequent was a report that
the loop may have a vector dependence. This appearance of a potential vector dependence
when in reality there is none is exactly the case where Vector Seeker provides the clearest
feedback. In this case proper scoping was needed to clarify the results but, as expected,
Vector Seeker showed that all the loops have vector potential.
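To illustrate why a dynamic view can resolve such reports, consider a loop that updates an
array through an index array. A compiler cannot prove statically that the indices are distinct,
so it must assume a possible dependence. The following Python sketch is not Vector Seeker's
implementation; the helper names has_cross_iteration_conflict and trace_indirect_update are
hypothetical and introduced only for illustration. It takes the locations read and written by
each iteration of one run and reports whether any location written in one iteration is touched
by a different iteration.

```python
def has_cross_iteration_conflict(accesses):
    """accesses: list of (reads, writes) pairs, one pair per loop iteration.

    Returns True if a location written in one iteration is read or written
    in a different iteration during the observed run.
    """
    writers = {}  # location -> set of iterations that wrote it
    readers = {}  # location -> set of iterations that read it
    for i, (reads, writes) in enumerate(accesses):
        for loc in writes:
            writers.setdefault(loc, set()).add(i)
        for loc in reads:
            readers.setdefault(loc, set()).add(i)
    for loc, w_iters in writers.items():
        if len(w_iters) > 1:
            return True  # two iterations wrote the same location
        if readers.get(loc, set()) - w_iters:
            return True  # another iteration read what this one wrote
    return False


def trace_indirect_update(idx):
    # Models a[idx[i]] = a[idx[i]] + b[i]. Iteration i reads and writes
    # a[idx[i]]; reads of b are omitted because b is never written.
    return [([("a", j)], [("a", j)]) for j in idx]


# Distinct indices: no conflict was observed in this run.
independent = has_cross_iteration_conflict(trace_indirect_update([3, 0, 2, 1]))
# Index 0 repeats: iterations 0 and 2 touch the same element.
dependent = has_cross_iteration_conflict(trace_indirect_update([0, 1, 0, 2]))
```

A result of False only says that no conflict occurred for the input that was run, which is the
optimistic sense in which dynamic results of this kind should be read.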
7.4.2 NS BC
The NS BC routine as it originally stood was a 1479-line subroutine that computes several
different possible boundary conditions. The original code is so complex that
the flow control alone, with no body, would still be several pages long, which is why
work had not yet been done on optimizing the sequential performance of this subroutine.
As with the replication work, Vector Seeker was run on the NS BC subroutine. We summarize
the results of our tests in Table 7.4, since the details would be as unmanageable as the
original subroutine. We report here on lines rather than instructions, since this gives a
clearer comparison to the actual code. We also limit our report to lines in NS BC and do not
report on lines that are in code called by the subroutine. We report first on the total number
of lines that were examined as working on vector memory. This is much smaller than the total
line count of the subroutine,
but that is unsurprising since we are only looking at a single boundary condition here rather
than all that can be computed by the subroutine. Next we present the most conservative
choice from a correctness point of view, which is Single Vector. This is the number of lines
that could be executed as a single vector instruction. The next three categories contain three
different ways of considering large vectors; for the purpose of this experiment, we consider
vectors of more than 127 elements to be large. These are ordered from the most conservative to
the most optimistic. Minimum requires that all vectors be large, Average requires that the
average vector be large, and finally Maximum simply requires that there be a large vector. In
all cases, all instructions examined in the line must meet these criteria. The results showed
that there is a great deal of potential vector performance in this subroutine as measured by
Vector Seeker.
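The three large-vector criteria can be stated compactly in code. The following Python sketch
is an illustration of the criteria as described above, not code taken from Vector Seeker. The
function name line_categories, the input layout (a mapping from each examined instruction of a
line to the sizes of the vectors found for it), and the instruction names in the example are
assumptions made for this example. The Single Vector category is not modeled here.

```python
LARGE_THRESHOLD = 127  # vectors of more than 127 elements count as large


def line_categories(sizes_by_instruction, threshold=LARGE_THRESHOLD):
    """Classify one source line under the three large-vector criteria.

    sizes_by_instruction maps each examined instruction in the line to the
    list of vector sizes observed for it. A line meets a criterion only if
    every one of its instructions meets it.
    """
    size_lists = [list(sizes) for sizes in sizes_by_instruction.values()]
    if not size_lists or any(len(sizes) == 0 for sizes in size_lists):
        # Nothing was observed, so no criterion can be met.
        return {"minimum": False, "average": False, "maximum": False}
    return {
        # Minimum: all vectors are large.
        "minimum": all(min(s) > threshold for s in size_lists),
        # Average: the average vector is large.
        "average": all(sum(s) / len(s) > threshold for s in size_lists),
        # Maximum: at least one vector is large.
        "maximum": all(max(s) > threshold for s in size_lists),
    }


# Hypothetical line with two instructions; "mul" has one small vector.
example = line_categories({"load": [256, 256], "mul": [256, 64, 512]})
```

For this example the line fails the Minimum test because of the 64-element vector, but it
passes the Average and Maximum tests. Any line that passes Minimum also passes Average, and
any line that passes Average also passes Maximum, which matches the ordering from most
conservative to most optimistic.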
                        Lines
Examined                  495
Single Vector             220
Minimum Large Vector      129
Average Large Vector      186
Maximum Large Vector      345
Table 7.4: Vector Seeker Summary

With these results in hand, work was undertaken to extract the vector performance in this
code as well as to make more general performance optimizations. The details of this work
are described in Larson [18]. Since this work included optimizations and transformations
that were unrelated to the vectorization of the subroutine, we provide only a brief description
of the optimizations that relate to vector performance.
The first step in optimizing NS BC was to extract a version of the code with the specific
boundary condition, SAT FAR FIELD, that covered the area of interest for the XPACC project.
This extraction left a triply nested loop that contains nine basic blocks.
The optimizations vectorized six of these blocks; the other three were not
vectorized. The net result was a speedup of 2.0. This speedup required
extensive data copying; if this copying were avoided, a speedup of 3.98 would be possible.
Though this was not the end result of the optimizations on this subroutine, it shows the
promise that exists for finding vector potential in complex code.
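The two speedup figures give a rough sense of how much the data copying costs. The following
Python sketch works through the arithmetic under a simplifying assumption that is ours and not
stated in the source: the copying is a purely additive cost, so the run time with copying
equals the run time without copying plus the copy time. The function name copy_overhead is
hypothetical.

```python
def copy_overhead(speedup_with_copy, speedup_without_copy):
    """Estimate copy cost from two speedups over the same baseline.

    Times are normalized so that the original run time is 1.0.
    Assumes the copy cost is additive (an assumption of this sketch).
    """
    time_with_copy = 1.0 / speedup_with_copy
    time_without_copy = 1.0 / speedup_without_copy
    copy_time = time_with_copy - time_without_copy
    # Share of the optimized run time that is spent copying.
    copy_share = copy_time / time_with_copy
    return copy_time, copy_share


copy_time, copy_share = copy_overhead(2.0, 3.98)
```

Under this assumption the copying accounts for roughly a quarter of the original run time,
which is about half of the run time of the optimized version.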
Chapter 8
Conclusions
We believe that using dynamic information such as traces is the best way to find the limits of
vectorization. This allows us to extract an optimistic view of what can be found using vector
hardware. We intend to use this information to vectorize real codes and to understand what
language features enable vectorization the most.
In this work we have described programming tools that can immediately help vector experts
to better guide their work. Going forward, we believe this work has provided a research basis
to help guide compiler development. From these basic ideas we can branch out in several
different ways. Interactive compilers could be developed using trace information to guide
the programmer. Similarly, the optimistic dynamic ideas used in this work could be used for
more general parallel exploration. Finally, the Vector Seeker tool itself is a contribution
of this work and has been published on GitHub [9].
Bibliography
[1] A. V. Aho, R. Sethi, and J. D. Ullman. Compilers: Principles, Techniques, and Tools.
Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1986.
[2] A. S. B. Ballance. ASC eNews quarterly newsletter, June 2014. 2014.
[3] D. Beazley. PLY (Python Lex-Yacc). http://www.dabeaz.com/ply/.
[4] C. Bienia. Benchmarking Modern Multiprocessors. PhD thesis, Princeton University,
January 2011.
[5] D. J. Bodony. Scattering of an entropy disturbance into sound by a symmetric thin
body. Physics of Fluids (1994-present), 21(9):096101, 2009.
[6] J. M. Bull, F. Reid, and N. McDonnell. A microbenchmark suite for OpenMP tasks. In
Proceedings of the 8th International Conference on OpenMP in a Heterogeneous World,
IWOMP’12, pages 271–274, Berlin, Heidelberg, 2012. Springer-Verlag.
[7] D. Callahan, J. Dongarra, and D. Levine. Vectorizing compilers: a test suite and results.
In Supercomputing, Supercomputing ’88, pages 98–105. IEEE Computer Society Press,
1988.
[8] G. C. Evans. Vector Seeker. https://github.com/gcevans/VectorSeeker.
[9] G. C. Evans. Vector Seeker. https://github.com/gcevans/VectorSeeker, 2016.
[10] J. E. Fritts, F. W. Steiling, J. A. Tucek, and W. Wolf. MediaBench II video: Expediting
the next generation of video systems research. Microprocess. Microsyst., 33(4):301–318,
June 2009.
[11] M. A. Heroux, D. W. Doerfler, P. S. Crozier, J. M. Willenbring, H. C. Edwards,
A. Williams, M. Rajan, E. R. Keiter, H. K. Thornquist, and R. W. Numrich.
Improving performance via mini-applications. Sandia National Laboratories, Tech. Rep.
SAND2009-5574, 3, 2009.
[12] J. Holewinski, R. Ramamurthi, M. Ravishankar, N. Fauzia, L.-N. Pouchet,
A. Rountev, and P. Sadayappan. Dynamic trace-based analysis of vectorization potential of
applications. In Proceedings of the 33rd ACM SIGPLAN Conference on Programming
Language Design and Implementation, PLDI ’12, pages 371–382, New York, NY, USA,
2012. ACM.
[13] D.-K. Chen. Maxpar: An execution driven simulator for studying parallel systems.
Technical report, 1989.
[14] K. Kennedy and J. R. Allen. Optimizing Compilers for Modern Architectures: A
Dependence-based Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA,
USA, 2002.
[15] J. Kim, D. J. Bodony, and J. B. Freund. Adjoint-based control of loud events in a
turbulent jet. Journal of Fluid Mechanics, 741:28–59, 2014.
[16] D. J. Kuck, R. H. Kuhn, D. A. Padua, B. Leasure, and M. Wolfe. Dependence graphs and
compiler optimizations. In Proceedings of the 8th ACM SIGPLAN-SIGACT Symposium
on Principles of Programming Languages, POPL ’81, pages 207–218, New York, NY,
USA, 1981. ACM.
[17] M. Kumar. Measuring parallelism in computation-intensive scientific/engineering
applications. Computers, IEEE Transactions on, 37(9):1088–1098, 1988.
[18] J. L. Larson. On the single core optimization of PlasComCM subroutine NS BC: An
experience report. Technical report, XPACC, 2016.
[19] C. Lattner and V. Adve. LLVM: A compilation framework for lifelong program analysis &
transformation. In Code Generation and Optimization, 2004. CGO 2004. International
Symposium on, pages 75–86. IEEE, 2004.
[20] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi,
and K. Hazelwood. Pin: Building customized program analysis tools with dynamic
instrumentation. In Proceedings of the 2005 ACM SIGPLAN Conference on Programming
Language Design and Implementation, PLDI ’05, pages 190–200, New York, NY, USA,
2005. ACM.
[21] S. Maleki, Y. Gao, M. Garzaran, T. Wong, and D. Padua. An evaluation of
vectorizing compilers. In Parallel Architectures and Compilation Techniques (PACT), 2011
International Conference on, pages 372–382, 2011.
[22] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes
3rd Edition: The Art of Scientific Computing. Cambridge University Press, New York,
NY, USA, 3 edition, 2007.
[23] M. M. Sucheendran, D. J. Bodony, and P. H. Geubelle. Coupled structural-acoustic
response of a duct-mounted elastic plate with grazing flow. AIAA journal, 52(1):178–
194, 2013.
[24] W. Zhang. Performance analysis and optimization of a CFD application. Master’s thesis,
University of Illinois, 2015.
