On the Representation of Partially Specified Implementations and its
  Application to the Optimization of Linear Algebra Kernels on GPU by Beaugnon, Ulysse et al.
On the Representation of Partially Specified
Implementations and its Application to the
Optimization of Linear Algebra Kernels on GPU
Ulysse Beaugnon
E´cole Normale Supe´rieure and Google
ulysse.beaugnon@ens.fr
Basile Cle´ment
Inria and E´cole Normale Supe´rieure
basile.clement@ens.fr
Nicolas Tollenaere
Inria and E´cole Normale Supe´rieure
nicolas.tollenaere@inria.fr
Albert Cohen
Google
4c0h3n@gmail.com
Abstract
Traditional optimizing compilers rely on rewrite rules to iteratively apply program
transformations. This iterative approach hides optimization opportunities behind
intermediate transformation steps. For instance, vectorization can only be applied
to the innermost loop in a nest: one must first perform a loop interchange before
even considering vectorization of an outer loop.
In contrast, we propose an implementation framework representing programs as
sets of possible implementation decisions. Specifying one decision can have an
impact on others in a bidirectional manner: specifying that a loop must be vec-
torized prevents other loops from being nested inside it; conversely, specifying a
loop as an outer loop will prevent it from being vectorized. These optimization
decisions commute, obviating the pass ordering problem. We present a constraint
programming system to formally define, represent and explore such implementa-
tion spaces. We also propose an exploration strategy combining tree search and
branch-and-bound; the strength and novelty of this strategy reside in an analytical
model of the lower bound on the execution time of a set of possible implementa-
tions.
We showcase our approach on the construction and exploration of an implementa-
tion space for linear algebra kernels running on GPUs. We show this search space
is expressive enough to represent complex decisions that fundamentally change
the structure of the generated code. We also present preliminary results competi-
tive with the performance of native GPU libraries.
1 Introduction
General-purpose multicore processors encounter stiff competition from specialized hardware accel-
erators such as Graphic Processing Units (GPUs). Those accelerators usually offer massive paral-
lelism, providing a raw processing power orders of magnitude above that of general-purpose proces-
sors. In some domains, adequately taking advantage of this parallelism can be game changing; for
instance, the recent success of deep learning methods can largely be attributed to the performance
of GPU-accelerated libraries [21, 12].
Unfortunately, writing or generating efficient code for GPUs is a complex problem. It requires mak-
ing careful decisions such as mapping computations to the appropriate level of parallelism, moving
Preprint. Work in progress.
ar
X
iv
:1
90
4.
03
38
3v
1 
 [c
s.P
L]
  6
 A
pr
 20
19
data across the memory hierarchy and through memory spaces of different sizes and properties, in
addition to the many possible thread-local optimizations. While the set of potential implementation
decisions is usually well known, complex interactions between them make it hard to find the best
global implementation. Taking one decision can alter the low-level structure of the code or change
the pressure on constrained resources (such as available memory or hardware threads), which can
render other decisions invalid or inefficient. These interactions are critical for performance; for
instance, parallelizing too many loops can be harmful if the threads end up doing too few computa-
tions.
The problem of picking a good implementation in an interaction-heavy optimization space is not
unique to GPUs and occurs on any architecture complex enough that exhaustive enumeration is im-
possible. Multiple approaches have thus already been proposed to represent and explore the space
of potential implementations, such as SPIRAL [26] for DSP codes, LGen [29] for basic linear alge-
bra, or LIFT [31] for more general computations expressed using combinators. These approaches
all share a reliance on rewrite rules to iteratively apply transformations to the code. We argue that
rewrite rules make it hard to anticipate possible downstream transformations and to derive profitabil-
ity information from one point in the search space: while the transformations that can be directly
applied are known, further ones may only become available after so-called “enabling” transforma-
tions have been applied. Moreover, transformations may not commute and different sequences of
transformations may lead to the same result. These factors make it hard to find the best implementa-
tion and as a result, the production and the specialization to a given target or problem size of highly
tuned libraries remain a critical challenge.
• We propose an approach to formally define, represent and explore spaces of implementation
candidates with partially instantiated decisions, and we model the interactions between
such decisions as a constraint satisfaction problem.
• Open decisions are listed explicitly and can be taken in any order, offering a well-behaved
optimization space; in particular, the most performance-impacting decisions can be made
first.
• Precise information can be derived from sub-spaces even before all implementation deci-
sions are made; this includes performance estimates and bounds.
• We provide a method and tool to automatically generate efficient code to represent and
manipulate partially specified implementations, starting from a high-level description; this
tool could be applied to other classes of code optimization problems with different imple-
mentation decisions.
• We demonstrate that the approach is expressive enough to represent complex decisions that
fundamentally change the structure of the generated code, including compositions of strip-
mining, loop fusion, loop interchange, unrolling, vectorization, mapping to multiple levels
parallelism, and orchestrating data movements across the memory hierarchy.
• We present preliminary results on the generation of optimized linear algebra kernels for
GPUs, competitive with native libraries.
Section 2 explains how we represent implementation candidates, Section 3 presents the framework
we use to describe potential decisions and their interactions and Section 4 applies this framework
to linear algebra on GPUs. Then Section 5 shows its practical benefits and Section 6 compares our
work with other approaches.
2 Partially Specified Implementations
Our first goal is to build a representation for implementation spaces that is suitable for search al-
gorithms. We build this representation on the notion of implementation candidate—or candidate
for short—which is a partially specified implementation. A search algorithm works on implemen-
tation candidates, which are further refined until they become fully specified implementations that
can be executed. All candidates refer to a common semantic backbone describing the computation
to implement.
2
2.1 Partially Instantiated Decision Vector
We represent implementation candidates as partially instantiated vector of decisions that specify how
to implement computations. Decisions are variables of a Constraint Satisfaction Problem (CSP).
Each variable has a small set of values it can take, called its domain. A constrained domain contains
a single value.
The objective of a search algorithm is to find an assignment of variables respecting the constraints
by successively restricting the domain of each variable. The decision vector is the product of the
domains and thus a Cartesian over-approximation of the set of implementations. Two domains may
contain legal values that are incompatible with each other because of some constraint.
A search algorithm typically builds a tree whose nodes are candidates. The root is the fully un-
constrained candidate. From there, each node is a strictly more constrained version of its parent.
Fully constrained nodes have no children and correspond to actual implementations for which the
algorithm can generate code and measure execution time.
After restricting a domain, constraint propagators must be run to remove incompatible values from
other domains. In the general case, some incompatible values may not be detected before domains
are further restricted, thus requiring to backtrack. Section 5.2 shows that the constraints we use are
simple enough that this is rarely needed in practice compared to usual backtracking algorithms.
This approach offers two main advantages. First, domains define a well behaved set of actions the
algorithm can take to generate implementations. In particular, the set of available actions decreases
when descending in the tree, and the order in which actions are selected does not matter. This allows
the algorithm to make the most performance-impacting decisions first.
Second, we can define functions on candidates to drive the search. For instance, we can compute
an estimate of the execution time or a probability distribution on the decisions to take next. These
functions can derive relevant information about the sets of potential implementations described by
the domains instead of reasoning on a single implementation at a time.
2.2 Constraint Satisfaction Problem
The decision vector alone is not enough to generate code. It does not specify the semantics of the
code to generate nor how to instantiate decisions for different kernels. This information is present
in a separate structure describing the computation to implement. We call this structure a semantic
backbone; it is mostly constant through the search, although some decisions such as introducing
temporary variables can interact in a limited way with the backbone. Decisions can be understood
as defining properties on the objects in the backbone: for instance, if the backbone defines a loop, it
induces a decision for unrolling the loop.
Let us now present the formalism to derive a CSP from a backbone instance. It describes generic
choices and constraints to instantiate on each semantic backbone to define the decision vector. For
that, we view the backbone as a set of objects that respect different properties. For example, prop-
erties can be x is an instruction or x depends on y. While objects vary from one kernel to the other,
properties remain the same. They allow to describe the implementation space independently of the
backbone instance.
Formally, let Ω0 be a set of identifiers representing basic objects. We define the set of objects Ω as
the set of nested tuples of basic objects: for n > 0, we recursively define Ωn+1 = Ωn ∪Ω∗n where Ω∗n
is the set of tuples of arbitrary length, and Ω = ∪n∈NΩn. Note that this definition implies Ωk ⊆ Ω
for all k.
Let Π be a set of identifiers called properties; a : Π → N denotes their arity. A backbone instance
is a function X mapping properties to the objects satisfying it:
X : Π → P (Ω)
p 7→ X(p) ⊆ Ωa(p)
Finally, let D be a set of domains, where each domain is a (distinct) set of identifiers. A generic
choice is a tuple c : ((p0, . . . pk−1),D) ∈ Π∗ × D. It indicates a decision among the values of D
for each tuple of objects verifying the properties. For instance, the decision could be how each loop
3
should be implemented (regular loop, unrolled loop, parallel loop, etc.) or scheduling information
for pairs of statements (recall that objects can be tuples of basic objects).
We define an implementation space as the combination of a set C ⊆ Π∗ × D of generic choices,
and a set of first-order formulae called constraints. The alphabet for the constraints contains one
predicate symbol with arity a(p) for each property p ∈ Π, and one function symbol with arity
a(p0) × · · · × a(pk−1) for each choice c = ((p0, . . . , pk−1),D) ∈ C.
From there, a partially instanciated decision for a choice c = ((p0, . . . , pk−1),D) is a function:
χc : X(p0) × · · · × X(pk−1) 7→ P (D) .
The decision is partial in the sense that it restricts the possible values, but does not necessarily select
one. A partially instantiated decision is invalid if its target domain contains the empty set, and is
fully instanciated if its target domain only contains singletons.
For a given implementation space and a backbone X , a decision vector or implementation candi-
date (candidate for short) is a vector with one partially instantiated decision for each choice. An
implementation is an implementation candidate where
• all decisions are fully instantiated;
• all the implementation space constraints hold when interpreting the predicate symbol for
property p with X(p) and the function symbol for choice c with χc .
This formalism allows us to dynamically extend the implementation space during search. Extend-
ing the set of objects respecting properties adds decisions and constraints without invalidating pre-
existing ones. This is useful to lower specific constructs upon reaching decisions opening for further
refinement of the implementation choices. For example, allowing to insert the appropriate load and
store instructions after deciding to allocate a temporary array in memory. Lowerings may only add
new objects or add properties to existing ones.
It is important to note that such lowerings only concern the CSP solver. We pre-compute all possible
lowerings ahead of time for search heuristics. Lowerings allow to delay introducing variables and
avoid cluttering the CSP with decisions and constraints that may not affect the generated code.
The same properties can always be encoded with regular constraints instead, ensuring this does not
contradict our claim that all potential decisions can be available upfront.
3 Implementation Space Description
We now present the details of the search space definition framework following the formalism from
Section 2.2. It exposes a high-level language to define the different properties exposed by the se-
mantic backbone (Section 3.1), the decisions (Section 3.2) and the constraints (Section 3.3). It also
specifies when to add new objects to the semantic backbone (Section 3.4). A compiler reads this
language and generates code to create, represent and manipulate the partially instantiated vector of
decisions. In particular, it generates code to propagate the constraints, inserts new objects in the
backbone when needed and extend the CSP with the corresponding new choices and constraints.
This framework allowed us to experiment with multiple encodings for linear algebra kernels on
GPUs with limited effort. The same framework can easily be reused for different application do-
mains or hardware targets.
3.1 Interface With the Kernel Representation
The first step to specify an implementation space is to define sets of objects exposed by the semantic
backbone. Each set corresponds to a property p ∈ Π and is instantiated into X(p) for each kernel
instance X . Definitions indicates both the relation of the set to others and how to generate code
that manipulates its objects. This way, we can easily apply our approach to kernel representations.
Exposing a new property requires less than 10 lines of code.
For example, Listing 1 defines a set Instructions that contains objects of type Instruction
in the host language. The programmer specifies fields to indicate how to iterate on the set and to
retrieve its objects.
4
set Instructions:
type = "Instruction"
iterator = ...
item_getter = ...
... // Elided for brevity
end
set MemAccesses subsetof Instructions:
type = "MemAccess"
... // Elided for brevity
end
set AccessedRegions($inst in MemAccess ):
... // Elided for brevity
end
Listing 1: Set Definitions
choice enum cache($inst in MemAccesses ):
value L1:
// Use L1+L2 caches
value L2: // Use L2 cache
value READ_ONLY: // Use read -only cache
value NONE:
// Do not use caches
end
Listing 2: Choice Definition for Cache Directives
choice enum order($lhs in Instructions ,
$rhs in Instructions ):
value BEFORE:
value AFTER:
antisymmetric:
BEFORE -> AFTER
end
Listing 3: Choice Definition for Ordering Decisions
require forall $a in Instructions:
forall $b in Instructions:
forall $c in Instructions:
order($a, $c) is AFTER
|| order($a, $b) is not AFTER
|| order($b, $c) is not AFTER
Listing 4: Order Transitivity Constraint
choice counter local_mem ():
forall $mem block in MemBlocks:
sum "$mem block.size()" when
mem_space($mem block) is LOCAL
end
require local_mem () < "gpu.local_mem_size"
Listing 5: Local Memory Usage Counter
choice enum fused($lhs in Dimensions ,
$rhs in Dimensions ):
... // Elided
end
quotient ThreadDims of $dim in Dimensions:
is_thread_dim = dim_kind($dim) is THREAD
/ fused is TRUE
... // Elided
end
choice counter num_threads ():
forall $dim in Dimensions:
sum size($dim) when:
is_thread_dim($dim) is TRUE
end
Listing 6: Thread Dimensions Quotient Set
5
Programmers can express inclusion and disjointness relations between sets. They allow to declare a
set structure that matches the subtyping relation in the host language. For instance, MemAccesses
is a subset of Instructions and contains objects of type MemAccess, a subtype of Instruction.
Last we can parametrize sets with another set. For example, Listing 1 defines an AccessedRegions
set for each memory access. This translates to a binary property containing tuples of
AccessedRegions x MemAccess in our formalism.
3.2 Choice Definition
Next we define the set of generic choices C. Each c ∈ C has a list of sets corresponding to properties
p0, . . . pk−1 ∈ Π and a domain Dc ∈ D. When instantiated for a particular kernel X , the choice
defines a function χc from tuples of distinct objects in X(p0) × X(pk−1) to Dc .
For example, Listing 2 defines a function cache from memory accesses to cache directives:
cache : MemAccesses→ {L1, L2, READ ONLY, NONE}
This defines a CSP variable for each memory access exposed by the semantic backbone. The goal
of the implementation space exploration is then to assign a value v ∈ Dc to χc(x0, . . . xn−1) for each
(x0, . . . xn−1) ∈ X(p0) × · · · × X (pn−1) with x0, . . . xn−1 distinct. In our example, this is assigning a
cache directive to each memory access.
Another example, Listing 3 defines a generic choice order from pairs of instructions to orderings.
It first illustrates why we only consider tuples of distinct objects: it is the behavior expected by
the programmer when defining choices. The order between an instruction and itself would not
make sense. It also illustrates how we define antisymmetric choices, allowing the code generator to
generate stronger propagators and only store half of the domains of order variables.
We currently support two kinds of domains:
• enum choices, such as cache in Listing 2, that can take a small set of predefined values and
• integer choices, that take values in a small universe specific to each instance of the choice.
The choice definition includes a piece of code that retrieves the universe from the kernel
representation.
Our tool automatically generates types to represent domains and code to instantiate the generic
choices for a semantic backbone instance.
3.3 Constraints
Next, we express constraints on variables domains to avoid invalid implementations. We use con-
straints to:
• avoid incoherent decisions, e.g., non-transitive ordering of instructions;
• enforce correctness constraints imposed by the semantic backbone, e.g., data dependencies;
• and respect hardware-specific limitations, e.g., the size of the memory blocks allocated to
a memory space cannot exceed the size of that memory space.
Constraints are first order logic sentences on the choices, quantified over objects in specific sets.
Currently, they are universally quantified disjunctions of conditions on zero, one or two choices:
boolean constants, restrictions of a variable to specific values or comparisons between two variables.
For example, Listing 4 ensure the order choice defined in Listing 3 is transitive with a constraint
applied to all triples of distinct instructions a, b and c.
Constraints implicitly assume quantified objects are distinct. This makes it easier to write concise
constraints and matches the intended behavior in most cases. Our tool automatically inserts clauses
to enforce this in the generated code.
Constants can be pieces of code parametrized by the objects. This enables us to parametrize con-
straints at a fine grain without having to declare new sets. In particular, it allows to:
• parametrize bounds on integer decisions
6
size($mem) <= "$mem.max_size"
• selectively disable constraints for some objects:
"$inst.is_vectorizable" || <constraint >
Our tool generates code to propagate constraints. It restricts domains to remove incompatible deci-
sions at the initialization of the implementation space and after restricting any domain. The simple
form of our constraints allow to easily generate efficient propagation code. For fixed object vari-
ables, a constraint only reference a few choices. This limits the amount of propagation to perform
after each update.
Often, hardware limitations impose constraints on sums or products of quantities, such as the sum
of the memory region sizes (to ensure they fit in memory) or the product of the size of nested thread
dimensions (to ensure we do not exceed the maximal number of threads). Simple constraints cannot
express such values. Instead, one can define counters that track sums or products of values. For
example, Listing 5 keeps track of the local memory usage.
While remaining simple, our constraints are expressive enough to encode correct implementations.
Because the propagation code is automatically generated, it is easy to experiment with and encode
a wide range of optimization decisions.
3.4 Kernel Representation Lowering
Finally, we define triggers to lower backbone constructs when certain conditions are met. Triggers
run a callback that either adds properties to existing objects or creates new objects. For example, it
can add the property is stored in memory to a variable, creating new decisions to specify the memory
layout. At the same time, the trigger can also add a store and a load instruction to refine placement
decisions in the memory hierarcy.
Callbacks may only add objects to properties, not remove them so that previous variables and con-
straints remain valid. Moreover, we require that callbacks commute. It is important to understand
that triggers are not mandatory: one can add objects to properties upfront and condition decisions
and constraints with the lowering condition. In our example, cache directives would be forced to
None in candidates where memory accesses do not exist. The reasons to use triggers are:
• to avoid cluttering the decision vector with decisions that will not affect the generated code,
making constraint propagation more efficient;
• and intentionally delay some decisions to focus the search algorithms on more meaningful
choices first.
As with constraints, we define lowering conditions as first order sentences, with universally quanti-
fied object variables. This allows definitions to be independent of backbone instances.
Triggers can be used to define quotient sets: we often end-up needing to account for classes of
equivalent objects such as fused loops. Quotient sets contain one representative for each class of
objects respecting some condition. They are automatically maintained by our tool using triggers.
For example, in Listing 6 we count the number of threads. This is done with a quotient set containing
one representative for each class of fused iteration dimensions mapped to a hardware thread. This
also define a boolean choice, is_thread_dim indicating which dimensions are the representative
of their class (this is redundant, but helps when writing constraints).
Overall, we built a simple, domain-specific constraint programming framework that facilitates the
construction of search spaces, involving new choices and constraints. The framework also eases
porting our approach to different architectures or application domains. Because we abstract the
kernel representation as sets of objects, the definition is independent of the kernel instance. We
could probably implement our tool on top of a standalone CSP framework; apart from the need for
a domain-specific front-end to provide the above-mentioned kernel genericity, this would involve
several optimizations and customizations to reproduce the search strategy and match the efficiency
of our approach, motivating a custom design. In particular, we use the fact that the CSP definition is
independent of the backbone to avoid storing constraints instances in memory. Instead we statically
generate code that we call for different objects. This reduces the memory footprint of the CSP to the
sole decision vector, making it compact, easy to serialize for logging and fast to copy.
7
4 Partial Implementations for GPUs
This section builds on the framework presented in Section 3 representing partial implementations
for linear algebra kernels on GPU. It serves both as an illustration of our approach and as an effective
way to generate competitive code on GPU.
Since this representation targets dense linear algebra, we assume all loops are for-loops with data-
independent bounds (i.e., loop bounds only depend on the input size) and affine memory accesses to
multidimensional tensors. We focus on optimizations that work well on GPUs. Other representations
can be developped with similar ideas using the framework presented in Section 3.
We highlight the set and choice definitions and omit counters and constraints for the sake of brevity.
We highlight the main difficulties and original aspects of code generation, deferring more systematic
coverage to a later, dedicated paper.
4.1 Statements
We represent computations to perform as a list of instructions to execute. Each instruction performs
a single scalar operation: an arithmetic operation (addition, multiplication, cast, . . . ) or a memory
access (load or store). We then specify how many time instructions are executed by nesting them into
iteration dimensions. Iteration dimensions are akin to counted for-loops but may be implemented
differently. Depending on decisions, they can be unrolled, vectorized or parallelized.
Instructions Dimensions
a = load A[i_m] {dm}
b = load B[i_n] {dn}
c = a * b {dm, dn}
store C[i_m * n + i_n] <- c {dm, dn}
Table 1: Outer Product of Two Vectors Kernel
Table 1 shows an example of kernel that computes the outer product of two vectors A and B of sizes
m and n and stores the result in a matrix C of size m × n. This kernel has two iteration dimensions
dm and dn of respective size m and n and current index im and in. Note that this representation does
not imply a particular order of instructions or nesting of dimensions.
Listing 7 shows the definition of two sets listing the instructions and the dimensions. Together, they
form the statements. The order choice dictates the control flow structure. It simultaneously encodes
statements sequential ordering (BEFORE and AFTER), nesting (INNER and OUTER), and loops fusion
(MERGED). In particular, it can:
• move code inside or outside of loops, for example by setting order($stmt, $dim) to
BEFORE or INNER,
• interchange loops, by setting their order to INNER or OUTER.
• fuse loops by setting their order to MERGED.
• schedule loops and instructions by setting the order between two statements to BEFORE or
AFTER.
The order choice demonstrates the first benefit of our approach: the compiler is not limited to a
set of predefined high-level transformations. It does not even have to be aware of them. Instead
we expose many smaller decisions (here, the pairwise ordering between statements). High level
transformations are implemented as specific combinations of decisions. We are free to pick any
other assignment that respects the constraints. This approach makes it easier to understand the
interaction between transformations (as decisions) and to combine them since they are all exposed
in the same framework.
Listing 7 also defines a quotient set IterationDims that contains the iteration dimensions nested
outside each instruction. At the creation of the domain, it specifies the nesting imposed by the back-
bone. Afterward, new dimensions may be added when order gets constrained. It is grouped into
8
set Statements: ...
set Insts subsetof Statements: ...
set Dimensions subsetof Statements: ...
choice enum order($lhs in Statements ,
$rhs in Statements ):
// $lhs is executed before $rhs
value BEFORE:
// $rhs is executed before $lhs
value AFTER:
// $lrhs is nested inside $rhs
value INNER:
// $rhs is nested inside $lhs
value OUTER:
// $lhs and $rhs are fused dimensions
value MERGED:
antisymmetric:
BEFORE -> AFTER
INNER -> OUTER
end
quotient IterationDims($inst in Insts)
of $dim Dimensions:
is_outer_dim = order($dim, $inst) is
OUTER / order is MERGED
...
end
Listing 7: Control-Flow Related Defintions
equivalence classes by the order is MERGED relation so that fused dimensions are only accounted
for once.
# Either d_n and ‘b‘ are nested inside d_m
for i in 0..M: # d_m
a = load A[i]
for j in 0..n: # d_n
b = load B[j]
c = a * b
store C[i*M+j] <- c
# or d_m and ‘a‘ are nested inside d_n.
for j in 0..N: # d_n
b = load B[j]
for i in 0..N: # d_m
a = load A[i]
c = a * b
store C[i*M+j] <- c
Listing 8: Nestings for the Outer Product Kernel
Together with order, IterationDims demonstrates another benefit of our approach: the structure
of the code does not needs to be coherent before the candidate is fully constrained. For example, in
Table 1, load A is nested in dm, load B in dn and c = a ∗ b in dm and dn. As shown in Listing 8,
such a loop structure is impossible: either dn and load B are nested inside dm or dm and load A
inside dn. Here it is legal because ordering decisions are still open. We can delay choosing the
nesting while a regular control flow representation would need to pick one.
4.2 Iteration Dimensions
The dim_kind choice, shown in Listing 9, specifies how to implement dimensions. They can be
parallel (at the BLOCK or THREAD levels), fully unrolled or imply vector instructions.
We usually strip-mine dimensions of the original code into multiple ones when creating the search
space. This allows both to apply different dim_kind decisions to the resulting dimensions and to
tile computations to improve locality.
9
choice enum dim_kind($dim in Dimensions ):
// $dim is a regular for loop.
value LOOP:
// $dim is a hardware block dimension.
value BLOCK:
// $dim is a hardware thread dimension.
value THREAD:
// $dim is totally unrolled.
value UNROLL:
// $dim is vectorized.
value VECTOR:
end
Listing 9: Encoding of Dimension Decisons
set StaticDims subsetof Dimensions: ...
choice integer size($dim in StaticDims ):
"$dim.possible_sizes ()"
end
set LogicalDims: ...
set TilingDims($logical in LogicalDims)
subsetof StaticDims: ...
set TiledDim($logical in LogicalDims)
subsetof Dimensions: ...
Listing 10: Encoding of Strip-Mining and Tiling
Listing 10 shows how we encode strip-mining. We split dimensions in two categories: StaticDims
that have a statically known size, specified by the size choice, and dynamic dimensions (not ex-
posed in a set) whose size depend on input parameters. Logical dimensions represent dimensions
of the original code. They are formed of static dimensions, listed in TilingDims($logical_dim)
and zero or one dynamic dimensions, listed in TiledDim($logical_dim), depending on whether
the logical dimension has a static or dynamic size.
4.3 Memory Accesses
Listing 11 shows how we encode cache directives and memory placement. MemRegions are distinct
pieces of memory in which we allocate arrays. While memory regions holding input arrays are
always in RAM, others regions may also be placed in shared memory, local to a group of threads.
set MemRegions: ...
set MemInsts subsetof Insts: ...
set AccessedRegions($mem in MemInsts ): ...
choice enum mem_space($mem in MemRegions ):
// $mem is stored in RAM.
value GLOBAL:
// $mem is stored in the memory shared
// by a group of threads.
value SHARED:
end
choice enum cache($inst in MemInsts ): ...
Listing 11: Encoding Memory Placement Decisions
MemInsts lists loads and stores and AccessedRegions the memory regions accessed by such in-
structions. cache indicates which caches to use (see Listing 2 for the full definition).
10
4.4 Instruction Operands
Instruction operands are either constants, kernel inputs, induction variables or values produced by
instructions. Induction variables are linear combination of dimension indexes. We use them to
compute memory accesses addresses. We handle them separately from regular instructions as it is
clear how they should be implemented, and exposing them in the search space would only increase
complexity.
By default, operands can only take the last value produced by preceding instructions. To implement
reductions, an instruction operand can also take the value produced by the same instruction at the
previous iteration on a given set of dimensions. In that case, we specify another instruction to
initialize the reduction.
Additionally, we can specify point-to-point communication between dimensions: the value pro-
duced at iteration i of a dimension is consumed at iteration i of another dimension. For example, in
Listing 12, the instruction nested in d0_b and d1_b reads the values produced in d0_a and d1_a.
# Point -to -point communication
for i in 0..M: # d0_a
for j in 0..N: # d1_a
x = load A[i][j]
for i in 0..M: # d0_b
for j in 0..N: # d1_b
# y = 2 * A[i][j]
y = 4 * x[d0_a -> d0_b , d1_a -> d1_b]
# Possible implementation.
for i in 0..M: # fused d0_a and d0_b
for j in 0..N: # d1_a
x = load A[i][j]
store TMP[j] <- x
for j in 0..M: # d1_b
x2 = load TMP[j]
y = 4 * x2
Listing 12: Point-to-Point Communication Between Dimensions
Point-to-point communications allows putting each instruction in its own loop nest. This is useful
as it gives us more flexibility to schedule, fuse and implement dimensions. Point-to-point commu-
nication is either implemented by:
• fusing both dimensions;
• unrolling or vectorizing both dimensions, with the data stored in different registers for each
iteration;
• mapping the two dimensions to the same hardware thread dimension, as explained in Sec-
tion 4.5;
• or using a temporary array.
A trigger automatically creates a temporary array and the associated load and store when other
options are impossible. In particular, this allows copying data to a temporary array in a faster
memory for improving locality. For example Listing 12 shows a possible implementation that stores
chunks of A in an array TMP.
4.5 Thread Mapping
GPUs expose two levels of parallelism: threads and blocks of threads. Thread dimensions define a
block of threads that share a fast memory and can synchronise. Block dimensions replicate blocks
in parallel. Each block or thread dimension can be mapped to one of three hardware dimensions,
hereafter referred to as levels to distinguish them from iteration dimensions.
Blocks cannot easily communicate so block dimensions are outermost parallel dimensions that span
the entire computation. On the other hand, synchronization within a thread block is critical to
achieve good performance. We encode synchronisation barriers by creating multiple loop nests of
11
thread dimensions that maps to the same hardware level. At code generation, we fuse the loop
nests and insert a synchronisation instruction between them. In particular, this allows point-to-point
communications between thread dimensions mapped to the same level.
choice enum thread_level($x in StaticDims ,
$y in StaticDims ):
// maps $x and $y to the same level.
value MAPPED:
// maps $x to an inner level than $y.
value INNER:
// maps $y to an inner level than $x.
value OUTER:
// $x or $y is not a thread dimension.
value NOT_THREADS:
antisymmetric: INNER -> OUTER
end
quotient HwThreadDims of $d in StaticDims:
is_thread_dim = dim_kind($d)
/ thread_level is MAPPED
...
end
Listing 13: Thread Dimensions Mapping
The thread-mapping choice in Listing 13 specifies how thread dimensions map to hardware levels.
The nesting order of thread dimension is crucial as it determines memory coalescing. The number
of threads may vary between thread nests: they might not map to the same levels. In that case, we
use predicated instruction to disable some threads.
5 Experiments
Let us now evaluate our approach on concrete search and code generation problems. We imple-
mented a search strategy that combines a statistical component with a performance model of imple-
mentation candidates (Section 5.1). Our goal is to show that:
• we can generate code competitive with reference hand-tuned libraries (Section 5.2);
• the formalism and performance model allows to extract pertinent information before spec-
ifying decisions (Section 5.4);
• and we improve search performance by making decisions commute (Section 5.5).
All experiments are run on a Linux machine equipped with a 12-core Xeon E5-2620v2 processor,
64GB of RAM and a Quadro K4000 GPU Kepler GPU running under CUDA 8.0.
5.1 Search Strategy
The search space exploration is driven by a Monte-Carlo Tree Search (MCTS) algorithm. We use a
variant of Threshold Ascent on Graph (TAG), which was previously applied to Spiral [13], a code
generator for fast Fourier transforms.
In our case, the algorithm builds a tree whose nodes are candidates. It starts with only the root node,
representing the full search space. It iteratively selects a leaf to expand by using the TAG formula and
evaluation statistics from previous iterations. The leaf expansion creates one child per possible value
of the decision. Then, the MCTS performs a Monte-Carlo simulation to set remaining decisions. It
runs the resulting implementation on the GPU and adjusts the statistics along the selected path with
the execution time. The order in which decisions are taken is fixed upfront. We manually select an
order that specifies most important decisions first. Section 5.5 discusses the impact of the order on
exploration performance.
We complement the TAG algorithm with a performance model of the candidates [2]. The model
provides a lower bound on the execution time of all implementations derivable from a candidate.
We use the bound in two ways:
12
• In the selection phase, we ignore children which have a lower bound higher than the exe-
cution time of the current best implementation.
• During Monte-Carlo simulations, when choosing amongst candidates (Xi)i , we pick Xi
with probability:
p(Xi) ∼ max(T − b(Xi), 0)
whereT is the execution time of the best implementation so far and b(Xi) is the lower bound
given by the performance model to the candidate Xi . Note that p(Xi) = 0 when b(Xi) > T .
In both cases, the performance model prunes regions of the search space which cannot possibly
improve on the current best implementation found. We show in Section 5.4 that this can eliminate
large portions of the search space.
Since this paper focuses on the definition, representation and exploration of the optimization space,
we defer deeper treatment and analysis of the seartch strategy—involving MCTS and performance
modeling—to a later paper.
5.2 Generated Code Performance
We first show the code we generate compares to hand-tuned reference implementations. We created
implementation spaces for a few kernels and compare the execution time of the best implementation
found by our exploration strategy with the reference implementation.
We created all the implementation spaces using a similar procedure. Every instruction is placed in
its own loop nest, with point to point communication between them. We also select a few strip-
mining factors dividing the input size, for each dimension. The search algorithm is free to reorder,
fuse, unroll or vectorize loops or to map them to the different levels of parallelism. It can implement
point-to-point communications using registers or by allocating temporary arrays in shared memory,
and chooses the level of cache to use for each memory accesses.
We consider the following kernels. Unless specified otherwise, the reference implementation calls
CuBLAS, Nvidia’s hand-tuned implementation of basic linear algebra kernels. Matrices are column
major order.
axpy : computes z = α.x + y, where α is a scalar and x, y and z vectors of size n = 226. We
strip-mine n twice, with factors in J2, 4K and J2, 1024K.
matmul : computes C = A · B when A and B are matrices of respective size m × k and k × n. We
strip-mine m and n twice, with factors in J2, 32K and J2, 4K. We try different values for m, n
and k to show how the algorithm adapts. In the rest of the paper we refer to them as matmul
m × n × k.
strided matmul : is the same a matmul 1024 × 1024 × 1024 but with consecutive elements of A
stored with stride of 32. CuBLAS does not support such strides. The reference is a naive
implementation that computes one element of C per thread.
Kernel Space Size Dead Ends Avg. Runtime Reference Speedup
axpy 1.1e11 ± 1.7e10 0.1% ± 0.02 7.05ms ± 0.005 10.3ms 1.47 ±10−3
matmul 256 × 256 × 32 1.83e21 ± 3.3e20 14% ± 0.7 34.2µs ± 2.54 82.8µs 2.42 ±0.18
matmul 1024 × 1024 × 1024 3.5e21 ± 1.8e21 2.8% ± 0.3 4.81ms ± 0.06 3.75ms 0.78 ±0.01
strided matmul 6.0e20 ± 2.0e20 2.3% ± 0.3 10.1ms ± 0.59 637ms 66.7 ±3.9
Table 2: Implementation Space Exploration Results
The benchmarks are representative of both compute-intensive (mm) and bandwidth-bound (axpy)
kernels. They also span a variety of memory access patterns, including strides at different dimen-
sions, transposed layouts, all of these being typical of higher dimensional tensor algebra in compu-
tational chemistry, simulation codes, and machine learning [1].
Table 2 summarises the characteristics of implementation spaces and explorations results. We ran 4
explorations of 4 hours on each kernel. For each exploration, we evaluated the best implementation
40 times. The 95% confidence interval on the average speedup was always within ±0.5%. We
13
report the average runtime and speedup among explorations, along with the maximum variation of
the speedup compared to the average among exploration.
The Space Size and Dead Ends columns respectively provide estimations of the size of the search
space (see Section 5.3) and of the probability to encounter a dead-end when randomly walking the
tree from the root with a uniform sampling among the valid children of each node. Dead ends occur
when constraints propagation is unable to detect incompatible decisions before more choices are
specified. Both columns also indicate the 95% confidence intervals.
For all experiments, the ratio of dead-ends is inferior to a third. This shows that finding valid
implementations is easy. We only use the CSP formalism to encode the search space and not to
search for valid implementations.
axpy and matmul 256×256×32 show that our approach is able to outperform hand-tuned implemen-
tations by finding better implementation decisions. matmul 1024 × 1024 × 1024 compares against
an implementation that almost reaches peak performance. We have no chance of beating it as it
relies on features we do not support. In particular, it uses texture memory and manual allocation of
registers to avoid bank conflicts [22], which is impossible with public Nvidia APIs. However, we
still achieve reasonable performance. strided matmul shows the benefits of our approach for kernels
not available in hand tuned libraries, with a 66× speedup over a naive implementation. While not
reaching peak performance, it is within a factor 2.7 of CuBLAS non-strided version and thus can be
useful.
The high variance on the execution time of the best implementation among explorations on the
same space is a real issue. We are working on developing better search algorithms and refining
the performance model to mitigate the problem. This is to put in perspective with the large size of
implementation spaces and with the fact that this paper is not about search algorithms themselves
but about how to expose the implementation space to them.
5.3 Implementation Space Size Estimation
Computing the exact size of the search spaces is intractable due to the large number of possible
choices. To estimate their size, we turn to probabilistic methods which have been used to reliably
estimate the size of search trees in constraint solvers [19].
The precise method we use was described by Chen [7], which is a generalization of an earlier method
by Knuth [20]. The Knuth algorithm starts at the root and performs a random descent until reaching
a leaf. It then estimates the size of the tree by using the branching factor along the path. The Chen
algorithm, called heuristic sampling, adds the concept of strata: classes of subtrees estimated by
the algorithm designer to be structurally similar. The algorithm maintains a queue containing strata,
represented by a single subtree in the stratum, along with an estimate its size. When a subtree is
discovered, the corresponding stratum in the queue is updated with its count, and it can randomly be
selected as the new representative for the stratum. The total size estimate is then the sum of estimates
for the leaves encountered. A partial order on the strata which is strictly decreasing along the tree
is required to ensure well-formedness. This algorithm was chosen as it strikes a balance between
simplicity and performance. We use a lexicographic pair of the depth in the tree and number of
remaining choices (assuming none get forced through propagation—this provides a simple proxy
for the subtree size to guide the algorithm) as a stratifier.
We computed the results in Tables 2 and 3 with either 1000 iterations of the Chen method or 100, 000
iterations of the Knuth method, whichever gave a better confidence interval. Each time, we report
the 95% confidence interval. For some of the larger implementation spaces, we perform additional
iterations to bring the confidence interval to the correct order of magnitude.
5.4 Discriminant Information in Candidates
We use the size estimates to justify that our representation allows extracting pertinent information
early, with only a few decisions set. Table 3 reports the estimated size of the search space after
pruning the first 10 levels with the performance model. We cut branches whose lower bound was
higher than the average runtime obtained from the experiments of Section 5.2. Note that we use the
execution time without the kernel launch time (obtained using performance counters) instead of the
14
one reported in Table 2 for pruning. We also computed through exhaustive enumeration the number
of nodes at depth 10 with and without cut.
These experiments show that we are able to eliminate a large portions of search spaces early on. For
most kernels, we reduce the size of the space by two or three orders of magnitudes by cutting on the
first 10 levels.
Kernel Space Size
Space Size
with cut applied
on first 10 levels
Nodes at depth 10 Nodes at depth 10with cut applied
axpy 1.1e11 ± 1.7e10 2.9e7 ± 1.1e7 86, 587 121 (0.14%)
matmul 256 × 256 × 32 1.83e21 ± 3.3e20 3.5e18 ± 2.2e18 134, 060 20, 126 (15.0%)
matmul 1024 × 1024 × 1024 3.5e21 ± 1.8e21 9.5e17 ± 4.6e17 161, 980 6, 738 (4.2%)
strided matmul 6.0e20 ± 2.0e20 8.1e17 ± 4.5e17 142,780 9, 512 (6.62%)
Table 3: Implementation Space Exploration Results
5.5 Impact of the Decisions Order
One core feature of our approach is that decisions commute. Here we show how this helps search
algorithms. Experiments in Sections 5.2 and 5.4 first specify memory layout decisions, then size,
then dim kind, then thread mapping, then mem space, then order and last cache. We selected
this order to prioritise what we think are most important decisions. This allows both the performance
model and the MCTS to discriminate higher in the tree and to focus on a fewer branches. This does
not impact implementations, only the structure of the search tree.
We ran experiments on matmul 1024× 1024× 1024 to compare this order with the reverse order. In
both case, we computed the ratio of nodes that the performance model can prune in the first T levels
of the tree, assuming the cut threshold is the average runtime reported in Table 2. The number of
candidates with a depth ≤ d varies greatly with the order of decisions. To have comparable results,
we took for each order the first d such that the number of candidates of depth d is above 105. This
resulted in d = 10 with 1.6e5 for the direct order and d = 16 with 1.1e5 nodes for the reverse.
With the direct order, the performance model reduces the tree size by a factor of 24. This factor
falls down to 1.8 for the reverse order. This is 13 times more branches to consider. We also tried
running the search algorithm using the reverse order but it ran out of memory due to an explosion
in the number of branches. While we illustrated the impact of the decisions order with the perfor-
mance model, it also applies to other algorithms: picking the most discriminant decisions first helps
focusing the search.
6 Discussion and Related Work
Traditional optimizers work on a single implementation that they iteratively improve using rewrite
rules based on pattern matching. In constrast, we propose to work on classes of implementations
modulo optimization. We first discuss how it obviates the problem of optimization ordering and
enable global heuristics aware of all potential optimizations. Then we explain how our system differ
from other encodings of compilation problems in logical frameworks.
6.1 Partial Implementations
Ordering decisions is a common problem in compilers: transformations may enable, disable or even
undo others. Some domain-specific approaches avoid the issue with a carefully curated set of rules.
They then build and explore a tree (or graph) whose nodes are implementations and edges rules
applications. For example, Spiral [26] uses this approach for fast Fourier transforms, LGen [29]
for linear algebra kernels and TCE [1] for large tensor contractions in computational chemistry and
physics simulation. More recently, LIFT [31] applied rules to rewrite a functional representation
down to low-level OpenCL constructs and generate efficient GPU code.
However, these systems still suffer from the original problem: transformations may be hidden behind
others. In contrast, our representation allows to see from the start which decisions are available in
15
which branch and to make most important decisions first. An expert programmer can even manually
set decisions upfront.
An alternative to rewrite rules is to use algorithmic skeletons [10] and to map them to the hardware
using a fixed strategy that leverages domain specific information. In its simplest form, this is just
parametric libraries such as Thrust [3] and SkelCL [30]. Otherwise, it takes the form of high level
functional operators in a domain specific language such as Delite [32], Copperhead [5], Accelerate
[6] or NOVA [11]. While theses systems allow for optimizations, such as the fusion of operators,
they rely on a fixed strategy that makes it hard to adapt to different hardware, different input sizes of
new kind of computations.
The Sea of Nodes approach [9] places instructions outside of basic blocks when possible, effectively
representing classes of implementations modulo scheduling. However, this is limited to scheduling
decision.
6.2 Global Heuristics
The partially instantiated vector offers a complete view of potential implementations. This allows
to define heuristics aware of what a fully optimized implementation look like. The lower bound
performance model mentioned in Section 5 could not work if it just had access to an intermedi-
ate implementation in the compilation process. A similar performance model relying on ad-hoc
partial implementations was previously introduced [2]. We generalize the idea by encoding partial
implementations as a CSP problem on top of a semantic backbone.
Our approach is close to Equality Saturation [33], with similar benefits. Equality Saturation uses
rewrite rules but keeps both the original and rewritten pattern. However the number of patterns can
grow exponentially with the number of rules. In contrast, our decision vector has a fixed size. A
fixed size vector also makes it easier to extract features for machine learning algorithms.
Frameworks that dissociate the algorithm from the schedule, such as Halide [27] for image pro-
cessing and TVM [8] for machine learning arguably also deal with partial implementations. The
algorithm is akin to our semantics backbone and the schedule to our decision vector. However,
they do not have an easy way to reason about partial schedules. TVM applies machine learning
techniques, but only on fully specified implementations. An interesting idea would be to use our
approach on top of their representation to explore schedules. This is also true for other script-based
transformation tools such as UTF [18] or URUK [17] that start from a fully specified implementation
but lift it into a mathematical representation that abstracts the initial schedule.
6.3 Encoding as an Operation Research Problem
The idea of encoding compilation problems in logical frameworks is not new. Polyhedral compila-
tion encodes transformations on loop nests as Integer Linear Programming [15, 16] and PPCG [35]
applies it to generate code for GPUs. Super-optimization techniques also use CSP [23] and SAT
[28] solvers to generate optimal sequences of instructions.
The originality of our approach is to use CSP to expose potential decisions, not to find a solution
(Section 5.2 shows that this is easy in our case). This allows us to manipulate whole sets of potential
implementations, and is a core reason for using CSP: it is easier to guide the search through the
domains, while SAT and ILP solvers often act more like black boxes.
ILP-based approaches maximize a single metric (such as data locality in polyhedral schedules[4])
that does not reflect the full complexity of the architecture. Because we do not try to embed perfor-
mance constraints in the logical framework, we have much more flexibility and can use a combina-
tion of custom heuristics, actual evaluations and statistical search.
One way of solving this problem while staying in an ILP framework is to find schedules for a
single loop level at a time, starting from the outermost [4]. This enables more complex heuristics
and incremental evaluation at each level that goes beyond the expressive power of linear objective
functions [36]. However, loop levels must still be considered in a fixed order. It would be interesting
to use our approach with a similar encoding. Polyhedral compilers have been designed with search
algorithms complementing or replacing linear programming. Pouchet et al. proposed custom genetic
operators on affine schedules to search for dependence-preserving loop nest optimizations [24, 25].
16
Vasilache et al. constructed a two-level optimization strategy where the lower tier is a gray box
based on integer linear programming, exposing a fixed set of strategies and parameters to a higher
tier genetic algorithm [34].
Diesel [14] is a recent framework from Nvidia instantiating an ILP-based scheduler for a specific do-
main, specializing it for each kernel with parameters specific to a GPU micro-architecture (tile sizes,
mapping strategies), complemented with target-specific transformations (e.g., software pipelining,
instruction-level and register-level optimizations). Code generated by Diesel reaches impressive
performance levels, matching or outperforming native libraries. This involves deep knowledge of
the target architecture encoded in the optimization strategy, and register-level optimizations not cur-
rently modeled in our search space.
Overall, our domain-specific language for defining decisions and constraints proved useful for de-
veloping an implementation space for linear algebra, facilitating the design and experiments with
numerous decisions and constraints. We are not aware of any other approach reaching the same
level of automation while remaining competitive with native GPU libraries.
7 Conclusion
We presented a generic approach for program optimization based on partially specified implementa-
tions. In our approach, optimization decisions are available upfront and can be applied in any order,
obviating the problem of optimization ordering. The most performance-impacting decisions can be
made first, and can be specified manually through expert knowledge when available.
At the core of our representation is a decision vector listing all possible decisions for each choice.
Global heuristics can be defined which operate on sets of possible implementations. They can derive
pertinent information, such as profitability estimates to guide the search and performance bounds to
prune the space, long before all decisions have been made.
Starting from a high level description of the implementation choices and their interaction, we built
a tool leveraging constraint programming principles to derive efficient code for the manipulation of
decision vectors. We applied this tool to the construction of a partial implementation search space
for linear algebra kernels running on GPUs, generating code competitive with or outperforming
carefully hand-tuned reference libraries.
So far, we limited ourselves to relatively simple search algorithms with a few remaining hardwired
decisions. For example, our representation allows to search for the best decision order, and we are
working on more complex algorithms that fully exploit this potential. It also enables easily sharing
information between branches of the search tree as choices have fixed positions in the decision
vector, opening up more opportunities to prune the search space.
References
[1] G. Baumgartner, A. Auer, D. Bernholdt, A. Bibireata, V. Choppella, D. Cociorva, X. Gao,
R. Harrison, S. Hirata, S. Krishnamoorthy, S. Krishnan, C. Lam, Q. Lu, M. Nooijen, R. Pitzer,
J. Ramanujam, P. Sadayappan, and A. Sibiryakov. Synthesis of high-performance parallel pro-
grams for a class of ab initio quantum chemistry models. Proceedings of the IEEE, 93(2):276–
292, Jan. 2005.
[2] U. Beaugnon, A. Pouille, M. Pouzet, J. Pienaar, and A. Cohen. Optimization space pruning
without regrets. In Proceedings of the 26th International Conference on Compiler Construc-
tion, CC 2017. ACM, 2017.
[3] N. Bell and J. Hoberock. Thrust: A productivity-oriented library for cuda. In GPU computing
gems Jade edition, pages 359–371. Elsevier, 2011.
[4] U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan. A practical automatic poly-
hedral parallelizer and locality optimizer. In Acm Sigplan Notices, volume 43, pages 101–113.
ACM, 2008.
[5] B. Catanzaro, M. Garland, and K. Keutzer. Copperhead: compiling an embedded data parallel
language. ACM SIGPLAN Notices, 46(8):47–56, 2011.
17
[6] M. M. Chakravarty, G. Keller, S. Lee, T. L. McDonell, and V. Grover. Accelerating haskell
array codes with multicore gpus. In Proceedings of the sixth workshop on Declarative aspects
of multicore programming, pages 3–14. ACM, 2011.
[7] P. C. Chen. Heuristic sampling: A method for predicting the performance of tree searching
programs. SIAM Journal on Computing, 21(2):295–315, 1992.
[8] T. Chen, T. Moreau, Z. Jiang, H. Shen, E. Q. Yan, L. Wang, Y. Hu, L. Ceze, C. Guestrin, and
A. Krishnamurthy. TVM: end-to-end optimization stack for deep learning. CoRR, 2018.
[9] C. Click and K. D. Cooper. Combining analyses, combining optimizations. ACM Transactions
on Programming Languages and Systems (TOPLAS), 17(2):181–196, 1995.
[10] M. I. Cole. Algorithmic skeletons: structured management of parallel computation. Pitman
London, 1989.
[11] A. Collins, D. Grewe, V. Grover, S. Lee, and A. Susnea. Nova: A functional language for
data parallelism. In Proceedings of ACM SIGPLAN International Workshop on Libraries,
Languages, and Compilers for Array Programming, page 8. ACM, 2014.
[12] NVIDIA cuDNN GPU accelerated deep learning.
https://developer.nvidia.com/cudnn.
[13] F. de Mesmay, A. Rimmel, Y. Voronenko, and M. Puschel. Bandit-based optimization on
graphs with application to library performance tuning. ICML ’09. ACM, 2009.
[14] V. Elango, N. Rubin, M. Ravishankar, H. Sandanagobalane, and V. Grover. Diesel: Dsl for
linear algebra and neural net computations on gpus. In Proceedings of the 2nd ACM SIGPLAN
International Workshop on Machine Learning and Programming Languages, MAPL 2018,
pages 42–51, New York, NY, USA, 2018. ACM.
[15] P. Feautrier. Some efficient solutions to the affine scheduling problem. i. one-dimensional time.
International journal of parallel programming, 21(5):313–347, 1992.
[16] P. Feautrier. Some efficient solutions to the affine scheduling problem. part ii. multidimensional
time. International journal of parallel programming, 21(6):389–420, 1992.
[17] S. Girbal, N. Vasilache, C. Bastoul, A. Cohen, D. Parello, M. Sigler, and O. Temam. Semi-
automatic composition of loop transformations for deep parallelism and memory hierarchies.
International Journal of Parallel Programming, 34(3):261–317, 2006.
[18] W. Kelly and W. Pugh. A framework for unifying reordering transformations. Technical report,
1998.
[19] P. Kilby, J. Slaney, S. Thie´baux, and T. Walsh. Estimating search tree size. In Proceedings
of the 21st National Conference on Artificial Intelligence - Volume 2, AAAI’06, pages 1014–
1019. AAAI Press, 2006.
[20] D. E. Knuth. Estimating the efficiency of backtrack programs. Mathematics of Computation,
29(129):121–136, 1975.
[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional
neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors,
Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates,
Inc., 2012.
[22] J. Lai and A. Seznec. Performance upper bound analysis and optimization of sgemm on fermi
and kepler gpus. In Proceedings of the 2013 IEEE/ACM International Symposium on Code
Generation and Optimization (CGO), pages 1–10. IEEE Computer Society, 2013.
[23] P. M. Phothilimthana, T. Jelvis, R. Shah, N. Totla, S. Chasins, and R. Bodik. Chlorophyll:
Synthesis-aided compiler for low-power spatial architectures. In ACM SIGPLAN Notices, vol-
ume 49, pages 396–407. ACM, 2014.
[24] L. Pouchet, C. Bastoul, A. Cohen, and J. Cavazos. Iterative optimization in the polyhedral
model: part ii, multidimensional time. In Proceedings of the ACM SIGPLAN 2008 Conference
on Programming Language Design and Implementation, Tucson, AZ, USA, June 7-13, 2008,
pages 90–100, 2008.
[25] L.-N. Pouchet, U. Bondhugula, C. Bastoul, A. Cohen, J. Ramanujam, P. Sadayappan, and
N. Vasilache. Loop transformations: convexity, pruning and optimization. In ACM SIGPLAN
Notices, volume 46, pages 549–562. ACM, 2011.
18
[26] M. Puschel, J. M. Moura, J. R. Johnson, D. Padua, M. M. Veloso, B. W. Singer, J. Xiong,
F. Franchetti, A. Gacic, Y. Voronenko, et al. SPIRAL: Code generation for DSP transforms.
IEEE, 93(2), 2005.
[27] J. Ragan-Kelley, A. Adams, S. Paris, M. Levoy, S. Amarasinghe, and F. Durand. Decoupling
algorithms from schedules for easy optimization of image processing pipelines. 2012.
[28] A. Solar-Lezama, L. Tancau, R. Bodik, S. Seshia, and V. Saraswat. Combinatorial sketching
for finite programs. ACM Sigplan Notices, 41(11):404–415, 2006.
[29] D. G. Spampinato and M. Pu¨schel. A basic linear algebra compiler. In Annual IEEE/ACM
International Symposium on Code Generation and Optimization (CGO), page 23. ACM, 2014.
[30] M. Steuwer, P. Kegel, and S. Gorlatch. Skelcl-a portable skeleton library for high-level gpu
programming. In Parallel and Distributed Processing Workshops and Phd Forum (IPDPSW),
2011 IEEE International Symposium on, pages 1176–1182. IEEE, 2011.
[31] M. Steuwer, T. Remmelg, and C. Dubach. LIFT: A functional data-parallel IR for high-
performance GPU code generation. In Annual IEEE/ACM International Symposium on Code
Generation and Optimization (CGO). IEEE, 2017.
[32] A. K. Sujeeth, K. J. Brown, H. Lee, T. Rompf, H. Chafi, M. Odersky, and K. Olukotun. Delite:
A compiler architecture for performance-oriented embedded domain-specific languages. ACM
Transactions on Embedded Computing Systems (TECS), 13(4s):134, 2014.
[33] R. Tate, M. Stepp, Z. Tatlock, and S. Lerner. Equality saturation: a new approach to optimiza-
tion. In ACM SIGPLAN Notices, volume 44, pages 264–276. ACM, 2009.
[34] N. Vasilache, O. Zinenko, T. Theodoridis, P. Goyal, Z. DeVito, W. S. Moses, S. Verdoolaege,
A. Adams, and A. Cohen. Tensor comprehensions: Framework-agnostic high-performance
machine learning abstractions. CoRR, abs/1802.04730, 2018, 1802.04730.
[35] S. Verdoolaege, J. Carlos Juega, A. Cohen, J. Ignacio Go´mez, C. Tenllado, and F. Catthoor.
Polyhedral parallel code generation for CUDA. ACM Trans. Archit. Code Optim., 9(4), Jan.
2013.
[36] O. Zinenko, S. Verdoolaege, C. Reddy, J. Shirako, T. Grosser, V. Sarkar, and A. Cohen. Mod-
eling the conflicting demands of parallelism and temporal/spatial locality in affine scheduling.
In Proceedings of the 27th International Conference on Compiler Construction (CC), Vienna,
Austria, Feb. 2018.
19
