Towards scalable pattern-based optimization for dense linear algebra by Berényi, Dániel et al.
Towards scalable pattern-based optimization for
dense linear algebra
Da´niel Bere´nyi*1, Andra´s Leitereg2, and Ga´bor Lehel2
1GPU Lab, Wigner Research Centre for Physics, Budapest, Hungary
2Faculty of Informatics, Eo¨tvo¨s Lora´nd University, Budapest, Hungary
Abstract
Linear algebraic expressions are the essence of many computationally
intensive problems, including scientific simulations and machine learning
applications. However, translating high-level formulations of these expres-
sions to efficient machine-level representations is far from trivial: devel-
opers should be assisted by automatic optimization tools so that they can
focus their attention on high-level problems, rather than low-level details.
The tractability of these optimizations is highly dependent on the choice
of the primitive constructs in terms of which the computations are to
be expressed. In this work we propose to describe operations on multi-
dimensional arrays using a selection of higher-order functions, inspired by
functional programming, and we present rewrite rules for these such that
they can be automatically optimized for modern hierarchical and hetero-
geneous architectures. Using this formalism we systematically construct
and analyse different subdivisions and permutations of the dense matrix
multiplication problem.
1 Introduction
The computational problems we use computers to solve grow ever more
complex and demanding, and we need appropriately powerful tools to
help us tackle them. While the hardware which performs these compu-
tations is advancing at a sufficient rate to keep pace with our increasing
demands, the programming languages and development tools we use to
implement them are not, resulting in increasing burdens being placed on
the shoulders of programmers. A profound example of this can be seen in
the optimization of large linear algebraic expressions that arise in scien-
tific simulations, multi-dimensional data processing (such as image- and
video analysis), and machine learning applications. Naively written algo-
rithms tend to greatly underutilize the capabilities of modern hardware.
The cause of this, at root, is in the hierarchical memory and cache ar-
chitectures and multi-level parallelism that characterizes such hardware.
At a low level, most computations are limited by memory bandwidth:
it is difficult to consistently keep the execution hardware supplied with
new data for it to work on. In many cases, the same data is accessed
repeatedly, and storing it far away from the processor results in excessive
transfers, high latency, and a waste of bandwidth. Unfortunately, fast
memory which is close to the processor is expensive in terms of area and
1
ar
X
iv
:1
80
5.
04
31
9v
1 
 [c
s.D
C]
  1
1 M
ay
 20
18
cooling, and therefore available only in limited quantities. Together, these
considerations give rise to hierarchical cache- and memory layouts where
large memories are slow and fast memories are small, and software must
take care to place data carefully in order to maximise utilization.
Common levels of the hierarchy are the following:
• SIMD (Single Instruction Multiple Data) parallelism, and registers
• Thread level parallelism, and thread-private memory
• Thread-group level parallelism, and thread-local memory (GPU) or
caches (CPU), shared within, but not between groups
• Device level parallelism, and device memory (within an entire CPU
or GPU)
• Node level parallelism, involving multiple devices (CPUs and GPUs)
which communicate via an internal bus (can have shared address
space)
• Cluster level parallelism, involving nodes which communicate via a
network (data must be serialized)
At the lowest level, which is the level of instructions, hardware uses
vector instructions which can process 2, 4, 8, 16, or even more pieces of
data simultaneously. Instructions are executed concurrently in multiple
threads, transferring memory between their private memory and regis-
ters. Modern CPUs often support tens of concurrent threads, and GPUs
support thousands. At the next level up, a single computational node
often contains many separate CPUs and GPUs. Finally, multiple com-
putational nodes are connected together using a network to form clusters
and grids. To efficiently perform a computation, it must be split into pro-
gressively smaller parts at each level of the hierarchy, but one would prefer
not to have to do this splitting manually in each case, because they are in
many regards extremely similar: at a high level of abstraction, splitting,
for example, a summation of many elements between vector instructions
and between nodes of a cluster requires analogous logic, only a separate
implementation.
The traditional approach to linear algebra is BLAS (Basic Linear Alge-
bra Subroutines, [2]), which provides hand-optimized versions of many of
the most common operations, such as scalar-vector, vector-vector, vector-
matrix and matrix-matrix routines, for single and double precision floating
point as well as complex number types. There are two problems with this
approach. The first is that if a required operation is not in the collection,
either due to differing structure, or to operating on a different number
type, then the pre-selected collection of operations provides no tools with
which to build it. The second, that calls to these predefined routines are
not efficiently composable: intermediate results of one operation must be
written out to memory before the next one can use them, resulting in
wasteful transfers. As a complementary approach, much effort has been
spent on optimizing loops in existing imperative code bases, as done, for
example, by the Polly optimizer [3]. Unfortunately, as the structure of
loops is very flexible, it is highly difficult to implement sufficiently power-
ful pattern recognition to find optimization opportunities in hand-written
code. Newer approaches (such as Eigen [4], or Armadillo [5] and other
competing libraries) first construct an expression tree out of the supplied
high-level primitives, analyze it, and only then attempt to fuse operations
within it.
2
A similar situation arises in machine learning, especially when done
using neural networks, where the user constructs high-level graphs de-
scribing the computation out of predefined primitives. For example, the
widely used machine learning framework TensorFlow [6] suffers from the
same forced memory write-out problem as BLAS, and only recently is
an optimizing just-in-time compiler (XLA [7]) being developed which at-
tempts to fuse and optimize operations in the computational graph in
order to eliminate temporaries.
In every case, the root of the problem is twofold: what are the best
primitives to use in the domain specific language in order to describe
the operations the users want to represent? And what are the patterns
(particular combinations of these primitives) which can be automatically
detected and recombined in order to yield a more optimal execution, both
in terms of memory usage as well as parallelism?
Most approaches to solving these problems naturally divide them into
multiple levels: a higher level consisting of (say) algorithmic primitives,
a lower level of hardware-specific building blocks, and sometimes, even
further lower-level representations beyond that [1][8]. In this work, we
primarily focus on the higher level problem of choosing the right prim-
itives in a theoretically justified way and constructing their associated
rewrite rules in such a way that the result is a closed system, rather than
delving into the details of the lower-level primitives, which both differ
at each of the different levels of parallelism mentioned above, and have
also already been extensively studied by others [9][10]. Our sense is that
in the existing approaches, a significant amount of genericity has been
missed in the choice of the high-level building blocks, and that this limits
their scalability.
In this paper, we proceed as follows. First, we provide some examples
of fusion problems from linear algebra and neural networks. Next, we
assemble a collection of primitives which together are flexible enough to
express a wide range of useful operations, and which are highly amenable
to optimization. Further on, we describe rewrite rules which can be au-
tomatically applied to an expression tree consisting of these primitives in
order to fuse or rearrange operations within it. Finally, we demonstrate
on the matrix multiplication problem how these rewrite rules can improve
data locality and performance.
2 Motivating examples
Inflexible libraries centered around providing pre-written, non-parametric
primitives suffer from too many temporaries, for example the expressions:
wi =
∑
j
(Aij +Bij) · (vj + uj), (1)
Cik =
∑
j
Aij ·Bjk · gj , (2)
can be fused into a minimal modifications of the usual matrix-vector and
matrix-matrix multiplication formulas respectively, but one wants to do
such transformations automatically. Obviously the terms can contain
other more composite but still low arithmetic density operations (scalar
multiplications: yi = c · xi, outer products: Aij = xi · yj). One cannot
expect that a single primitive can encompass this variability, so analy-
sis is required on the whole subexpression tree. In the case of neural
3
networks, layers are usually composed of some dense arithmetic transfor-
mation (affine transformation or convolution), a non-linear function and
some normalization in between [12]:
ybk =
∑
i
Wikx
b
i + βk, (3)
zk =
ybk − E[yb]√
V [yb]
, (4)
rk = h(zk) (5)
where eq. 3. is a dense transformation, that can be a convolution too, eq.
4. is batch normalization, where E is the average, V is the variance over
the collection indexed by k, and finally eq. 4 is an element wise applied
non-linearity.
Here, again, it is preferable, to fuse all these steps into a single opera-
tion without temporaries, because the last two parts are of low arithmetic
density. While common cases can be hand written for end users, due to
the fast pace of developments in the field modifications of all these steps
and the invention of new ones pose a challenge to framework creators.
Another important problem is concerned with properly dividing the data
of a computation into the hierarchical memory layout of the target device.
Again, we might strive to do this for arbitrary expressions, rather than
for just pre-written primitives. Consider the earlier example, or a slight
variation of it:
Cik =
∑
j
Aij ·Bjk · gj (6)
Cipq =
∑
jk
Aijk ·Bjp · Ckq · gj · fk (7)
Such formulas pop up in the numerical solution of partial differential equa-
tions, and they directly follow from the structure of the original equation.
The important question here is what the optimal partitioning of the data
in the hierarchical memory is to maximize data reuse (caching) and thus
maximize performance.
Both linear algebra and neural networks deal with multidimensional
data: scientific computations usually deal with 2 or 3 dimensions but
for some applications higher dimensions are also common. Neural net-
works also usually deal with 2 or 3 dimensional data (images or video),
but these have color dimensions also. When processing images or frames
usually multiple samples are grouped together (called batching) that cre-
ates another dimension. Altogether, these languages need primitives that
can abstract over dimensionality, and be parameterizable so future ex-
tensions would not necessarily need new primitives. In the following, we
build a language suitable for expressing multi-dimensional linear algebra
operations, such as vector, matrix or tensor products and later construct
rewrite rules that can optimize the expressions built in this language to
be efficient on modern hardware endowed with hierarchical memory and
parallelism.
4
2.1 Functional DSL for dense linear algebraic ex-
pressions
Most dense operators that are common in linear algebra, image process-
ing or in neural networks are different kinds of repetitive operations on
a composite structure, where the structure might be multidimensional.
The operations can be reductions, dimension keeping transformations, or
enlarging operations. Simplification and fusion of these operations is of
high interest, because these form the innermost and most extensively used
parts of many modern computational solutions, from Big Data to Machine
Learning or large scale simulations. Traditional imperative formulations
(like for loops) are way too general to be suitable for straightforward op-
timization, or require complex and expensive analysis before that. Func-
tional primitives, that are more restricted, however can provide suitable
expressivity, easier automatization and parallelisation possibilities. Here,
we assume that all the dimension, shape and layout information is repre-
sented at the type level (and thus, known at compile time), although we
may omit them to simplify the notation. A multidimensional flat array
type is represented as a(e1,e2,...edim), with the constructor:
data Array0 :: (dim :: Nat)→ (e1, e2, ..., edim)→ ∗ → ∗ (8)
where dim is the number of dimensions, and ek is the extent in each
dimension. Such representations were investigated earlier [13] and Nape-
rian functors were found to be a useful abstraction, which essentially state
that such containers are identical to functions from an index set to the
elements:
f a ' Idx n→ a (9)
An important property is that composition of Naperian functors can be
indexed by product of indices:
f (g a) ' Idx nf × Idx ng → a (10)
and product of Naperian functors can be indexed by the sum of indices:
(f a, g a) ' Idx nf | Idx ng → a (11)
Naperian functors were named after John Napier, the inventor of the
logarithm, due to the close resemblance to the logarithmic identities log(f◦
g) ' log f × log g and log(f × g) ' log f + log g. However, to us a more
important property is that nested Naperian functors can be transposed:
f (g a) ' g (f a) (see [13] for details). In our case we not just simply
use this property for expressing the transposition of functor structures,
but rather to express exchange of subdivisions over such containers as
demonstrated below.
Nested, or subdivided arrays (arrays of arrays) can be represented by
iterating the exponentiation,
data Array ::
(dim :: Nat, depth :: Nat)→ ((e1, e2, ...edim), (f1, ..., fdim), ...depth)→ ∗ → ∗
(12)
for example: a(2,3)(4,5) is a matrix of elements of type a, such that the
outer level is divided into 4 block rows and 5 block columns, and each
5
of these sub blocks is a matrix with 2 rows and 3 columns, so the whole
structure contains (2× 4 · 3× 5) = 120 elements.
However, as we are going to use operations that consume strictly one
(the outermost) dimension of the input structure, it is easier to formulate
the computation if we represent the subdivided arrays as flat multidimen-
sional ones. For this we use strides (step sizes over the data corresponding
to each dimension) denoted as a((e1,s1),(e2,s2),...,(en,sn)). For the above 120
elements (stored in a linear memory) a((3,1),(2,3),(5,6),(4,30)) is a simple 4 di-
mensional tensor in row-major order, but if we interpret the same elements
as a((3,1),(2,15),(5,3),(4,30)), then we get the described subdivided array. As
we are to change the number and the order of the dimension consuming
operations, we need to make corresponding changes in the logical layout
of the data:
• subdiv d b s: creates a new subdivision of structure s by splitting
the extent at dimension d into blocks of size b.
If subdiv d b (M :: a(e0,s0)...(en−1,sn−1)) = M ′ :: a(e
′
0,s
′
0)...(e
′
n,s
′
n) then
(e′i, s
′
i) = (ei, si), (i < d)
(e′d, s
′
d) = (b, sd), where b must be a divisor of ed
(e′d+1, s
′
d+1) = (ed/b, b · sd),
(e′i, s
′
i) = (ei−1, si−1), (i > d+ 1).
• flatten d s: merges the adjacent depths d and d + 1 into a single
layer, it is the inverse of subdiv. If flatten d (M :: a(e0,s0)...(en,sn)) =
M ′ :: a(e
′
0,s
′
0)...(e
′
n−1,s
′
n−1) then
(e′i, s
′
i) = (ei, si), (i < d)
(e′d, s
′
d) = (ed · ed+1, sd),
(e′i−1, s
′
i−1) = (ei+1, si+1), (i > d)
• flip d1 d2 s: flips (transposes) the layout by swapping any two
dimensions (extent and stride together). flip applied twice is identity
and it is commutative in d1 and d2. If we omit the second argument,
its default value is d2 = d1 + 1.
To compare the representation and the functions with [13], we see that
this model can represent arbitrary, but same dimensional subdivision and
collapse of a given multi dimensional array. This follows the intuition of
practical problems, that subdivide the initial structure into sub chunks
while keeping the dimension but divide the extents to improve memory
usage. The extra constraint, that we have between the extents is that they
must properly divide the number of elements in the given direction, and
thus we cannot simply say, that we have a list (composition) of arbitrary
functors in the container type, but we need to express the specific numbers
as list of tuples and introduce the above operations with regards to these
extents and dimensions.
We note that flip really corresponds to the flip for functions:
flip :: (a→ b→ c)→ (b→ a→ c) (13)
due to the Naperian function connection. It seems tempting to associate
subdiv with curry and flatten with uncurry, but we have the extra
constraint of divisibility.
We now list the three most common operations on the above intro-
duced arrays:
map :: (a→ b)→ f a→ f b (14)
6
zip :: (a→ b→ c)→ f a→ f b→ f c (15)
reduce :: (a→ a→ a)→ f a→ a (16)
map and zip (in Haskell: zipWith) are elementwise operations over an
array f: these can represent e.g. scalar multiplication or division of vectors,
matrices and tensors, or colour transformations of pixels of an image,
or in the case of zip elementwise sums, differences. reduce as its name
suggests can reduce a structure into a summary value using a binary
operator: this can be used to calculate sums, products, minimal/maximal
values etc. In contrast to fold, it takes at least 1 element. Another
important property of these operations is that they are straightforward to
parallelize. map and zip are considered to apply their function argument
completely independently for each element, or element pairs. For reduce,
if the binary operation is associative, we can regroup the reduction and if
it is also commutative we can even reorder it. These transformations are
important for optimizing memory access patterns [14].
We can now translate formulas to our DSL, for example, consider the
vector-matrix product:
ui =
∑
j
Aij · vj (17)
u = map(\r→ reduce (+) (zip (∗) r V) A (18)
where A is a row-wise matrix of n elements per row and v is an array
of n scalars. The main property we want to optimize is the number of
times new threads are spawned. Each higher order function may represent
multiple parallel threads, which are costly to spawn. Also, each of these
calls may need temporary memory allocated to store intermediate results.
Minimizing both of these drastically improves performance on modern
hardware.
3 Rewrite rules for higher-order functions
The fundamental idea of the approach presented here is that the possible
rewrites are induced by the structure of types and operations on them, ie.:
given a simple expression of a computation as a combination of the above
primitives, we would like to automatically change it in such a way that
it produces the same results but with different (probably better) runtime
performance and more suitability for parallel execution. This cannot be
purely driven by the types of the arrays, because their relations are ex-
pressed in the higher-order functions operating on them, however types
can express the matching subdivision structures and track rearrangements
and signal potential mistakes in the implementation of the rewrites.
The important question is: can we identify a set of rewrite rules which
together suffice to optimize any possible combination of the above primi-
tives?
We consider two groups of rewrite rules:
1. Cases where the higher-order functions form a pipeline (sequential
composition), with the output of one being the input of the next.
These are be fused by composition operations.
2. Cases where one higher-order function is passed to another as an
argument function for it to apply. These cases describe operations
7
b1 b1b1
b2 b2b2
nzip (f::a1->a2->a3->c)
a1 a1a1
a2 a2 a2
a3 a3 a3
c c c
nzip (f::b1->b2->a2)
=
nzip (f::a1->b1->b2->a3->c)
a1 a1a1
a3 a3 a3
c c c
b1 b1b1
b2 b2b2
Figure 1: Illustration of nzip fusion
on nested structures. We cannot fuse them, but we are interested in
exchanging them implicitly relying on the Naperian functor property.
To present the rewrite rules we use Haskell syntax, but we emphasize
that these rules follow from simple operations on functions (since these
types are Naperian functors) that can be represented in virtually any
language.
In the first group we start with the simplest fusion law. From the
functor property, we know that:
map f . map g = map (f . g) (19)
This allows us to collapse any length of composed maps into a single
call to map. The question is, whether we can do similarly for zip or
zip-map compositions. It turns out that the composition of other zips
before the arguments of a zip is not closed as it can lead to three or more
parameters to the zip, so we need to generalize them to arbitrary number
of arguments. Let us now define the n-ary version of map and zip that we
call nzip:
nzip :: (an → b)→ (f a)n → f b (20)
where the notation an is understood as:
a0 → a1 → a2 → ...→ an (21)
for any finite n. We may reduce the definition if we wish to:
nzip :: (an)→ (f a)n. (22)
Now, if we define the generalised composition:
ncomp :: (i :: Nat)→ (an → c)→ (bm → ai)
→(a0...i−1 → b0 → ...→ bm−1 → ai+1...n−1 → c) (23)
ncomp f g = ... (24)
that composes g before the i-th argument of f (we index arguments from
0), and we can state the general composition rule for nzip as (see also
Figure 1):
8
nzip f xs0...i−1 (nzip g ys) xsi+1...n−1 =
nzip (ncomp i f g) xs0...i−1 ys0...m−1 xsi+1...n−1 (25)
This way nzip (and thus map and zip) is now closed under arbitrary com-
positions. Next we turn to reductions. It seems feasible to separate the
reduction function from all preprocessing/merging done on the arguments,
so we define reduce-of-zips (following the idea of foldMap from the Haskell
library Data.Foldable) for arbitrary number of arguments:
rnz :: (b→ b→ b)→ (an → b)→ (f a)n → b (26)
The first argument is a reduction function, that is assumed to be at least
associative, the second argument is a zipping function that elementwise
zips the final n arguments of rnz. Now, we can consider the problem of
composing arbitrary maps or zips before a reduction and fusing them into
a single reduction, for a map:
rnz r f (nzip g xs0...n−1) = rnz r (ncomp 0 f g) xs0...n−1 (27)
or for a zip:
rnz r f (nzip g xs0...n−1) (nzip h ys0...n−1) =
rnz r (ncomp 0 (ncomp 1 f g) h) xs0...n−1 ys0...n−1 (28)
One important application is that now we can express the dot product of
vectors: c =
∑
i ui · vi in a single call as:
dot u v = reduce (+) (zip (∗) u v) = rnz (+) (∗) u v (29)
In the general case, reductions with different first arguments cannot be
fused. Even if r is the same, the outer map function f should be identity so
that the whole construction expresses the regrouping of a single reduction
and thus can be fused to one.
For completeness, we review some rules concerning products. The
implicit assumption here is that products of types are usually used to
represent a smaller amount of data than arrays, despite that arrays are
just repeated products of the same type. One can express the infamous
Array-of-Structures vs. Structures-of-Arrays optimization pattern as:
Array dim layout (a, b) = (Array dim layout a, Array dim layout b)
(30)
The simplest rule concerning the higher-order functions is
(map f x, map g y) = map (f, g) (x, y) (31)
Where (f, g) is understood as the function product ((∗ ∗ ∗) in Haskell’s
Control.Arrow) :: (a→ b)→ (c→ d)→ (a, c)→ (b, d). If the argument
is the same for both functions, we can make use of the fanOut function
(fanOut :: (a→ b)→ (a→ c)→ a→ (b, c)):
(map f, map g) x = map (fanOut f g) x (32)
Trivially we can extend these definitions to arbitrary number of products,
and also extend them for the other functional primitives:
(zip f x y, zip g p q) = zip (f, g) (x, y) (p, q) (33)
9
(reduce f x, reduce g y) = reduce (f, g) (x, y) (34)
with the generalization of the product to multiple arguments:
(a→ b→ c)→ (x→ y→ z)→ (a, x)→ (b, y)→ (c, z) and assuming that
the functors are all of compatible sizes.
We now turn to cases that are not compositions, but are operating on
nested structures, such that the higher-order-functions are nested in their
function parameters. We are interested in structure changing rules that
can exchange the order of higher-order functions so we can use them to
regroup computations.
The simplest such rewrite is flipping maps. Consider the dyadic prod-
uct:
Aij = vi · uj (35)
this can be written as:
map (\x→ map (\y→ x ∗ y) u) v (36)
or
map (\y→ map (\x→ x ∗ y) v) u (37)
up to a flip in the functor structure: the first result matrix is expressed
as an array of rows, while the second as columns. Note, that the multi-
plication is done in the same order so commutativity is not required for
this transformation to work. It is straightforward to extend this trans-
formation to zips, so this is omitted here. The less trivial transformation
is, that maps and reductions can be flipped in a similar way, consider the
matrix-vector product:
vi =
∑
j
Aij · uj . (38)
If A is a standard flat multidimensional array in row-major order, then
the multiplication can be expressed in the textbook way as:
map (\r→ rnz (+) (∗) r u) A (39)
that can be also written as:
rnz (zip (+)) (\c q→ map (\e→ e ∗ q) c) (flip 0 A) u (40)
(we pulled out the rnz part from the definition of dot). The former is
performing the computation by taking the dot product of each row and the
vector, while the latter corresponds to the multiplication of the columns
of the matrix by elements of the vector and adding them together with the
vector addition. The crucial point here is, that the selected value from the
vector (q) is reused for the whole column, at the cost that more memory
is needed to store the accumulator for the reduction, but more parallelism
can be exploited. Before we can formalize the exchange operation, we
first introduce lift that is the applicative operation raising a function to
operate over functors (shorthand for a partially applied map):
lift :: (a→ b)→ f a→ f b (41)
Of course, we can generalize the above identities to n-ary arguments and
can operate at any level of the functor nestings. The formal version of the
exchange rule used in the matrix-vector product above is (see Figure 2):
map (\a→ rnz r m a u) A =
rnz (lift r) (\a q→ map (\α→ m α q) a) (flip 0 A) u (42)
10
(rows A)
r
r
r
u
dot 
dot 
dot 
map (\r -> dot r u) (rows A)
(cols A)
u
rnz (zip (+)) (\c q -> map (\e ->e*q) c) (cols A) u
q
q
q
q
c c c c
ma
p (
*q
)
c' c' c' c'
zip (+) zip (+)
zip (+)
Figure 2: Two versions of matrix-vector multiplication related by the flip iden-
tity (eq. 42) for map and rnz operations.
This transformation is one of the most important transformations that
can generate cache friendly versions of a given formula. Looking at the
type level, we find, that the matrix a(n,m) is flipped (resulting in a(m,n))
to keep the same dimension binding to the structure of the vector. So
we can conclude, that exchanging two nested higher order functions must
be done with an appropriate flip in the subdivision structure. It is also
possible to state the flip identity generally for rnz operations, but it is
much more restricted, the reduction operation must be the same and
must be commutative and associative:
rnz r (\a1 a2 → rnz r m a1 a2 B) A1 A2 = (43)
rnz r (\a1 a2 b→ rnz r (\α1 α2 → m α1 α2 b) a1 a2) (flip 0 A1) (flip 0 A2) B.
The idea of the proof of this transformation is based on expanding the
rnz operations and observing that all (m α1 α2 b) elements appear exactly
once on both sides and then the equality follows from the properties of r.
These operations are necessary, but not sufficient to bring competitive
performance when compared to hand-written functions. The reason is,
as mentioned earlier, the hierarchical memory layout of all modern com-
puting hardware. For example, in a well written matrix multiplication
code, the matrix is subdivided into smaller submatrices usually as many
times as many levels are in the memory hierarchy: for example, GPUs
have global memory, shared (local) memory and vector registers.
We can construct identities because subdivisions should not alter the
results of operations when compared to the execution without subdivi-
sions. As a trivial example, mapping over a vector can be done by mapping
the same function over the subdivisions, and then this whole operation
over the outer structure, or even over repeated subdivisions:
map f v =
map (\x→ map f x) (subdiv d b v) =
map (\x→ map(\y→ map f y) x) (subdiv d1 b1(subdiv d2 b2 v))
(44)
The same is true for zips and reductions. The important observation is,
that all the actual computation is done in the innermost map, all outer
maps are just ”logical” operations, reshapings. Not only are operations
11
A(1a) = subdiv 1 (1, 2) $ subdiv 0 (4, 1) A
map (\r -> rnz (+) (\b c -> rnz (+) (*) b c) r u') A(1a)
A(1b) = subdiv 1 (4, 1) $ subdiv 0 (1, 2) A
rnz (zip (+)) (\C c -> map (\r -> rnz (+) (*) r c) C) A(1b) u'
A(1c) = subdiv 1 (1, 2) $ subdiv 0 (1, 2) A
rnz (zip (+)) (\C d -> rnz (zip (+)) (\c e -> map (*e) c) C d) A(1c) u'
u'
C
c c
C
cc
d
d
e
e
e
e
map (*e)
zip (+)
zip (+)
rnz (zip (zip (+))) (\C e -> map (\b -> map (*e) b) C) A(2a)u
A(2a) = subdiv 1 (2, 1) $ subdiv 0 (1, 4) A
u
C
b
C
b
e
e
e
e
map(map (*e))
C C C'
b'
C'
b'
C' C'
zip (zip (+))
u' = subdiv 0 2 u c
c
r
r
r
dot b cb b
rnz (+)
r
map (\r -> rnz (zip(+)) (\be -> map (*e) b) r u) A(2b)
u'
r
r
r
C C
c
c dot r c
zip (+)
r
u e
e
e
e
r
r
b b b b b' b' b' b'
map (*e)
zip (+)
A(2b) = subdiv 1 (1, 4) $ subdiv 0 (2, 1) A
map (\R -> map (\r -> rnz (+) (*) r u) R) A(2c)
A(2c) = subdiv 1 (2, 1) $ subdiv 0 (2, 1) A
u
R
R
r
r
r
r
dot r u
Figure 3: Six rearrangements of the matrix-vector multiplication gained by
subdividing (eq. 46) (left three cases) and (eq. 48) (right three cases) and then
applying the map-rnz flipping rule of eq. 42.
independent of the level of subdivisions, but by invoking the earlier iden-
tities, we can rearrange their order.
We now use the above results to regroup the calculation of the matrix-
vector product, where A has the type a(n,m):
vi =
∑
j
Aij · uj (45)
map (\r→ dot r u) A (46)
subdividing the rows r and the vector u into smaller chunks of size b, we
can write:
map (\r→ rnz (+) (\b c→ dot b c) r (subdiv 0 b u)) A′
A
′ = subdiv 1 b A (47)
On the other hand, starting from the other formula:
rnz (zip (+)) (\c q→ map (\e→ e ∗ q) c) (flip 0 A) u (48)
and subdividing the columns of the matrix we find:
rnz (zip (+)) (\C c→ map (\r→ dot r c) C) A′′ (subdiv 0 b u)
A
′′ = subdiv 1 b(flip 0 A) (49)
Comparing the two results, we can see that it is still the manifestation
of the map-reduce flip identity constructed earlier, so subdivisions and
12
flips commute as expected. One can construct other arrangements by
applying the map-rnz flip formula above, here we list all six cases (see
also the Figure 3):
• 1a: map (\r→ rnz (+)(\b c→ rnz (+) (∗) b c) r u′) A(1a)
(eq. 47),
• 1b: rnz (zip (+)) (\C c→ map (\r rnz (+) (∗) r c) A(1b) u′
(applying flip 0 to 1a)
• 1c: rnz (zip (+))(\C d→ rnz (zip (+))(\c e→ map (∗e) c) C d) A(1c) u′
(applying flip 1 to 1b)
• 2a: rnz (zip (zip (+))) (\C e→ map (\b→ map (∗e) b) C) A(2a) u
(applying subdiv to map in eq. 48 ),
• 2b: map (\r→ rnz (zip (+))(\b e→ map (∗e) b) r u) A(2b)
(applying flip 0 to 2a)
• 2c: map (\R→ map (\r→ rnz (+) (∗) r u) R) A(2c)
(applying flip 1 to 2b)
with the A(..)s being matching subdivisions, u′ = subdiv 0 b u.
Here we write out how the different matrices are related to the original
one (c.f. Figure 3):
• A(1a) = subdiv 0 2 A
• A(2a) = subdiv 0 2 (flip 0 A)
• A(1b) = flip 1 (subdiv 0 2 A)
• A(2b) = flip 1 (subdiv 0 2 (flip 0 A))
• A(1c) = flip 0 (flip 1 (subdiv 0 2 A))
• A(2c) = flip 0 (flip 1 (subdiv 0 2 (flip 0 A)))
In cases 1a 1b 1c the vector is subdivided (we divided the rnz opera-
tion that involves the vector), in the other three cases the vector is not
subdivided as we subdivided the map that does not involve the vector.
Comparing the formulas, we once again see that the exchange of the
higher-order functions happens simultaneously with the flipping of the log-
ical structure. Furthermore, the formulations listed using the same letter
(for example, 1a and 2a) differ only in a full (”deep”) transposition of the
logical structure. Each representation provides different parallelization
and caching possibilities, but may also require different memory capaci-
ties for accumulation. While the final performance is affected by many
parameters, a few important points can be observed: If the vector is sub-
divided (1a, b, c), then it is possible to have 2-level caching (with one
block of the vector at an outer level, and one element at an inner level);
while if it is not (2a, b, c), then only a single level of the memory hierarchy
can be used. Cache efficiency is improved when reductions involving the
vector are propagated outwards, thereby reusing the blocks or components
multiple times. This, on the other hand, causes the size of temporaries
used for the reductions to become bigger, for example, 1a uses only scalar
accumulators, while 1b and 1c require full columns. This can form a limit
on how high the reductions can be raised. If the matrix is stored in a row-
major order, then the memory access patterns of the 1c and 2a variants
may be too interleaved, and similarly for 1a and 2c if using column-major
storage. Repeated subdivisions of the same dimension (1c, 2c) provide no
real benefit over the original formula.
13
HoF permutation Time [s]
mapA rnz mapB 0.45
rnz mapA mapB 1.41
mapA mapB rnz 4.67
mapB mapA rnz 6.05
rnz mapB mapA 13.8
mapB rnz mapA 15.6
Table 1: Six rearrangements of the naive matrix-matrix multiplication. The HoF
order from left to right is the nesting from top down in the computation. The
naive C level implementation is 4.9 seconds and the improved blocked version
is 0.278 seconds.
4 Demonstrative results
In this section, we show some brief results to illustrate, that subdivid-
ing and rearranging a computation expressed with the above higher-order
functions can indeed yield better performing versions. For this demon-
stration we consider matrix multiplication for square matrices on CPUs.
We implemented[15] a DSL in C++ that transforms an AST containing
the above introduced HoFs and lambda abstraction / application nodes.
We implemented the pattern match and replace logic by structured recur-
sion schemes (catamorphisms, anamorphisms, paramorphisms and Elgot-
algebras), and defined the pattern-replacement pairs for standard lambda
calculus transformations (η equivalence, β reduction) and the subdivision
and flip operations as defined above. The system can take an AST of a
simple computation as input, can perform subdivision and then can gener-
ate all permutations of HoFs nested under each other as seen below. Since
this kind of nesting forms a list, the well known Steinhaus-Johnson-Trotter
algorithm [16][17] can be used to enumerate all possible permutations by
adjacent element swapping.
Finally, for each permutation we can generate C++14 code for a small
library defining the HoF implementations and Views for accessing the
data. The Views store their dimensions and strides in template parameters
thus implementing the Naperian functor concept (as they have a fixed
shape). Flips and subdivisions are implemented as template functions.
Below all results are for 1024 x 1024 double precision matrices measured
on a Core i5 7300HQ. We also implemented the naive algorithm in C as
well as a block-optimized version by hand, which ran in 4.9 seconds and
0.278 seconds, respectively.
Let us start from the mathematical form:
Cik =
∑
j
AijBjk (50)
The equivalent naive functional formulation is:
C = map (\rA → map (\cB → rnz (+) (∗) rA cB) B) A (51)
where rA are the rows of A, and cB are the columns of B, we assume that
they are both stored in a row major format, so accessing their consecutive
elements have different costs. While this textbook formulation is easy to
understand and implement, its performance is far from optimal due to
inefficient cache usage. Using the exchange rules above, we can create 6
14
HoF permutation Time [ms]
rnz mapA mapB rnz 186
mapA rnz mapB rnz 249
rnz mapA rnz mapB 324
mapA rnz rnz mapB 478
rnz rnz mapA mapB 1468
mapB rnz mapA rnz 2024
rnz mapB mapA rnz 2231
mapA mapB rnz rnz 4970
rnz mapB rnz mapA 5580
mapB mapA rnz rnz 5686
mapB rnz rnz mapA 5972
rnz rnz mapB mapA 7353
Table 2: Twelve rearrangements of the matrix-matrix multiplication, where the
rnz operation is subdivided. The HoF order from left to right is the nesting
from top down in the computation. The naive C level implementation is 4.9
seconds and the improved blocked version is 278 ms.
 0
 2
 4
 6
 8
 10
 0  2000  4000  6000  8000  10000  12000  14000  16000
Co
un
t
Time [ms]
Figure 4: Rearrangements of the matrix-matrix multiplication gained by subdi-
viding the two map operations.
permutations (since each HoF is different) and measure their performance
(Table 1).
It is easy to understand why the best performance is achieved when
mapB is in the innermost position: in this case a flip was used to exchange
the rnz with mapB (compared to the original form) and thus matrix B
is accessed in the inner loop row-wise, in the same direction as the row
of A in the rnz, and such consecutive reads are the best for the memory
controller logic. The opposite order leads to the most suboptimal case,
when B is outside and A is inside: in this case both of them are accessed
column wise, which effectively invalidates most of the caches.
Next, we may subdivide the rnz part, and create all permutations of
the 4 HoFs, however we do not differentiate between the two rnzs, so
effectively we have 12 different cases (Table 2). At the subdivision we use
b = 16 blocksize.
In the other case, when the two maps are subdivided (in their outer-
most direction) does not improve performance: the best candidate there
needs 1 second (c.f. Figure 4). When the rnz is subdivided twice, the
average performance is drastically increased as shown in Figure 5, all can-
didates are at least as good as the naive implementation, but the best
candidate is just as good as in Table 2. Subdividing both the maps and
15
 0
 5
 10
 15
 20
 25
 0  1000  2000  3000  4000  5000  6000  7000  8000
Co
un
t
Time [ms]
Figure 5: Rearrangements of the matrix-matrix multiplication gained by subdi-
viding the rnz operation twice.
 0
 5
 10
 15
 20
 25
 30
 35
 40
 45
 0  1000  2000  3000  4000  5000  6000  7000  8000
Co
un
t
Time [ms]
Figure 6: Rearrangements of the matrix-matrix multiplication gained by subdi-
viding all HoFs once.
the rnz yields no improvement over the case where just the rnz was sub-
divided (Figure 6).
We note, that Eigen with vectorization explicitly disabled (but the
compiler may still vectorize) performs the matrix multiplication in 333
ms, and with its own optimal vectorization it finishes in 60 ms. These
results of course depend not only on the hardware at hand, but also on
the compiler used (in our case clang SVN r323406), and especially in our
compile-time sized case, on the size of the matrix. The important point
is, that with the automatic rewrites it was possible to achieve more than
25 fold increase in performance (4.9 seconds to 180 ms) w.r.t. the naive
implementation.
On GPUs, not all potential rearrangements are feasible. To make
use of the 2 dimensional thread spawning capability, it is useful to select
cases, where the maps over the matrices are adjacent. This is immediately
the case in the naive form, and from the subdivided cases we compare
with the one arising from the simultaneous subdivision of all three HoFs,
namely: mapA mapB rnz mapA mapB rnz. With the necessary additional
considerations of local memory, this leads to a 40% improvement (AMD
Radeon HD7970, ComputeCpp 0.4 compiler).
By applying the presented rewrite rules repeatedly, we get deeper sub-
divisions and further rearrangements of the initial formula, and the ad-
ditional degrees of freedom can be used to utilize vectorization both on
CPUs and GPUs, or over multiple devices and threads, but this is outside
the scope of this paper and left to future work
16
5 Related work
In this work we focused on designing a collection of functional primitives,
which are composable and closed under composition as often as possible,
and suitable for pattern based optimisation to increase data locality. Here,
we review other approaches to utilizing hardware parallelism.
Algorithmic Skeletons or algorithmic patterns are higher-order func-
tions that aim to capture some recurring structures of computation, es-
pecially in the context of parallel programming [18][19][20]. Such a col-
lection of algorithms usually includes divide-and-conquer for recursively
formulated computations, iterative combination for graph problems, and
also pipelining (a sequence of stages, where different stages on different
objects are executed in parallel), farming (a variant of map) and other
algorithms. Most of these skeletons were formed by abstracting common
high-level patterns of problems (even higher-level, than the ones discussed
in this work) and while their formulation is similar to higher-order func-
tions in functional programming, their theoretical background, composi-
tionality and interplay with other language constructs like products and
sums types is not self-evident. Nevertheless, these approaches provide
patterns which we know how to execute optimally on parallel hardware.
Optimization and fusion in the context of function programming us-
ing rewrite-rules is extensively researched for improving languages and
compilers, like Haskell [21], the GHC compiler [22, 23] and Data-Parallel
Haskell[24]. Most of the efforts are centered on eliminating the needless
construction and consumption of lists and streams in memory, which are
frequent in such languages. While fusion of such generic data structures is
important, further optimization is possible in special cases, where fusion
would require additional constraints on the shape of data, as is the case
with arrays and especially arrays of arrays (multidimensional arrays). For
example, a list of lists may be irregular (each contained list may have
a different length, potentially zero), and thus may not be safely trans-
posed (order of nestings exchanged), whereas a fixed size array of fixed
size arrays must be regular, making such a transformation possible. This
limitation was recognized, and fusion for arrays in particular has also
been investigated, as in the case of [25]. SAC[26] is another functional
array library with an emphasis on fusion. In the HPC setting, rewrites of
legacy code based on user annotations are described in [27]. The Spiral
language [28] that targets performance critical problems with extensive
inlining and optimizations is also driven by rewrite rules. Delite [29] also
uses rewrite rules to optimize locality in NUMA code, and targets CUDA
capable GPUs.
GPU Computing is traditionally done through an imperative API
(CUDA, OpenCL, SYCL, OpenACC, OpenMP) and/or an abstraction li-
brary over them, like Copperhead[30] (data parallel constructs embedded
in Python) or Accelerator[31] (parallel arrays with predefined operations
in C#).
From the functional perspective Obsidian[32] and Accelerate[33, 34]
employ techniques similar to those presented here, but none of these in-
volve the optimization of operations on multidimensional arrays. The
NOVA language shares some of the fusion logic presented here [35] and
supports nested structures, but nested fold-map structures are flattened.
Multidimensional and nested parallel constructs are discussed in [36], but
they do not change the layout of the computation, only the mapping of
nested loops to GPU threads. Loop exchange in the context of small lin-
17
ear algebraic problems was studied in [37], but not in the general case of
arbitrary tensors of nested structures.
We mention two projects in more detail as they use very similar
methodology to the one presented here.
Lift [1][11][10] is a DSL for high-level GPU programming, it generates
OpenCL code from high-level functional primitives with rewrite rules. A
restricted set of primitives are provided which can be nested, opening the
possibility of utilizing nested parallelism on the hardware. While the map,
zip and reduce constructs are similar to the ones presented here, they are
not generalized to the variadic case, preferring to express the same thing
using nested applications instead, but this results in a larger number of
potential program rearrangements which need to be assessed. Map sub-
division (which the authors call splitting) is aided by auxiliary functions
(split and join), whereas we are able to nest and transform HoFs directly.
During rewriting, they explore multiple different algorithmic choices (by
applying different rewrite rules) to generate multiple different optimiza-
tions of the input program, which can then be lowered to OpenCL-specific
parallel primitives. Their rewrite rules does not involve the exchange rules
presented here, although some of their features can be expressed using
their reorderstride function. Overall, for linear algebra problems, their
approach may require a larger number of primitives, and exploring a larger
search space, compared to the more generic constructs presented here.
Futhark [38][39] is another purely functional DSL which, similarly to
Lift, targets GPUs. It has a wide range of generic primitives for operating
on arrays which are similar to the ones presented here and come with an
extensive set of fusion rules. The Futhark multi-stage optimising compiler
also uses rewrites to apply optimizations. The fusion rules are selected
to never duplicate computations and never reduce available parallelism.
Futhark uses a data flow graph reduction logic and, because evaluating
fusion of arbitrarily large graphs is NP-hard, they use a greedy algorithm
which selects the first suitable candidate. As for exchange rules, Futhark
performs generic map-loop, map-reduce interchanges and avoids full flat-
tening, but may end up reducing parallelism during these transformations
in an attempt to keep data locality. The exchange and transformation
logic is predefined and specialized for GPUs, and might not be well-suited
for other architectures. In the case where there is more than one candidate
for the most efficient rearrangement, the Futhark compiler generates mul-
tiversioned code and selects between them at runtime. After performing
rearrangements, the kernel extraction pass selects parts of the program
which will be lowered to be executed on the GPU. Shape of arrays is ma-
nipulated by index space transformations and explicit rearrangements in
memory that are inserted by the compiler after it performs complicated
analysis of access patterns. In our approach, we choose the opposite di-
rection, that is, generating different traversal patterns as a consequence of
changes in the logical structure of the data. Tiling in Futhark is likewise
depends on access analysis, and is restricted to one or two dimensions.
Whereas in our approach, tiling also falls out as a consequence of rear-
ranging HoFs and the logical structure of the arrays they act upon and
does not have any inherent limitations in the number of dimensions it
supports. In the SGEMM benchmark of Futhark, the authors note that
further improvements would be possible by using register based tiling in
addition to local memory tiling. Both of these may be expressed in our
approach by mapping different layers of the nested HoF structure to dif-
18
ferent memory spaces, as in our discussion of the memory hierarchy in the
introduction.
6 Conclusions
We have investigated a collection of rewrite rules for the fusion, subdi-
vision and rearrangement of higher-order functions expressing dense ar-
ray arithmetic, which are relevant to the optimization of linear algebra
and neural network constructs. Our focus was specifically on the best
choice of high-level primitives, with an eye to compositionality and clo-
sure under transformations. We achieved our aim by generalizing function
composition, products and higher-order functions to an arbitrary number
of arguments. By using Naperian functors, we were able to take sim-
ple and obvious identities on functions and translate them for use with
generic multidimensional array types, constructing subdivision and flip
operations for higher-order functions operating on them. As the driving
goal of our work is to help develop tools which can automatically improve
the hierarchical memory- and cache-efficiency as well as the parallelism
of user-written programs, these transformations are derived entirely from
the structure of the expression tree representing the computation writ-
ten by the user. Thus, this approach might be called structure-induced
parallelism. The rewrite rules we have expressed here are scalable to struc-
tures which are nested and subdivided to an arbitrary extent, and thus
are potentially capable of distributing computations over the entire hier-
archy of modern hardware, from vector instructions to entire clusters. On
the example of the matrix multiplication problem, we demonstrated how
the automatic subdivision and reordering of higher-order functions can
improve data locality by transforming an initial naive implementation.
Future work The approach to enumeration presented in this paper is
limited to linear nesting of higher-order functions. In order for a gener-
alization of this to enumeration of trees to be useful in practice, an early
cut rule is also necessary to prune rearrangements which are not feasible,
or which are not significantly different from previously enumerated cases.
The enumerator as well as the early cut rule may depend on considera-
tions which are problem- or platform specific; however, a solution can be
based on top of the building blocks we have presented here.
There is an important set of problems not yet covered in the pre-
sented HoF formulation: the so-called sliding window problems, such as
convolutions, finite difference schemes, and stencil operations on images.
The interaction of these with products, compositions, and reductions, and
their rearrangement and subdivision, still needs to be addressed.
Acknowledgments
D. B. is supported by the Hungarian National Bureau for Research, De-
velopment and Innovation, NKFIH No. K120660 and K123815. A. L. is
supported by the U´NKP-17-2 New National Excellence Program of the
Ministry of Human Capacities. The authors are also thankful for the
Wigner GPU Lab, especially Ma´te´ Ferenc Nagy-Egri and Bala´zs Kac-
skovics for IT support and access to computation resources.
19
References
[1] The Lift Project
http://www.lift-project.org/index.html
[2] Basic Linear Algebra Subprograms
http://www.netlib.org/blas/
[3] T. Grosser, A. Groesslinger, C. Lengauer
Polly - Performing polyhedral optimizations on a low-level intermedi-
ate representation
Parallel Processing Letters, (2012).
[4] Gae¨l Guennebaud, Benoˆıt Jacob and others
Eigen, http://eigen.tuxfamily.org, (2010).
[5] Armadillo: a template-based C++ library for linear algebra
Journal of Open Source Software, Vol. 1, pp. 26, (2016).
[6] Mart´ın Abadi et al
TensorFlow: A System for Large-Scale Machine Learning
12th USENIX Symposium on Operating Systems Design and Imple-
mentation (OSDI 16), pp 265-283, (2016).
[7] Google, XLA - Accelerated Linear Algebra
https://www.tensorflow.org/performance/xla/
[8] M. Steuwer, C. Fensch, S. Lindley and C. Dubach
Generating Performance Portable Code Using Rewrite Rules: From
High-level Functional Expressions to High-performance OpenCL Code
SIGPLAN Not. Vol 50. No. 9. (2015).
[9] T. Remmelg, T. Lutz, M. Steuwer and C. Dubach
Performance Portable GPU Code Generation for Matrix Multiplica-
tion
Proceedings of the 9th Annual Workshop on General Purpose Pro-
cessing Using Graphics Processing Unit, pp. 22-31, (2016).
[10] M. Steuwer, T. Remmelg and C. Dubache
Lift: A Functional Data-parallel IR for High-performance GPU Code
Generation
Proceedings of the 2017 International Symposium on Code Generation
and Optimization, pp. 74-85, (2017).
[11] M. Steuwer
Improving Programmability and Performance Portability
PhD Thesis, University of Mu¨nster (2015).
[12] S. Ioffe and C. Szegedy
Batch Normalization: Accelerating Deep Network Training by Reduc-
ing Internal Covariate Shift
Proceedings of the 32nd International Conference on International
Conference on Machine Learning, Vol. 37., pp. 448-456, (2015).
[13] J. Gibbons
APLicative Programming with Naperian Functors
European Symposium on Programming, 10201. pp. 568-583, (2017).
20
[14] M. Harris and G.E. Blelloch and B.M. Maggs et al
Optimizing parallel reduction in CUDA
Proc. of ACM SIGMOD. Vol. 21 No. 13. pp. 104-110., (2007).
[15] DataView. Author’s repository, https://github.com/leanil/DataView
[16] H. F. Trotter
Algorithm 115: Perm
Commun. ACM. Vol. 5. No. 8. pp. 434-435., (1962).
[17] M. J. Selmer
Generation of Permutations by Adjacent Transposition
Mathematics of Computation, American Mathematical Society, Vol.
17. No. 83. pp, 282-285., (1963).
[18] M. I. Cole
Algorithmic Skeletons: Structured Management of Parallel Computa-
tion
Pitman/MIT Press, (1989).
[19] J. Darlington, M. Ghanem and H. W. To
Structured Parallel Programming
In Programming Models for Massively Parallel Computers, pp. 160-
169., (1993).
[20] J. Darlington, A.J. Field, P.G. Harrison et al
Parallel Programming Using Skeleton Functions
Lecture Notes in Computer Science, Vol. 694. Springer, pp. 146-160,
(1993).
[21] D. Coutts, R. Leshchinskiy and D. Stewart
Stream Fusion: From Lists to Streams to Nothing at All
SIGPLAN Not. Vol. 42. No. 9. pp. 315-326., (2007).
[22] B. Lippmeier, M. M. T. Chakravarty, G. Keller and A. Robinson
Data Flow Fusion with Series Expressions in Haskell
SIGPLAN Not. Vol. 48. No. 12. pp. 93-104., (2013).
[23] S. P. Jones, A. Tolmach and T. Hoare
Playing by the rules: rewriting as a practical optimisation technique
in GHC
Haskell Workshop’01 (2001).
[24] M. M. T. Chakravarty, R. Leshchinskiy, S. P. JONES, G. Keller and
S. Marlow
Data Parallel Haskell: A Status Report
Int. Work. on Decl. Aspects of Multicore Prog, pp. 10-18. (2007).
[25] M. M. T. Chakravarty and G. Keller
Functional Array Fusion
SIGPLAN Not. Vol. 36. No. 10. pp. 205-216., (2001).
[26] C. Grelck and S-B. Scholz
SAC - A Functional Array Language for Efficient Multi-threaded Ex-
ecution
International Journal of Parallel Programming, Vol. 34, No. 4, (2006).
21
[27] A. Panyala, D. Chavarria-Miranda and S. Krishnamoorthy
On the use of term rewriting for performance optimization of legacy
HPC applications.
41st International Conference on Parallel Processing (ICPP), pp. 399-
409., (2012).
[28] M. Pu¨schel, J. M. F. Moura, J. Johnson, et al
SPIRAL: Code generation for DSP transforms
IEEE special issue on Program Generation, Optimization and Adap-
tation, 93(2), (2005).
[29] K. J. Brown, H. Lee, T. Rompf, et al
Have Abstraction and Eat Performance Too: Optimized Heteroge-
neous Computing with Parallel Patterns
Proceedings of the 2016 International Symposium on Code Generation
and Optimization ACM, pp. 194-205., (2016).
[30] B. Catanzaro, M. Garland and K. Keutzer
Copperhead: Compiling an Embedded Data Parallel Language
Procs. of ACM Symp. on Principles and Practice of Parallel Program-
ming, pp. 47-56., (2011).
[31] D. Tarditi, S. Puri and J. Oglesby
Accelerator: Using Data Parallelism to Program GPUs for General-
Purpose Uses
Tech. rep. pp. 11. (2006).
[32] K. L. Claessen, M. Sheeran and J. B. Svesson
Expressive Array Constructs in an Embedded GPU Kernel Program-
ming Language
Work. on Decl. Aspects of Multicore Prog. pp. 21-30., (2012).
[33] M. M. Chakravarty, G. Keller, S. Lee, T. L. McDonell and V. Grover
Accelerating Haskell array codes with multicore GPUs
DAMP. ACM, (2011).
[34] T. L. McDonell, M. M. Chakravarty, G. Keller, and B. Lippmeier
Optimising purely functional GPU programs.
ICFP. ACM, SIGPLAN Not. Vol. 48. No. 9. pp. 49-60., (2013).
[35] A. Collins, D. Grewe, V. Grover, S. Lee and A. Susnea
NOVA: A Functional Language for Data Parallelism
Proceedings of ACM SIGPLAN International Workshop on Libraries,
Languages, and Compilers for Array Programming, No. 8. pp. 8-13.,
(2014).
[36] H. J. Lee, K. J. Brown, A. K. Sujeeth, T. Rompf and K. Olukotun
Locality-Aware Mapping of Nested Parallel Patterns on GPUs
Proceedings of the 47th Annual IEEE/ACM International Symposium
on Microarchitecture, pp. 63-74, (2014).
[37] D. G. Spampinato and M. Pu¨schel
A Basic Linear Algebra Compiler
Proceedings of Annual IEEE/ACM International Symposium on Code
Generation and Optimization, No. 23. pp. 23-32., (2014).
[38] T. Henriksen
Design and Implementation of the Futhark Programming Language,
PhD thesis, (2017).
22
[39] T. Henriksen, N. G. W. Serup, E. Martin, H. Fritz and C. E. Oancea
Futhark: Purely Functional GPU-Programming with Nested Paral-
lelism and In-Place Array Updates
Proceedings of ACM SIGPLAN Conference on Programming Lan-
guage Design and Implementation (2017).
23
