Spin Summations: A High-Performance Perspective by Springer, Paul et al.
ASpin Summations: A High-Performance Perspective
Paul Springer, AICES, RWTH Aachen
Devin Matthews, ICES, UT Austin
Paolo Bientinesi, AICES, RWTH Aachen
Besides tensor contractions, one of the most pronounced computational bottlenecks in the non-orthogonally
spin-adapted forms of the quantum chemistry methods CCSDT and CCSDTQ, and their approximate
forms—including CCSD(T) and CCSDT(Q)—are spin summations. At a first sight, spin summations are
operations similar to tensor transpositions; a closer look instead reveals additional challenges to high-
performance calculations, including temporal locality as well as scattered memory accesses. This publication
explores a sequence of algorithmic solutions for spin summations, each exploiting individual properties of
either the underlying hardware (e.g. caches, vectorization), or the problem itself (e.g. factorizability). The
final algorithm combines the advantages of all the solutions, while avoiding their drawbacks; this algorithm,
achieves high-performance through parallelization, vectorization, and by exploiting the temporal locality
inherent to spin summations. Combined, these optimizations result in speedups between 2.4× and 5.5×
over the NCC quantum chemistry software package. In addition to such a performance boost, our algorithm
can perform the spin summations in-place, thus reducing the memory footprint by 2× over an out-of-place
variant.
CCS Concepts: •Mathematics of computing → Mathematical software; •Software and its engi-
neering → Source code generation; •Computing methodologies → Linear algebra algorithms;
Special-purpose algebraic systems; •Theory of computation→ Vector / streaming algorithms;
Additional Key Words and Phrases: high-performance computing, tensor transposition, in-place
ACM Reference Format:
Paul Springer, Devin Matthews and Paolo Bientinesi. 2017. Spin Summations: A High-Performance Per-
spective ACM Trans. Math. Softw. V, N, Article A ( 2017), 22 pages.
DOI: http://dx.doi.org/10.1145/0000000.0000000
1. INTRODUCTION
In quantum chemistry, especially in high-accuracy calculations utilizing the popular
coupled cluster family of methods [Cˇı´zˇek 1966; Bartlett and Musiał 2007; Helgaker
et al. 2013], tensor algebra plays a central role and often accounts for the vast major-
ity of the computational effort. The main tensor operation utilized is a tensor contrac-
tion, the multidimensional analogue of matrix multiplication, which is indeed usually
expressed in terms of highly-optimized matrix multiplication kernels [Springer and
Bientinesi 2016a; Matthews 2016].
Other essential tensor operations can consume a significant portion of the runtime,
especially when they are poorly optimized relative to hand-tuned matrix multiplica-
tion kernels, or when they are practically bandwidth-bound. Tensor transpositions
(permutations) [Wei and Mellor-Crummey 2014; Lyakh 2015; Springer et al. 2016;
Author’s address: P. Springer (springer@aices.rwth-aachen.de), P. Bientinesi (pauldj@aices.rwth-
aachen.de), AICES, RWTH Aachen University, Schinkelstr. 2, 52062 Aachen, Germany, D. Matthews
(dmatthews@utexas.edu), ICES, The University of Texas at Austin, Austin, TX 78712, USA.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. Copyrights for components of this work owned
by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or
republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request
permissions from permissions@acm.org.
c© 2017 Copyright held by the owner/author(s). Publication rights licensed to ACM. 0098-3500/2017/-ARTA
$15.00
DOI: http://dx.doi.org/10.1145/0000000.0000000
ACM Transactions on Mathematical Software, Vol. V, No. N, Article A, Publication date: 2017.
ar
X
iv
:1
70
5.
06
66
1v
1 
 [c
s.M
S]
  1
8 M
ay
 20
17
A:2 Paul Springer et al.
Springer and Bientinesi 2016b] represent one of the bottlenecks, but recent work has
shown that this cost can be minimized [Springer et al. 2017; Springer et al. 2016;
Springer and Bientinesi 2016b] or even eliminated entirely [Matthews 2016; Springer
and Bientinesi 2016a]. Instead, in high-accuracy calculations based on the efficient
non-orthogonal spin-adaptation scheme [Cˇı´zˇek 1966; Kucharski and Bartlett 1992;
Matthews et al. 2013], there remains an additional and significant bottleneck in the
form of the “spin summation” operation [Matthews and Stanton 2015]. This operation
is critical to forming the intermediates necessary to implement the fully factored form
of the working equations, and also in the regularization of the tensor quantities to
maintain numerical stability in the face of a linearly-dependent basis [Kucharski and
Bartlett 1992; Adams and Paldus 1979]. The optimization of this important kernel is
the focus of this work. In particular, we examine the set of spin summation opera-
tions summarized in Sec. 5.1, which cover all of the operations required for the non-
orthogonally spin-adapted CCSDT [Noga and Bartlett 1987] and CCSDTQ [Kucharski
and Bartlett 1992] quantum chemistry methods, as well as their approximate forms,
such as CCSDT(Q) [Bomble et al. 2005].
Tensor transposition is related to spin summation in that the non-locality (in both
space and time) of data references must be taken into account in order to achieve
high efficiency on modern hardware. The primary difference between transposition
and spin summation, though, is that spin summation entails data reuse of both the in-
put and output tensors, while transposition accesses the tensors only once. For factor-
izable summations (defined below), there is also a difference between algorithms that
perform optimal and sub-optimal numbers of floating-point operations. Thus, while
specific optimization techniques from tensor transposition may be applied to the spin
summation problem, a bottom-up analysis of algorithmic efficiency for this distinct
class of problems is warranted.
The techniques and algorithms developed in this work are applicable to other im-
portant tensor operations beyond spin summations. A spin summation is essentially
a linear sum of multiple tensor transpositions; this pattern appears in other tensor
operations as well, for example in tensor (anti-)symmetrization [Hanrath and Engels-
Putzka 2010; Lai et al. 2012], and in the aggregation of intermediate tensor products
produced in distinct orderings [Baumgartner et al. 2005; Hartono et al. 2009]. Op-
timization of operations involving multiple transpositions is especially important for
higher-dimensional tensors due to the proliferation of possible transpose variants.
To the authors’ best knowledge, this work represents the first discussion of spin
summations from a high-performance perspective. The contributions of this publica-
tions can be summarized as follows:
— We provide a detailed analysis of different algorithmic variants for the calculation of
spin summations, exploiting various features of modern CPUs.
— We present a high-performance open-source implementation1 for 3D and 4D spin
summations used in computational chemistry, in particular in the non-orthogonally
spin-adapted CCSDT and CCSDTQ methods and their various approximations. The
algorithm is parallelized and vectorized, it exploits the CPU’s cache hierarchy,
and it reduces the amount of required floating-point operations while only using
small buffers that fit into cache; together, these optimizations result in significant
speedups (ranging from 2.4× to 5.5×) over the reference NCC quantum chemistry
software [Matthews and Stanton 2015], part of the CFOUR package [Stanton et al.
2014].
1The source code is publicly available at: www.github.com/springer13/spin-summations.
ACM Transactions on Mathematical Software, Vol. V, No. N, Article A, Publication date: 2017.
Spin Summations: A High-Performance Perspective A:3
— We illustrate that the high-performance algorithm can also be used for in-place spin
summations of 2D, 3D or 4D “square” tensors, reducing the memory footprint by 2×.
The remainder of this publication is structured as follows: Section 2 reviews re-
lated work on tensor transposition and related transformations such as unpacking
of anti-symmetric tensors. Section 3 familiarizes the reader with spin summations,
the nomenclature used, and the cache hierarchy of a modern CPU. Section 4 presents
different implementations for spin summations and discusses their advantages and
disadvantages. Section 5 introduces the exhaustive list of spin summations of interest
and describes the measurement environment. Section 6 evaluates the performance of
the different implementations and highlights their key features. Section 7 concludes
our findings.
2. RELATED WORK
Matrix transposition (i.e. 2D tensor transposition) is a well studied operation, includ-
ing optimizations for blocking, vectorization, unrolling, and software prefetching [Mc-
Calpin and Smotherman 1995; Mateescu et al. 2012; Lu et al. 2006; Goldbogen 1981;
Vladimirov 2013; Chatterjee and Sen 2000]. For 3D tensors, the same optimizations
are investigated in the context of out-of-place transpositions on CPUs [van Heel 1991;
Jodra et al. 2015].
The optimization of arbitrary-dimensional tensor transpositions has gained more
interest in recent years [Wei and Mellor-Crummey 2014; Lyakh 2015; Springer et al.
2016; Springer and Bientinesi 2016b]. Our previous work on the Tensor Transposition
Compiler (TTC) [Springer et al. 2016; Springer and Bientinesi 2016b] relied on code
generation to yield a nearly-optimal kernel for any given tensor transposition. While
the generated code attained high performance in the general case, it was only ap-
plicable to tensor transpositions that were known at compile time. For this reason,
we developed HPTT [Springer et al. 2017], a high-performance C++ library for tensor
transpositions that maintains the desirable properties of TTC (e.g. explicit vectoriza-
tion, loop reordering, autotuning) but avoids the code generation process.
While there has been some focus on in-place tensor transpositions [Ding 2001; He
and Ding 2002; Catanzaro et al. 2014], their performance—in the general case—has
been found to be limited.
In the context of data reuse in multidimensional operations, algorithms for un-
packing anti-symmetric tensors into matrices were studied by Hanrath et al. [Han-
rath and Engels-Putzka 2010]. They identify both “matrix-driven” and “tensor-driven”
approaches, which are essentially equivalent to Algorithm 1 presented later. These
approaches are also similar in terms of data access patterns to configuration inter-
action (CI) algorithms, which come in both “integral-driven” [Saunders and Lenthe
1983; Ansaloni et al. 2000; Klene and Robb 2000; Gan et al. 2003] and “configuration-
driven” [Buenker and Krebs 1999; Jiang et al. 2009] varieties. In each of these cases,
accesses to one multidimensional object is linearized at the expense of scattered ac-
cesses to another object. Blocking for the cache hierarchically has been studied and
successfully applied in the context of stencil algorithms [Datta et al. 2008; Nguyen
et al. 2010]; this is in several ways similar to the multidimensional tensor case.
3. BACKGROUND
To keep this publication self-contained, we introduce the used nomenclature and ex-
plain necessary architectural properties of a modern CPU.
ACM Transactions on Mathematical Software, Vol. V, No. N, Article A, Publication date: 2017.
A:4 Paul Springer et al.
3.1. Nomenclature
A d-dimensional tensor A ∈ Rn1×...×nd is stored as a d-dimensional array adhering
to the Fortran storage scheme (i.e. column-major). For instance, given a 3D tensor
A ∈ Rn1×n2×n3 , the element Ai1i2i3 , for any index (i1, i2, i3) ∈ {0, ..., n1− 1}×{0, ..., n2−
1} × {0, ..., n3 − 1}, is placed in the linearized memory location &(A) + i1 + i2n1 +
i3n1n2 (& denotes the “address of” operator). Due to the nature of our problem we
are only concerned with three- and four-dimensional tensors for which the sizes of all
dimensions are identical (i.e. n1 = n2 = n3 = n4 = N ).
DEFINITION 1 (PERMUTATION). Given a bijective function
φ : {1, 2, ..., d} → {1, 2, ..., d}, (1)
we define the permutation pi that operates on d-tuples as:
pi : Nd → Nd, (i1, i2, ..., id) 7→ (iφ(1), iφ(2), ..., iφ(d)). (2)
Instead of defining pi explicitly, we use the shorter notation: piφ(1)φ(2)...φ(d).
Example: pi321 applied to an index (i1, i2, i3) inverts its order and yields the permuted
index pi321(i1, i2, i3) = (i3, i2, i1).
DEFINITION 2 (TENSOR PERMUTATION). Given a tensor A ∈ Rn1×...×nd and a per-
mutation pi, we formally express a tensor permutation as
Bi1i2...id ← Api(i1,i2,...,id) ∀ (i1, i2, ..., id) ∈ {0, ..., n1 − 1} × {0, ..., n2 − 1} × ...× {0, ..., nd − 1}.(3)
Since this notation is quite verbose, we revert to a shorter form where pi acts as an
operator on the entire tensor such that a tensor permutation can be written as B ← pi(A).
While we use the same symbol pi both as an operator that acts on indices as well as on
tensors, these two case are easily distinguishable by the context.
Example: A matrix transposition of A,B ∈ RN×N can be expressed via the tensor
permutation: B ← pi21(A).
DEFINITION 3 (SPIN SUMMATION). In its most general form, a spin summation of
a d-dimensional tensor A is a sum of n tensor permutations ω1, ..., ωn each scaled by
some αi ∈ R:
B ←
n∑
i=1
αiωi(A)
= (α1ω1 + α2ω2 + ...+ αnωn)(A). (4)
Example: B ← (pi123 + pi213 + pi321 + pi231 + pi132 + pi312)(A) is spin summation where all
αi are equal to one.
Most of the spin summations that we are interested in exhibit a very important
property, which we refer to as factorizable.
DEFINITION 4 (FACTORIZABLE SPIN SUMMATION). A spin summation B ←∑n
i=1
αiωi(A) is factorizable, if it can be separated into two successive spin summations
such that:
T ←
n̂∑
i=1
α̂iω̂i(A), (5)
B ←
n˜∑
i=1
α˜iω˜i(T ), (6)
ACM Transactions on Mathematical Software, Vol. V, No. N, Article A, Publication date: 2017.
Spin Summations: A High-Performance Perspective A:5
with n̂+ n˜ ≤ n and T ∈ Rn1×...×nd . This definition can be applied recursively, factoring
the initial spin summation into more than two stages.
Example: The spin summation
B ← (pi123 + pi132 + pi213 + pi231 + pi312 + pi321)(A) (7)
can be factorized as
B ← (pi123 + pi213)((pi123 + pi321 + pi132)(A)). (8)
Note that that every spin summation B ← ∑n
i=1
αiωi(A) is factorizable without the
constraint n̂ + n˜ ≤ n. For instance, let ω̂ be an arbitrary permutation, then all factor-
izations of the following form are viable:
T ← ω̂(A), (9)
B ←
n∑
i=1
αiω˜i(T ), (10)
where ω˜i = (ωi ◦ ω̂−1) denotes the composition of ωi and the inverse of ω̂.
3.2. Cache Hierarchy
The cache hierarchy of modern CPUs bridges the ever widening gap between the per-
formance of the CPU’s floating-point units and its memory subsystem2 by buffering
frequently used data in fast on-chip memory (i.e. caches and registers).
Caches exploit the fact that data, once loaded from main memory, is likely accessed
again soon after; this effect is referred to as temporal locality. Thus, repeated accesses
to the same memory location are served by the fast caches as opposed to the slow main
memory. Another important feature of caches is that they fetch data at the granularity
of a cacheline (typically 64 bytes = 16 single- = 8 double-precision elements). This
means that an access to a memory location l not only loads the element situated at
l but indeed its entire cacheline. Thus, a subsequent access to l + 1 (assuming that l
and l + 1 belong to the same cacheline) is served by the cache and does not require an
additional load from main memory; this process is referred to as spatial locality.
The caches of a CPU are organized into multiple levels, which vary in size, band-
width and latency [Intel Corporation 2015]. The Intel Xeon E5-2680 v3 CPU, for in-
stance, has three levels: L1, L2, and L3 with decreasing speed and increasing size.
While L1 and L2 are private to each core, the L3 (also known as last level cache,
LLC) is shared between all the cores. Since modern CPUs use a cache coherency proto-
col [Patterson and Hennessy 2007; Bryant et al. 2003] to keep all the caches of a CPU
coherent, it is important that different cores do not write to the same cacheline; such
a situations is referred to as false sharing and results in coherence traffic among the
cores that can significantly lower the performance.
Modern write-back caches typically employ the write-allocate policy [Bryant et al.
2003]: A write-miss (i.e. a write to a memory location that is not in any level of the
cache hierarchy) fetches the corresponding cacheline from main memory before up-
dating it; this additional memory traffic is referred to as write-allocate traffic. Such a
mechanism is favorable if the data is be accessed again in the near future. So called
non-temporal store instructions avoid this write-allocate traffic by writing an entire
2The floating-point unit of one Intel Xeon E5-2680 v3 core can issue two fused-multiply-adds (c← a×b+c)
per cycle; each fused-multiply-add operates on 3× 8 single-precision elements simultaneously. The full Intel
Xeon E5-2680 v3 CPU—assuming a turbo boost frequency of 3.1 GHz—would require a staggering main
memory bandwidth of 6652 GiB/s to keep all its floating-point units—across all its 12 cores—busy. This
value is in sharp contrast to the CPU’s theoretical peak memory bandwidth of 63.3 GiB/s.
ACM Transactions on Mathematical Software, Vol. V, No. N, Article A, Publication date: 2017.
A:6 Paul Springer et al.
Algorithm 1: Spatial locality in B
1 #pragma omp parallel for collapse(2) schedule(static)
2 for (i3 = 0; i3 < N ; i3++) do
3 for (i2 = 0; i2 < N ; i2++) do
4 for (i1 = 0; i1 < N ; i1++) do
5 Bi1i2i3 ← Ai1i2i3 +Ai1i3i2 +Ai2i1i3 +Ai2i3i1 +Ai3i1i2 +Ai3i2i1
cacheline without fetching it from main memory prior to the write (see [Intel Corpora-
tion 2015] for more details).
3.3. Blocking
Any non-trivial real-world application has to utilize a CPU’s rich cache hierarchy to
attain good performance. One of the most powerful optimization techniques to enable
such a design is blocking. Blocking refers to an reordering of the memory accesses to
make better use of the fast on-chip memory (i.e. caches and registers). This optimiza-
tion often entails a significant modification of the original algorithm. A very prominent
operation that makes extensive use of this technique is matrix-matrix multiplication.
The interested reader is referred to [Gunnels et al. 2001; Goto and Geijn 2008; Van
Zee and van de Geijn 2015] for further details.
4. ROAD TO HIGH PERFORMANCE
The following discussion introduces several implementations of the spin summations
in Eq. 7–8; this spin summation is small enough to serve as an example throughout
this paper, yet it is rich enough to demonstrate all features of interest. The implemen-
tations presented in Sections 4.1–4.5 employ optimization strategies targeting differ-
ent features of modern CPUs. We analyze their advantages and disadvantages and lay
the foundation for a high-performance implementation (see Section 4.6) that combines
the advantages of its predecessors while avoiding their drawbacks.
4.1. Algorithm 1: Locality in B
Algorithm 1 is a direct translation of Eq. 7 into code. Provided that the compiler is
smart enough to accumulate the six elements from A (Line 5) at the register level, this
algorithm only requires one write to each memory location of B, resulting in perfect
temporal locality for B. Moreover, the loops are ordered such that the output tensor
B is traversed in a linear fashion. While this ordering leads to a preferable memory
access pattern for B, it causes an unfavorable access pattern forA. Phrased differently,
this algorithm perfectly exploits both the spatial locality as well as temporal locality
in B, but it only exhibits very limited locality in A. More precisely, the spatial locality
in A is limited to the accesses of Ai1i2i3 and Ai1i3i2 that have the same stride-one index
(i.e. i1) as the accesses to B; all other accesses to A (i.e. Ai2i1i3 ,Ai2i3i1 ,Ai3i1i2 ,Ai3i2i1 )
have a different stride-one index. Likewise, temporal locality in A is only achieved if
any of i1, i2, or i3 are identical; these situations are negligible. Furthermore, due to
the lack of spatial locality, this implementation is not vectorizable without relying on
expensive gather memory operations.
Thanks to the selected loop order, this algorithm can be efficiently parallelized via
a single OpenMP statement (Line 1). This statement parallelizes all loops correspond-
ing to the indices that are not the stride-one index of B. This strategy is ideal in the
sense that it avoids false-sharing between the threads while still providing as much
parallelism as possible.
ACM Transactions on Mathematical Software, Vol. V, No. N, Article A, Publication date: 2017.
Spin Summations: A High-Performance Perspective A:7
Algorithm 2: Temporal locality in A and B
1 #pragma omp parallel for schedule(static,1)
2 for (i3 = 0; i3 < N ; i3++) do
3 for (i2 = 0; i2 ≤ i3; i2++) do
4 for (i1 = 0; i1 ≤ i2; i1++) do
5 for (pi ∈ {pi123, pi132, pi213, pi231, pi312, pi321}) do
6 Bpi(i1,i2,i3) ← Api(i1,i2,i3) +Api(i2,i1,i3) +Api(i3,i2,i1)+
7 Api(i2,i3,i1) +Api(i1,i3,i2) +Api(i3,i1,i2)
4.2. Algorithm 2: Temporal Locality in A and B
The key idea of Algorithm 2, outlined in this subsection, is to exploit the structure of
spin summations and transform the scattered memory accesses to A into a structured
form that makes use of the temporal locality in both the input and output tensor.
One noticeable difference to the previous algorithms is that this algorithm limits the
range of the loops in Lines 2-4 such that the loop variables i1, i2, and i3 do not span
the entire iteration space N3 but only a smaller tetrahedron (0 ≤ i1 ≤ i2 ≤ i3 < N ).3
The smaller iteration space is required due to the mechanics introduced by the new
loop in Line 5. This loop iterates over all d! permutations (Line 5) of the loop variables
i1, i2, and i3 (e.g. (i1,i2,i3), (i1,i3,i2), (i2,i1,i3)).4 The conceptual idea behind this loop is
illustrated in Fig. 1. Consider the spin summation B ← (pi123 + pi321 + pi132)(A) (see
Fig. 1a): each element on the left (e.g. Bi1i2i3 ,Bi1i3i2 ) denotes a different entry of B for
some fixed i1,i2,i3 ∈ {0, 1, ..., N − 1} and is computed in one iteration of the innermost
loop (see Algorithm 2, Line 5); likewise the elements on the right (e.g. Ai1i2i3 ,Ai1i3i2 )
denote different entries of A. The connections between the nodes represent dependen-
cies between these elements; the number of outgoing edges of each node depends on
the actual spin summation (compare Fig. 1a and Fig. 1b). For instance, in Fig. 1a, the
first iteration of the innermost loop computes Bi1i2i3 ← Ai1i2i3 +Ai3i2i1 +Ai1i3i2 , while
the second iteration computes Bi1i3i2 ← Ai1i3i2 +Ai2i3i1 +Ai1i2i3 .
The colors in Fig. 1 identify different connected components;5 this separation into
connected components is not yet important, but we revisit this property in a Sec-
tion 4.6.4.
Each iteration of the innermost loop accesses exactly six memory locations of A
(Lines 6–7). Across all d! iterations of this loop this totals 6×d! accesses to A; however,
only d! of these memory locations are distinct. Thus, every element of A, once loaded,
is reused exactly six times, which results in perfect temporal locality for both tensors
(as long as at least d! elements can be stored in cache). Phrased differently, no element
of A (or B) needs to be loaded twice, if the memory subsystem would operate at the
granularity of an element as opposed to a cacheline.
Figure 2 illustrates the underlying idea of Algorithm 2 on the two-dimensional ex-
ample B ← (α1pi12 + α2pi21)(A). The two update statements (see Fig. 2, right) update
distinct areas of B (denoted by the shaded regions). Both elements Bi1i2 ( ) as well
as Bi2i1 ( ) depend on both Ai1i2 ( ) and Ai2i1 ( ). Thus, temporal locality is achieved
3We ignore special handling of cases where any of the indices i1, i2, and i3 are identical for better readability.
The actual implementation takes care of these cases.
4This loop is only visible in this listing, the actual implementation fully unrolls this loop such that all d!
updates statements are explicit, and omits redundant updates when any of i1, i2, or i3 are equal.
5Two arbitrary nodes a and b belong to the same connected component, if and only if there exists a path
from a to b or from b to a.
ACM Transactions on Mathematical Software, Vol. V, No. N, Article A, Publication date: 2017.
A:8 Paul Springer et al.
Bi1i2i3 Ai1i2i3
Bi1i3i2 Ai1i3i2
Bi2i1i3 Ai2i1i3
Bi2i3i1 Ai2i3i1
Bi3i1i2 Ai3i1i2
Bi3i2i1 Ai3i2i1
(a) B ← (pi123 + pi321 + pi132)(A)
Bi1i2i3 Ai1i2i3
Bi2i1i3 Ai2i1i3
Bi1i3i2 Ai1i3i2
Bi3i1i2 Ai3i1i2
Bi2i3i1 Ai2i3i1
Bi3i2i1 Ai3i2i1
(b) B ← (pi123 + pi213)(A)
Fig. 1: Dependencies between elements of B andA for some fixed i1, i2, i3 ∈ {0, 1, ..., N−
1} on the example of the two spin summations. Connected components are highlighted
by different colors.
A AB
+←
i2i1 i2i1 i2i1
i1i2 i1i2 i1i2
For i2 = 0 : N − 1
For i1 = 0 : i2
Bi1i2 ← α1Ai1i2 + α2Ai2i1
Bi2i1 ← α1Ai2i1 + α2Ai1i2
Fig. 2: Visualization of the 2D spin summation B ← (α1pi12 + α2pi21)(A). The shaded
areas highlight the iteration space processed by the two update statements (right).
Identical memory locations are identified by equally colored squares (i.e. both A’s re-
fer to the same memory location). Cachelines are denoted by red rectangles and next
elements are denoted by striped patches.
since both update statements are executed back-to-back resulting in perfect reuse of
the operands.
One of the fundamental problems of this algorithm, however, is its lack of spatial
locality. As Fig. 2 highlights, the next elements ( ), accessed within the next iteration
of the innermost loop, do not all belong to the previously-loaded cachelines ( ). In other
words: This algorithm does not exhibit spatial locality for all memory accesses. While
this problem does not seem too severe for the two-dimensional example, it becomes
significantly accentuated in higher dimensions.
The missing spatial locality in B is especially troublesome for performance for three
reasons: First and foremost, while this algorithm provides perfect temporal locality at
the element level (i.e. no element needs to be loaded from main memory more than
once), it does not exhibit this property at the cache line level (i.e. by the time another
element—of a previously loaded—cachelines is accessed again this cacheline might
already have been evicted from the caches). Second, the scattered memory accesses
within the innermost loop (see Algorithm 2, Lines 6–7) prevent efficient vectorization.
Finally, parallelizing any of the loops associated to i1, i2, and i3 causes false sharing
between the threads and thus hampers performance.
4.3. Algorithm 3: Reduced FLOP-count
So far we have targeted our optimizations at the caches. In contrast, this section aims
to reduce the required amount of floating-point operations by exploiting the factoriz-
ability of spin summations (Definition 4). Algorithm 3 is implemented in terms of two
dependent spin summations (Lines 1–5 and 6–10) that are directly translated from
ACM Transactions on Mathematical Software, Vol. V, No. N, Article A, Publication date: 2017.
Spin Summations: A High-Performance Perspective A:9
Algorithm 3: Reduced FLOP-count
1 #pragma omp parallel for collapse(2) schedule(static)
2 for (i3 = 0; i3 < N ; i3++) do
3 for (i2 = 0; i2 < N ; i2++) do
4 for (i1 = 0; i1 < N ; i1++) do
5 Ti1i2i3 ← Ai1i2i3 +Ai3i2i1 +Ai1i3i2
6 #pragma omp parallel for collapse(2) schedule(static)
7 for (i3 = 0; i3 < N ; i3++) do
8 for (i2 = 0; i2 < N ; i2++) do
9 for (i1 = 0; i1 < N ; i1++) do
10 Bi1i2i3 ← Ti1i2i3 + Ti2i1i3
their factorized form (Eq. 8); both of these implementations follow the design of Algo-
rithm 1. Thus, the aforementioned drawbacks of this algorithm are also preserved.
We commence by observing that this algorithm requires a significant amount of ad-
ditional memory, on the order of the problem size, to store the auxiliary tensor T .
Second, this algorithm reduces the memory accesses to A from 6N3 to 3N3, but it also
adds three additional memory accesses (one write + two reads) per element of T , which
have not been required before. While the total amount of memory operations remained
unchanged for this test case, other spin summations of interest significantly benefit
from this optimization (see Section 6).
4.4. Algorithm 4: Reduced FLOP-count without Auxiliary Memory
Algorithm 4: Reduced FLOP-count + Fused
1 #pragma omp parallel for schedule(static,1)
2 for (i3 = 0; i3 < N ; i3++) do
3 for (i2 = 0; i2 ≤ i3; i2++) do
4 for (i1 = 0; i1 ≤ i2; i1++) do
5 for (pi ∈ {pi123, pi132, pi213, pi231, pi312, pi321}) do
6 Tpi(i1i2i3) ← Api(i1i2i3) +Api(i3i2i1) +Api(i1i3i2)
7 for (pi ∈ {pi123, pi132, pi213, pi231, pi312, pi321}) do
8 Bpi(i1i2i3) ← Tpi(i1i2i3) + Tpi(i2i1i3)
Algorithm 4 is a combination of Algorithm 2 and Algorithm 3; its main goal is to
combine their advantages while avoiding the need for an auxiliary tensor.
Instead of implementing each of the spin summations in Algorithm 4 in terms of
Algorithm 1, we could alternatively use Algorithm 2. This change allows the loops
in Lines 2–4 and 7–9 of Algorithm 3 to be fused, which—more importantly—reduces
the size of the auxiliary tensor T from Nd to merely d! elements. Thus, accesses to T
become negligible and do not need to be accounted for any longer; this also reflects
positively on the number of memory operations that have to go to main memory. More
precisely, if one neglects the presence of any cache and all other side-effects (e.g. false
sharing), then one could come to the conclusion that Algorithm 4 is preferable over
Algorithm 3 because it reduces the amount of memory operations from (1 + 6)N3 to
(1 + 3)N3.
ACM Transactions on Mathematical Software, Vol. V, No. N, Article A, Publication date: 2017.
A:10 Paul Springer et al.
A:10 Paul Springer et al.
Summation over Tensor Transpositions: A High-Performance Perspective A:9
A:8 Paul Springer et al.
While this algorithm reduces the FLOP-count and removes the requirement for an
auxiliary tensor, the aforementioned drawbacks in Section 3.2 still apply.
3.5. Algorithm 5: BlockingA:8 Paul Springer et al.
While this algorithm reduces the FLOP-count and removes the requirement for an
auxiliary tensor, the aforementioned drawbacks in Section 3.2 still apply.
3.5. Algorithm 5: Blocking
Algorithm 5: Blocking
1 #pragma omp parallel
2 #pragma omp single
3 for (c=0; c < N ; c+=bl) do
4 for (b=0; b < c; b+=bl) do
5 for (a=0; a < b; a+=bl) do
6 #pragma omp task
7 macroKernel(A, B, a, b, c)
Algorithm 6:Macro Kernel
1 Function macroKernel(A, B, a˜, b˜, c˜):
2 for (c=c˜; c < c˜+ bl; c++) do
3 for (b=b˜; b < b˜+ bl; b++) do
4 for (a=a˜; a < a˜+ bl; a++) do
5 for (⇡ 2  3) do
6 T⇡(abc)  2A⇡(abc)  A⇡(cba)  A⇡(acb)
7 for (c=c˜; c < c˜+ bl; c++) do
8 for (b=b˜; b < b˜+ bl; b++) do
9 for (a=a˜; a < a˜+ bl; a++) do
10 for (⇡ 2  3) do
11 B⇡(abc)  2T⇡(abc)   T⇡(bac)
Algorithm 5a capitalizes on the analysis of the previous sections and combines the
benefits of Algorithms 1-4. The algorithm presented in this section forms the founda-
tion for the truly high-performance implementation of Section 3.6.
As mentioned above, Algorithm 4 already comes with some desirable properties such
as reduced flop-count as well as temporal locality at the element level. However, as
pointed out earlier, this locality does not extend to the cache level. Thus, the main ob-
jective of Algorithm 5a is to provide temporal locality at the cache level (i.e., a cacheline
once loaded frommain memory does not have to be loaded again). The key optimization
to enable such a design is blocking (see Section 2.3).
Algorithm 5a is very similar to Algorithm 4 with the critical difference that Algo-
rithm 5a operates on blocks of size bl3 as opposed to scalars. The calculation of these
blocks is separated out into a so called macro kernel (see Algorithm 5b).
The most important parameter for this algorithm is the blocksize bl.3 This parameter
is subject to several constraints in order to attain high performance: First, bl should be
3For the course of this discussion, we assume that N is evenly divisible by the blocksize bl. However, the
actual implementation does not have this constraint and generalizes to non-square blocks and, thus, works
for arbitrary sizes. The curious reader is referred to the published source code available at ....
ACM Transactions on Mathematical Software, Vol. V, No. N, Article A, Publication date: 2016.
A:8 Paul Springer et al.
While this algorithm reduces the FLOP-count and removes the requirement for an
auxiliary tensor, the aforementioned drawbacks in Section 3.2 still apply.
3.5. Algorithm 5: Blocking
Algorithm 5: Blocking
1 #pragma omp parallel
2 #pragma omp single
3 for (c=0; c < N ; c+=bl) do
4 for (b=0; b < c; b+=bl) do
5 for (a=0; a < b; a+=bl) do
6 #pragma omp task
7 macroKernel(A, B, a, b, c)
Algorithm 6:Macro Kernel
1 Function macroKernel(A, B, a˜, b˜, c˜):
2 for (c=c˜; c < c˜+ bl; c++) do
3 for (b=b˜; b < b˜+ bl; b++) do
4 for (a=a˜; a < a˜+ bl; a++) do
5 for (⇡ 2  3) do
6 T⇡(abc)  2A⇡(abc)  A⇡(cba)  A⇡(acb)
7 for (c=c˜; c < c˜+ bl; c++) do
8 for (b=b˜; b < b˜+ bl; b++) do
9 for (a=a˜; a < a˜+ bl; a++) do
10 for (⇡ 2  3) do
11 B⇡(abc)  2T⇡(abc)   T⇡(bac)
Algorithm 5a capitalizes on the analysis of the previous sections and combines the
benefits of Algorithms 1-4. The algorithm presented in this section forms the founda-
tion for the truly high-performance implementation of Section 3.6.
As mentioned above, Algorithm 4 already comes with some desirable properties such
as reduced flop-count as well as temporal locality at the element level. However, as
pointed out earlier, this locality does not extend to the cache level. Thus, the main ob-
jective of Algorithm 5a is to provide temporal locality at the cache level (i.e., a cacheline
once loaded frommain memory does not have to be loaded again). The key optimization
to enable such a design is blocking (see Section 2.3).
Algorithm 5a is very similar to Algorithm 4 with the critical difference that Algo-
rithm 5a operates on blocks of size bl3 as opposed to scalars. The calculation of these
blocks is separated out into a so called macro kernel (see Algorithm 5b).
The most important parameter for this algorithm is the blocksize bl.3 This parameter
is subject to several constraints in order to attain high performance: First, bl should be
3For the course of this discussion, we assume that N is evenly divisible by the blocksize bl. However, the
actual implementation does not have this constraint and generalizes to non-square blocks and, thus, works
for arbitrary sizes. The curious reader is referred to the published source code available at ....
ACM Transactions on Mathematical Software, Vol. V, No. N, Article A, Publication date: 2016.
Algorithm 5: Blocked
Algorith 5a: Blocking
1 #pragma omp parallel
2 #pragma omp single
3 for (c=0; c < N ; c+=bl) do
4 for (b=0; b < c; b+=bl) do
5 for (a=0; a < b; +=bl) do
6 #pragma omp task
7 macroKernel(A, B, a, b, c)
Algorithm 5b:Macro Kernel
1 Function macroKernel(A, B, a˜, b˜, c˜):
2 for (c= ;˜ c < c˜+ bl; c++) do
3 for (b=b˜; b < b˜+ bl; b++) do
4 for (a=a˜; a < a˜+ bl; a++) do
5 for (⇡ 2  3) do
6 T⇡(abc)  2A⇡(abc)  A⇡(cba)  A⇡(acb)
7 for (c=c˜; c < c˜+ bl; c++) do
8 for (b=b˜; b < b˜+ bl; b++) do
9 for (a=a˜; a < a˜+ bl; a++) do
10 for (⇡ 2  3) do
11 B⇡(abc)  2T⇡(abc)   T⇡(bac)
ACM Transactions on Mathematical Software, Vol. V, No. N, Article A, Publication date: 2016.
A:8 Paul Springer et al.
While this algorithm reduces the FLOP-count and removes the requirement for an
auxiliary tensor, the aforementioned drawbacks in Section 3.2 still apply.
3.5. Algorithm 5: BlockingA:8 Paul Springer et al.
While this algorithm reduces the FLOP-count and removes the requirement for an
auxiliary tensor, the aforementioned drawbacks in Section 3.2 still apply.
3.5. Algorithm 5: Blocking
Algorithm 5: Blocking
1 #pragma omp parallel
2 #pragma omp single
3 for (c=0; c < N ; c+=bl) do
4 for (b=0; b < c; b+=bl) do
5 for (a=0; a < b; a+=bl) do
6 #pragma omp task
7 macroKernel(A, B, a, b, c)
Algorithm 6:Macro Kernel
1 Function macroKernel(A, B, a˜, b˜, c˜):
2 for (c=c˜; c < c˜+ bl; c++) do
3 for (b=b˜; b < b˜+ bl; b++) do
4 for (a=a˜; a < a˜+ bl; a++) do
5 for (⇡ 2  3) do
6 T⇡(abc)  2A⇡(abc)  A⇡(cba)  A⇡(acb)
7 for (c=c˜; c < c˜+ bl; c++) do
8 for (b=b˜; b < b˜+ bl; b++) do
9 for (a=a˜; a < a˜+ bl; a++) do
10 for (⇡ 2  3) do
11 B⇡(abc)  2T⇡(abc)   T⇡(bac)
Algorithm 5a capitalizes on the analysis of the previous sections and combines the
benefits of Algorithms 1-4. The algorithm presented in this section forms the founda-
tion for the truly high-performance implementation of Section 3.6.
As mentioned above, Algorithm 4 already comes with some desirable properties such
as reduced flop-count as well as temporal locality at the element level. However, as
pointed out earlier, this locality does not extend to the cache level. Thus, the main ob-
jective of Algorithm 5a is to provide temporal locality at the cache level (i.e., a cacheline
once loaded frommain memory does not have to be loaded again). The key optimization
to enable such a design is blocking (see Section 2.3).
Algorithm 5a is very similar to Algorithm 4 with the critical difference that Algo-
rithm 5a operates on blocks of size bl3 as opposed to scalars. The calculation of these
blocks is separated out into a so called macro kernel (see Algorithm 5b).
The most important parameter for this algorithm is the blocksize bl.3 This parameter
is subject to several constraints in order to attain high performance: First, bl should be
3For the course of this discussion, we assume that N is evenly divisible by the blocksize bl. However, the
actual implementation does not have this constraint and generalizes to non-square blocks and, thus, works
for arbitrary sizes. The curious reader is referred to the published source code available at ....
ACM Transactions on Mathematical Software, Vol. V, No. N, Article A, Publication date: 2016.
A:8 Paul Springer et al.
While this algorithm reduces the FLOP-count and removes the requirement for an
auxiliary tensor, the aforementioned drawbacks in Section 3.2 still apply.
3.5. Algorithm 5: Blocking
Algorithm 5: Blocking
1 #pragma omp parallel
2 #pragma omp single
3 for (c=0; c < N ; c+=bl) do
4 for (b=0; b < c; b+=bl) do
5 for (a=0; a < b; a+=bl) do
6 #pragma omp task
7 macroKernel(A, B, a, b, c)
Algorithm 6:Macro Kernel
1 Function macroKernel(A, B, a˜, b˜, c˜):
2 for (c=c˜; c < c˜+ bl; c++) do
3 for (b=b˜; b < b˜+ bl; b++) do
4 for (a=a˜; a < a˜+ bl; a++) do
5 for (⇡ 2  3) do
6 T⇡(abc)  2A⇡(abc)  A⇡(cba)  A⇡(acb)
7 for (c=c˜; c < c˜+ bl; c++) do
8 for (b=b˜; b < b˜+ bl; b++) do
9 for (a=a˜; a < a˜+ bl; a++) do
10 for (⇡ 2  3) do
11 B⇡(abc)  2T⇡(abc)   T⇡(bac)
Algorithm 5a capitalizes on the analysis of the previous sections and combines the
benefits of Algorithms 1-4. The algorithm presented in this section forms the founda-
tion for the truly high-performance implementation of Section 3.6.
As mentioned above, Algorithm 4 already comes with some desirable properties such
as reduced flop-count as well as temporal locality at the element level. However, as
pointed out earlier, this locality does not extend to the cache level. Thus, the main ob-
jective of Algorithm 5a is to provide temporal locality at the cache level (i.e., a cacheline
once loaded frommain memory does not have to be loaded again). The key optimization
to enable such a design is blocking (see Section 2.3).
Algorithm 5a is very similar to Algorithm 4 with the critical difference that Algo-
rithm 5a operates on blocks of size bl3 as opposed to scalars. The calculation of these
blocks is separated out into a so called macro kernel (see Algorithm 5b).
The most important parameter for this algorithm is the blocksize bl.3 This parameter
is subject to several constraints in order to attain high performance: First, bl should be
3For the course of this discussion, we assume that N is evenly divisible by the blocksize bl. However, the
actual implementation does not have this constraint and generalizes to non-square blocks and, thus, works
for arbitrary sizes. The curious reader is referred to the published source code available at ....
ACM Transactions on Mathematical Software, Vol. V, No. N, Article A, Publication date: 2016.
Algorithm 5: Blocked
Algorith 5a: Blocking
1 #pragma omp parallel
2 #pragma omp single
3 for (c=0; c < N ; c+=bl) do
4 for (b=0; b < c; b+=bl) do
5 for (a=0; a < b; +=bl) do
6 #pragma omp task
7 macroKernel(A, B, a, b, c)
Algorithm 5b:Macro Kernel
1 Function macroKernel(A, B, a˜, b˜, c˜):
2 for (c= ;˜ c < c˜+ bl; c++) do
3 for (b=b˜; b < b˜+ bl; b++) do
4 for (a=a˜; a < a˜+ bl; a++) do
5 for (⇡ 2  3) do
6 T⇡(abc)  2A⇡(abc)  A⇡(cba)  A⇡(acb)
7 or (c=c˜; c < c˜+ bl; c++) o
8 for (b=b˜; b < b˜+ bl; b++) do
9 for (a=a˜; a < a˜+ bl; a++) do
10 for (⇡ 2  3) do
11 B⇡(abc)  2T⇡(abc)   T⇡(bac)
ACM Transactions on Mathematical Software, Vol. V, No. N, Article A, Publication date: 2016.
Algorithm 5: The blocked algorithm (left) is implemented in terms of a macro kernel
(right). Each invocation of the macro kernel calculates d! blocks—of size bl3—of the
output tensor. T refers to auxiliary memory. We omit the cases where any of i1, i2 or i3
are iden cal to improve the read bility of the pseudo code.
While this algorithm reduces the FLOP-count and removes the requirement for an
auxiliary tensor, the drawbacks mentioned in Section 3.2 still apply.
3.5. Algorithm 5: Blocking
Algorithm 5a: Blocking
1 #pragma omp parallel
2 #pr gma omp single
3 for (i3=0; i3 < N ; i3+=bl) do
4 for (i2=0; i2 < i3; i2+=bl) o
5 fo (i1=0; i1 < i2; i1+=bl) do
6 #pragma omp task
7 macroKernel(A, B, i1, i2, i3)
The algorithm presented in this section capitalizes on the analysis of the previous
sections and combines the benefits of Algorithms 1-4. Algorithms 5 form the foundation
for the truly high-performance implementation presented in Section 3.6.
While algorithm 4 already expresses some desirable properties such as reduced flop-
count as well as temporal locality for individual elements, this locality does not extend
to entire cachelines. Thus, the main objective of Algorithm 5 is to provide temporal
locality at the cache level (i.e., a cacheline is only loaded from main memory once). The
key optimization to enable such a design is blocking (see Section 2.3).
Algorithm 5a is very similar to Algorithm 4 with the critical difference that Algo-
rithm 5a operates on blocks of size bl3 as opposed to on scalars. The calculation of
these blocks is separated out into a so called macro kernel (see Algorithm 5b). All in-
vocations of the macro kernel are independent from one another and can, therefor, be
computed in parallel by different threads. In contrast to the loop-based parallelism (Al-
gorithms 1-4), this algorithm uses tasked-based parallelism. More precisely, a single
ACM Transactions on Mathematical Software, Vol. V, No. N, Article A, Publication date: 2016.
A:10 Paul Springer et al.
Algorithm 5b:Macr Kernel
1 Func macroKerne (A, B, i˜1, i˜2, i˜3):
2 for (i3=i˜3; i3 < i˜3 + bl; i3++) do
3 for (i2=i˜2; i2 < i˜2 + bl; i2++) do
4 for (i1=i˜1; i1 < i˜1 + bl; i1++) do
5 for (⇡ 2 {⇡123,⇡132,⇡213,⇡231,⇡312,⇡321}) do
6 T⇡(i1i2i3)  A⇡(i1i2i3) +A⇡(i3i2i1) +A⇡(i1i3i2)
7 for (i3=i˜3; i3 < i˜3 + bl; i3++) do
8 for (i2=i˜2; i2 < i˜2 + bl; i2++) do
9 for (i1=i˜1; i1 < i˜1 + bl; i1++) do
10 for (⇡ 2 {⇡123,⇡132,⇡213,⇡231,⇡312,⇡321}) do
11 B⇡(i1i2i3)  T⇡(i1i2i3) + T⇡(i2i1i3)
thread (Algorithm 5a, Line 2) generates all the tasks (Algorithm 5a, Line 6-7) which
are then dynamically processed by any of the available threads.
The most important parameter for this algorithm is the blocksize bl.5 This parame-
ter is subject to several constraints imposed by our high performance objective: First,
bl should be a multiple of the cacheline size—in elements—(e.g., 8 for double preci-
sion calculations) to optimally exploit spatial locality. Second, all 2 ⇥ d! blocks—per
thread—must fit into the shared LLC. Moreover, bl should be as large as possible to
facilitate more consecutive memory accesses and (if possible) take advantage of the
adjacent cacheline prefetching feature of modern CPUs; moreover, larger blocks reduce
the parallelization overhead introduced by the tasked-based parallelization scheme.
These constraints respectively suggest bl = 16 for and bl = 8 for 3D and 4D spin sum-
mations. More precisely, 3D and 4D spin summations respectively require 384KiB and
1536KiB of cache per thread; this fits well into the available last level cache of modern
CPUs (e.g., an Intel Xeon E5-2680 v3 offers 2560KiB LLC per core). We point out that
The scratchpad memory T (see Algorithm 5b) is allocated as a contiguous chunk
of data in order to avoid severe cache conflicts (caused by the limited associativity of
the caches). Moreover, each thread initializes its portion of T individually to account
for the non-uniform memory accesses of modern CPUs. We point out that—despite our
best efforts to keep all accessed blocks in cache—, this algorithm may still experience
few redundant loads from main memory due to the accesses to A;6 we suggest an
optimization for this scenario in Section 3.6.4.
The small—yet important—distinctions between operating on blocks (Algorithm 5a)
as opposed to scalars (Algorithm 4) entails several positive effects on performance:
First and foremost, provided that the blocks remain in some level of the cache hierar-
chy, this is the first algorithm that achieves temporal locality for entire cachelines for
both A and B. Second, this algorithm does not experience any false-sharing between
the threads. Finally, the abundant available parallelism and the dynamic tasked-based
scheduling together provide good load-balancing; for instance, a tensor as small as 404
elements (equivalent to 19.5MiB) already offers 70 independent tasks.
5For the course of this discussion, we assume that N is evenly divisible by the blocksize bl. However, the
actual implementation does not have this constraint and generalizes to non-square blocks and, thus, works
for arbitrary sizes. The curious reader is referred to the published source code available at ....
6The blocks of A are not consecutive in main memory and, therefor, may cause cache conflicts.
ACM Transactions on Mathematical Software, Vol. V, No. N, Article A, Publication date: 2016.
Algorithm 5: The blocked algori m (left) is mplemented in t rms of a macro kernel
(right). Each invocation of the macro kernel cal u ates d! blocks—of size bl3—of the
output tensor. T refers to auxiliary memory. We ignore special handling of cases where
any of i˜1, i˜2, or i˜3 are identical to improve the readability of the pseudo code.
Algorithm 5a: Blocking
1 #pragma omp parallel
2 #pragma omp single
3 for (˜i3=0; i˜3  N ; i˜3+=bl) do
4 for (˜i2=0; i˜2  i˜3; i˜2+=bl) do
5 for (˜i1=0; i˜1  i˜2; i˜1+=bl) do
6 #pragma omp task
7 macroKernel(A, B, i˜1, i˜2, i˜3)
to entire cachelines. Thus, the main objective of Algorithm 5a is to provide temporal
locality at the cache level (i.e., a cacheline is only loaded from main memory once). The
key optimization to enable such a design is blocking (see Section 3.3).
Algorithm 5a is very similar to Algorithm 4 with the critical difference that Algo-
rithm 5a operates on blocks of size bl3 as opposed to scalars. The calculation of these
blocks is separated out into a so called macro kernel (see Algorithm 5ab). All inv -
cations of the macro kernel are independent from one anothe and can, therefor, be
computed in parallel by differen threads. In contrast to the loop-based para lelism
(Algorithms 1–4), this algorithm uses tasked-based arall lism. More precisely, a sin-
gle thread (Algorithm 5aa, Line 2) generates all the tasks (Algorithm 5aa, Lines 6–7)
which are then dynamically processed by any of the available threads.
The most important parameter for this algorithm is the blocksize bl.5 This parame-
ter is subject to several constraints imposed by our high performance objective: First,
bl should be a multiple of the cacheline size—in elements—(e.g. 8 for double preci-
5For the course of this discussion, we assume that N is evenly divisible by the blocksize bl. However, the
actual implementation does not have this constraint and generalizes to non-square blocks and, thus, works
for arbitrary sizes. The curious reader is referred to the published source code available at ....TODO –Devin
.
ACM Transactions on Mathematical Software, Vol. V, No. N, Article A, Publication date: 2016.
A:10 Paul Springer et al.
Algorithm 5b:Macro Kernel
1 Function macroKernel(A, B, i˜1, i˜2, i˜3):
2 for (i3=i˜3; i3 < i˜3 + bl; i3++) do
3 for (i2=i˜2; i2 < i˜2 + bl; i2++) do
4 for (i1=i˜1; i1 < i˜1 + bl; i1++) do
5 for (⇡ 2 {⇡123,⇡132,⇡213,⇡231,⇡312,⇡321}) do
6 T⇡(i1i2i3)  A⇡(i1i2i3) +A⇡(i3i2i1) +A⇡(i1i3i2)
7 for (i3=i˜3; i3 < i˜3 + bl; i3++) do
8 for (i2=i˜2; i2 < i˜2 + bl; i2++) do
9 for (i1=i˜1; i1 < i˜1 + bl; i1++) do
10 for (⇡ 2 {⇡123,⇡132,⇡213,⇡231,⇡312,⇡321}) do
11 B⇡(i1i2i3)  T⇡(i1i2i3) + T⇡(i2i1i3)
thread (Algorithm 5a, Line 2) generates all the ta ks (Algorithm 5a, Line 6-7) which
are then dynamically proc ssed by any of the available threads.
The most important parameter for this algori hm is the blocksize bl.5 This parame-
ter is subject to several constraints imposed by our high performance obj ctive: First,
bl should be a multiple of the cacheline size—in elements—(e.g., 8 for double preci-
sion calculations) to optimally exploit spatial locality. Second, all 2 ⇥ d! blocks—per
thread—must fit into the shared LLC. Moreover, bl should be as large as possible to
facilitate more consecutive memory accesses and (if possible) take advantage of the
adjacent cacheline prefetching feature of modern CPUs; moreover, larger blocks reduce
the parallelization overhead introduced by the tasked-based parallelization scheme.
These constraints respectively suggest bl = 16 for and bl = 8 for 3D and 4D spin sum-
mations. More precisely, 3D and 4D spin summations respectively require 384KiB and
1536KiB of cache per thread; this fits well into the available last level cache of modern
CPU (e g., an Intel X on E5-2680 v3 offers 2560KiB LLC p r core). We point out that
The scratchpad memory T (see Algorithm 5b) is allocated as a contiguous chunk
of data in order to avoid severe cache conflicts (caused by the limited associativity of
the cach s). More ver, each thread initializes its portion of T individually to account
for the non-uniform memory accesses of modern CPUs. We point out that—despite our
best efforts to keep all accessed blocks in cache—, this algorithm may still experience
few redundant loads from main memory due to the accesses to A;6 we suggest an
optimization for this scenario in Section 3.6.4.
The small—yet important—distinctions between operating on blocks (Algorithm 5a)
as opposed to scalars (Algorithm 4) entails several positive effects on performance:
First and foremost, provided that the blocks remain in some level of the cache hierar-
chy, this is the first algorithm that achieves temporal locality for entire cachelines for
both A and B. Second, this algorithm does not experience any false-sharing between
the threads. Finally, the abundant available parallelism and the dynamic tasked-based
scheduling together provide good load-balancing; for instance, a tensor as small as 404
elements (equivalent to 19.5MiB) already offers 70 independent tasks.
5For the course of this discussion, we assume that N is evenly divisible by the blocksize bl. However, the
actual implementation does not have this constraint and generalizes to non-square blocks and, thus, works
for arbitrary sizes. The curious reader is referred to the published source code available at ....
6The blocks of A are not consecutive in main memory and, therefor, may cause cache conflicts.
ACM Transactions on Mathematical Software, Vol. V, No. N, Article A, Publication date: 2016.
Algorithm 5: The blocked algori m (left) is mplemented in t rms of a macro-k rnel
(right). Each invocation of the mac o-kernel c lcul tes d! blocks of size bl3 of the output
tensor. T refers to auxiliary memory. We ignore special handling of cases where any of
i˜1, i˜2, or i˜3 are identical to improve the readability of the pseudo code.
While this algorithm reduces the FLOP-cou t a d removes the requir ment for an
auxiliary tensor, the drawbacks mention d in Section 4.2 still apply.
4.5. Algorithm 5: Blocking
Algorithm 5 presented in this section capitalizes on the analysis of the previous sec-
tions and combines the benefits of Algorithms 1–4. It forms the foundation f r the
high-performance implementation resen ed in Section 4.6.
While Algorithm 4 already expresses some desirable properties such as reduced
FLOP-count as well as temporal locality for individual elements, this locality does not
extend to entire cachelines. Thus, the main objective of Algorithm 5 is to provide tem-
poral locality at the cache level (i.e. a cacheline is only loaded from main memory once).
The key optimization to enable such a design is blocking (see Section 3.3).
Algorithm 5 is very similar to Algorithm 4 with the critical difference that Algo-
rithm 5 operates on blocks of size bl3 as opposed to scalars. The calculation of these
blocks is separated out into a so called acro-kernel (see Algorithm 5b). All i voca-
tions of the macro-kernel are independent from one anothe , and can therefor be com-
puted in parallel by different threads. In contra t to the loop-bas d parall lism (Al-
gorithms 1–4), this algorithm uses tasked-based parallelis . More pr ci ely, a single
thread (Algorithm 5a, Line 2) generates ll the tasks (Algorithm 5a, Lines 6–7), which
are then dynamically processed by any of the available threads.
The most important parameter in this algorithm is the blocksize bl.6 This param-
eter is subject to several constraints imposed by our high performance objective:
First, bl should be a multiple of the cacheline size—in elements (e.g. 8 for double-
precision calculations)—to optimally exploit spatial locality. Second, all 2× d! blocks—
per thread—must fit into the shared last level cache. Moreover, bl should be as large as
possible to facilitate more memory accesses to consecutive elements and (if possible)
take advantage of the adjacent cac eline prefetching f ature of modern CPUs; m re-
6For the course of this discussion, we assume that N is evenly divisible by the blocksize bl. However, the
actual implementation does not have this constraint and generalizes to non-square blocks and, thus, works
for arbitrary sizes.
ACM Transactions on Mathematical Software, Vol. V, No. N, Article A, Publication date: 2017.
Spin Summations: A High-Performance Perspective A:11
Algorithm 6: High-Performance macro-kernel. Identical subtensors are
highlighted by the same color. The notation Ti˜1 i˜2 i˜3 [i1, i2, i3] is equivalent toTi˜1+i1 ,˜i2+i2 ,˜i3+i3 ; it is merely used to separate the loop variables, i1, i2, i3 from
the offsets, i˜1, i˜2, i˜3.
1 Function macroKernel(A, B, i˜1, i˜2, i˜3):
2 for (i2 = 0; i2 < bl; i2++) do
3 for (i3 = 0; i3 < bl; i3++) do
4 for (i1 = 0; i1 < bl; i1++) do
5 Ti˜1 i˜2 i˜3 [i1, i2, i3]←Ai˜1 i˜2 i˜3 [i1, i2, i3]+Ai˜3 i˜2 i˜1 [i3, i2, i1]+Ai˜1 ,˜i3 ,˜i2 [i1, i3, i2]
6 Ti˜1 i˜3 i˜2 [i1, i2, i3]←Ai˜1 i˜3 i˜2 [i1, i2, i3]+Ai˜2 i˜3 i˜1 [i3, i2, i1]+Ai˜1 ,˜i2 ,˜i3 [i1, i3, i2]
7 Ti˜2 i˜1 i˜3 [i1, i2, i3]←Ai˜2 i˜1 i˜3 [i1, i2, i3]+Ai˜3 i˜1 i˜2 [i3, i2, i1]+Ai˜2 ,˜i3 ,˜i1 [i1, i3, i2]
8 Ti˜2 i˜3 i˜1 [i1, i2, i3]←Ai˜2 i˜3 i˜1 [i1, i2, i3]+Ai˜1 i˜3 i˜2 [i3, i2, i1]+Ai˜2 ,˜i1 ,˜i3 [i1, i3, i2]
9 Ti˜3 i˜1 i˜2 [i1, i2, i3]←Ai˜3 i˜1 i˜2 [i1, i2, i3]+Ai˜2 i˜1 i˜3 [i3, i2, i1]+Ai˜3 ,˜i2 ,˜i1 [i1, i3, i2]
10 Ti˜3 i˜2 i˜1 [i1, i2, i3]←Ai˜3 i˜2 i˜1 [i1, i2, i3]+Ai˜1 i˜2 i˜3 [i3, i2, i1]+Ai˜3 ,˜i1 ,˜i2 [i1, i3, i2]
// second part Bpi(i1i2i3) ← Tpi(i1i2i3) + Tpi(i2i1i3) omitted ...
over, larger blocks reduce the parallelization overhead introduced by the tasked-based
parallelization scheme. These constraints respectively suggest bl = 16 and bl = 8 for
3D and 4D spin summations. More precisely, 3D and 4D spin summations respectively
require 384KiB and 1536KiB of cache per thread; this fits well into the available last
level cache of modern CPUs (e.g. an Intel Xeon E5-2680 v3 offers 2560KiB LLC per
core).
The scratchpad memory T (see Algorithm 5b) is allocated as a contiguous chunk of
memory in order to avoid severe cache conflicts (caused by the limited associativity of
the caches). Moreover, each thread initializes its portion of T individually to account
for the non-uniform memory accesses of modern CPUs. We point out that—despite our
best efforts to keep all accessed blocks in cache—this algorithm may still experience
few redundant loads from main memory due to the accesses to A;7 we suggest an
optimization for this scenario in Section 4.6.4.
The small—yet important—distinctions between operating on blocks (Algorithm 5)
as opposed to scalars (Algorithm 4) entails several positive effects on performance:
First and foremost, provided that the blocks remain in some level of the cache hier-
archy, this is the first algorithm that achieves temporal locality for entire cachelines
for both A and B. Second, this algorithm does not experience any false-sharing be-
tween the threads since bl is chosen to be a multiple of the cacheline size. Finally, the
abundantly available parallelism and the dynamic tasked-based scheduling together
provide good load-balancing; for instance, a tensor as small as 404 elements (equiva-
lent to 19.5MiB) already offers 70 independent tasks.
All in all, Algorithm 5 exhibits many desirable properties, but it does not offer spatial
locality in the sense that a cacheline—once loaded into L1 cache—is consumed at once;
most of the elements of this cacheline are accessed at a later stage. This delayed access
necessitates an additional load of the cacheline from some higher level (i.e. L2 or L3)
of the cache hierarchy. This disadvantage is addressed in the next section.
ACM Transactions on Mathematical Software, Vol. V, No. N, Article A, Publication date: 2017.
A:12 Paul Springer et al.
macro-kernel
AB
+ α2×
+ α2×
bl
bl
← α1×
← α1×
Fig. 3: Schematic overview for the vectorization of the 2D spin summation B ← (α1pi12+
α2pi21)(A). Same blocks are colored identically; each block denotes a 2D face consisting
of bl2 elements. The transpose of a block is indicated by an double-edged arrow.
4.6. Algorithm 6: High Performance
We combine the underlying ideas of the previous implementations into a high-
performance algorithm. All optimizations introduced in this section only affect the
macro-kernel (Algorithm 5b); the remainder is identical to Algorithm 5a.
4.6.1. Restructured Block-traversal. The main challenge for a high-performance macro-
kernel is the strided memory accesses pattern of Algorithm 5b. For instance, each of
the loop variables (i.e. i1, i2, i3) associated to the loops in Lines 2–4 of Algorithm 5b is
a stride-one index for either the output or the input tensor at some iteration of the loop
in Line 5. This circumstance poses significant problems with respect to spatial locality
and makes efficient vectorization impossible.
Algorithm 6 outlines the new macro-kernel; despite the fact that it looks quite differ-
ent from its predecessor (Algorithm 5b), they are indeed logically identical. The critical
difference, however, is that Algorithm 6 traverses the d! blocks (each of size bld) in a dif-
ferent order; identical blocks—those blocks for which i˜1, i˜2, and i˜3 appear in the same
order—are colored identically. Algorithm 6 can be deduced from Algorithm 5b via the
following steps: (1) Pull the loop starts (i.e. i˜1, i˜2 and i˜3) into the A... notation; this
simple change renders all loops indistinguishable from one another (i.e. all loops have
the same start position and loop trip count). (2) Now that the loops in Lines 2–4 are
indistinguishable, rename the loop variables within each update statement individu-
ally (Lines 5–10); different names correspond to different traversal orders. While any
traversal is valid from a correctness perspective, some traversal orders are preferable
from a high-performance point of view. Thus, we rename the loop variables in each line
such that both the input and output tensors are only accessed by at most two different
stride-one indices (i.e. the leftmost indices: i1 and i3).
Now that we only have to deal with at most two different stride-one indices (Algo-
rithm 6), we can reorder the loops the loops (Lines 2–4) such that the loops associated
to the stride-one indices are the innermost loops. This strategy greatly improves spa-
tial locality and enables efficient vectorization along these loops (see next Section).
4.6.2. Vectorization. The vectorization scheme for spin summations is quite similar to
that of tensor transpositions [Springer et al. 2016; Springer and Bientinesi 2016b] in
the sense that both operate on 2D faces of arbitrary-dimensional tensors. However,
7The blocks of A are not consecutive in main memory and, therefor, may cause cache conflicts.
ACM Transactions on Mathematical Software, Vol. V, No. N, Article A, Publication date: 2017.
Spin Summations: A High-Performance Perspective A:13
in contrast to tensor transpositions, spin summations have to load multiple faces and
add them together—not just one. All these faces have to share at most two different
stride-one indices to utilize fully-vectorized memory operations (avoiding scatter and
gather).
Figure 3 illustrates the vectorization on the example of the 2D spin summation B ←
(α1pi12 + α2pi21)(A). The vectorized macro-kernel proceeds as follows: (1) Load the two
bl2 blocks of A ( , ) into the cache (mostly L1 and L2); (2) add the corresponding
transposed blocks ( , ) to yield the output ( , ); finally, (3) store these blocks to the
correct locations in B. All these steps are fully vectorized (see [Springer et al. 2016;
Springer and Bientinesi 2016b] for details), exploiting spatial locality.
Given that our vectorization requires no more than two different stride-one indices,
the natural question is whether we can always restructure the traversal within the
macro-kernel such that this requirement is met. While most of the spin summations of
interest are vectorizable without any additional work, some require an additional reg-
ularization step. For instance, the spin summation B ← (α1pi123 + α2pi213 + α3pi321)(A)
has three different stride-one indices; thus, any renaming strategy of the indices is
bound to fail. However, such a situation can be avoided by a regularization step that
transposes the input tensor A prior to the actual spin summation and account for
this change at a later stage (see Eq. 9–10). More precisely, instead of computing
B ← (α1pi123 + α2pi213 + α3pi321)(A) directly, one could also transpose T ← pi213(A) and
compute the output as B ← (α1pi213 + α2pi123 + α3pi231)(T ) that only has two different
stride-one indices; this procedure is applicable to all the spin summations that we are
interested in and enables efficient vectorization.
4.6.3. Non-Temporal Stores. To avoid the write-allocate traffic (see Section 3.2) incurred
by the writes to the output tensor B, we use non-temporal store instructions whenever
possible. It is important to realize that non-temporal stores should only be applied to
the writes to B and not to T , since the latter is reused repeatedly. More importantly,
failing to adhere to this policy results in poor performance because cachelines written
via non-temporal stores are marked as invalid, and a later access to such a cacheline
incurs a redundant read from main memory.
To make efficient use of non-temporal stores, it is critical to reorder the loops of
the macro-kernel such that writes to the same cacheline occur soon after one an-
other.8 Such writes enables the underlying microarchitecture to write-combine suc-
cessive writes to the same cacheline into a single memory transaction. Thus, we order
the loops within the macro-kernel such that the loop associated to the stride-one in-
dex of B is the innermost loop (see Algorithm 6); failing to do so results in redundant
reads/writes from/to main memory and, thus, significantly lowers the performance.
4.6.4. Cache Optimization. Section 4.5 already pointed out that the blocking scheme
might require redundant reads from main memory if the accesses to A cause conflict
misses [Bryant et al. 2003]. This problem may be averted by (1) larger caches, (2)
larger cache associativity or (3) smaller cache requirements of the algorithm. While
we clearly cannot do anything about (1) and (2), there is some opportunity for (3).
Recall that the blocking scheme (see Section 4.5) requires each thread to keep all its
blocks—traversed within the macro-kernel—in some level of the cache hierarchy. More
precisely, computing a spin summation over d-dimensional double-precision tensors
with m threads and block of size bld requires a minimal L3 cache size Cmin
Cmin = bld ×m× x× 8bytes, (11)
8On an AVX-enabled architecture one needs two non-temporal stores (each writing 32 byte) to update an
entire cacheline (64 byte).
ACM Transactions on Mathematical Software, Vol. V, No. N, Article A, Publication date: 2017.
A:14 Paul Springer et al.
where x = 2 × d! represents the number of blocks traversed within the macro-kernel.
To reduce Cmin one may try to decrease either bl, m, or x. As discussed previously, de-
creasing bl is not desirable and leads to poor performance (data not shown). Similarly,
measurements show that reducing the number of threads m to a point where some
CPU cores are idle also yields bad performance (data not shown) because the benefits
of more last level cache—per thread—are countered by the reduced parallelism. This
leaves us with the last option: reducing x.
Consider the spin summation B ← (pi123+ pi213)(A) outlined in Fig. 1b, where identi-
cally colored nodes correspond to update statements that belong to the same connected
component and therefore share operands. For instance, the two update statements
Bi1i2i3 ← Ai1i2i3 + Ai2i1i3 and Bi2i1i3 ← Ai2i1i3 + Ai1i2i3—highlighted in blue—share
the same elements (i.e. Bi1i2i3 ,Bi2i1i3 ,Ai1i2i3 ,Ai2i1i3 ) but they do not use any of the
elements belonging to other updates; the same analysis extends to blocks instead of
elements. Thus, blocks of B that belong to different connected components may be
computed separately from one another without losing any spatial locality or temporal
locality. For instance, separating a spin summation over d-dimensional tensors into c
connected components reduces the amount of blocks x from 2× d! to 2× d!c . Thus, this
technique can reduce the pressure on the cache hierarchy significantly.
This optimization is not applicable to all spin summations: For instance, for the spin
summations B ← (pi123 + pi321 + pi132)(A) (Fig. 1a), a decomposition into independent
connected components is not possible.
4.6.5. In-Place. In-place spin summations (i.e. A ← ∑n
i=1
αiωi(A)) facilitate another
effective opportunity to avoid the write-allocate traffic. While high-performance im-
plementations of in-place tensor transpositions for 2D, non-square tensors are a hard
problem [Ding 2001; He and Ding 2002], our algorithm is (almost) immediately appli-
cable to in-place, “hyper-square” spin summations. The support for in-place spin sum-
mations is mostly contributed to the auxiliary scratchpad memory T and the temporal
locality of the algorithm. For instance, instead of directly computing the factorized spin
summation T ← (pi123+pi321+pi132)(A) followed by B ← (pi123+pi213)(T ), we change the
latter statement to A ← (pi123 + pi213)(T ). This is possible since all memory locations
in A that have already been used once and are not used again. Some spin summa-
tions of interest, however, do not require any scratchpad memory because they are not
factorizable (e.g. B ← (pi123 + pi321 + pi132)(A)). For these spin summations, we intro-
duce an additional preprocessing step—similar to the regularization step outlined in
Section 4.6.2—which copies the blocks of A into temporal blocks T . For instance, the
in-place spin summation A ← (pi123 + pi321 + pi132)(A) is computed in two stages: (1)
T ← pi123(A) followed by (2) A ← (pi123 + pi321 + pi132)(T ); the amount of additional
memory is very limited, because the size of these blocks is chosen such that they all fit
into the caches simultaneously.
4.6.6. Summary. Table I summarizes the properties of all previously mentioned algo-
rithms. While Algorithms 1–5 score positively on some properties and negatively on
others, the high-performance algorithm (see Section 4.6) is the only one that scores
positively across all properties.
5. MEASUREMENT ENVIRONMENT & BENCHMARK
We evaluate the performance of Algorithms 1–6 on a two-socket system consisting of
two Intel Xeon E5-2680 v3 CPUs (12 cores each). The compiler of choice is Intel’s icpc
16.0.2 using the -O3 -qopenmp -xHost options. Moreover, we use one thread per phys-
ACM Transactions on Mathematical Software, Vol. V, No. N, Article A, Publication date: 2017.
Spin Summations: A High-Performance Perspective A:15
TMP locality SP locality
Algo A B A B VEC LB FLOP MEM IP NTS
1 - + - + - + - + - -
2 + + - - - - - + - -
3 - + - + - + + - - -
4 + + - - - - + + - -
5 + + o o - + + + + -
6 + + + + + + + + + +
Table I: Summary over all features of each implementations. TMP locality, SP locality,
VEC, LB, FLOP, MEM, IP, NTS respectively stand for temporal locality, spatial local-
ity, vectorization, load balancing, reduced FLOP-count, reduced auxiliary memory re-
quirements, support for in-place summations, and support for non-temporal stores. ‘+’,
‘o’ and ‘-’ respectively reflect positively, neutrally or negatively on the corresponding
property.
ical core (i.e. 24 threads in total) and pin logically neighboring threads to physically
neighboring cores.9
Given a spin summation (see Eq. 4) summing over n permutations of an input ten-
sor A ∈ RNd , we report the bandwidth BW and the floating-point performance FP as
follows:
BW = 2×N
d×sizeof(double)
230×Time GiB/s, (12)
FP = 2×n×N
d
109×Time GFLOP/s. (13)
Notice that the bandwidth metric is especially conservative since we only account
for one read of the input tensor A, independent of the number of permutations n. The
floating-point performance, on the other hand, accounts for the effective FLOP-count
corresponding to the FLOPS performed by Algorithms 1 and 2.
To put our results into perspective, we respectively report the STREAM [McCalpin
1995] copy- (b← a), scale- (b← αa), add- (c← a+b) and triad-bandwidth (c← αa+b)
for the full system: 101.2 GiB/s, 101.3 GiB/s, 107.6 GiB/s and 107.7 GiB/s. More-
over, the total theoretical peak floating-point performance of our two-socket system is
1113.6GFLOP/s; MKL attains 845GFLOP/s for a large matrix-matrix multiplication.
All performance results are based on the minimum execution time over several runs
to get the most stable timings; the caches are cleared before every run (i.e. all mea-
surements start on cold data). All computations use double-precision floating-point
accuracy.
5.1. Benchmark
Table II lists all spin summations required for the non-orthogonally spin-adapted
CCSDT [Noga and Bartlett 1987] and CCSDTQ [Kucharski and Bartlett 1992] quan-
tum chemistry methods, as well as their approximate forms such as CCSDT(Q)
[Bomble et al. 2005]. We test the performance of our implementations with three dif-
ferent problem sizes: small, medium, and large respectively corresponding to tensors
roughly of size 70MiB, 320MiB, and 1200MiB. Unless otherwise mentioned, we present
results for the medium-sized problems.
ACM Transactions on Mathematical Software, Vol. V, No. N, Article A, Publication date: 2017.
A:16 Paul Springer et al.
ID Test Case
1 B ← (2− pi213)((2− pi321 − pi132)(A))
2 B ← (2− pi321 − pi132)(A)
3 B ← (2− pi213 − pi132)(A)
4 B ← (2− pi213 − pi321)(A)
5 B ← (2−pi2134)((2−pi3214−pi1324)((2−pi4231−pi1432−pi1243)(A)))
6 B ← (2− pi3214 − pi1324)((2− pi4231 − pi1432 − pi1243)(A))
7 B ← (2− pi2134 − pi1324)((2− pi4231 − pi1432 − pi1243)(A))
8 B ← (2− pi2134 − pi3214)((2− pi4231 − pi1432 − pi4231)(A))
9 B ← (2− pi4231 − pi1432)((2− pi3214 − pi1324 − pi4231)(A))
10 B ← (2− pi2134 − pi1432)((2− pi3214 − pi1324 − pi4231)(A))
11 B ← (2− pi2134 − pi4231)((2− pi3214 − pi1324 − pi1234)(A))
12 B ← (2− pi4231 − pi1243)((2− pi2134 − pi1324 − pi1432)(A))
13 B ← (2− pi3214 − pi1243)((2− pi2134 − pi1324 − pi1432)(A))
14 B ← (2− pi3214 − pi4231)((2− pi2134 − pi1324 − pi1432)(A))
15 B ← (2− pi1432 − pi1243)((2− pi2134 − pi3214 − pi4231)(A))
16 B ← (2− pi1432 − pi1243)((2− pi2134 − pi3214 − pi4231)(A))
17 B ← (2− pi1432 − pi1432)((2− pi2134 − pi3214 − pi4231)(A))
18 B ← (2− pi4231 − pi1432 − pi1243)(A)
19 B ← (2− pi3214 − pi1324 − pi1243)(A)
20 B ← (2− pi2134 − pi1324 − pi1432)(A)
21 B ← (2− pi2134 − pi3214 − pi4231)(A)
Table II: Spin summations. Depending on the context, “2” is short for 2pi123 or 2pi1234.
6. PERFORMANCE EVALUATION
Figure 4 summarizes the performance of the discussed algorithms across the set of
21 spin summations (see Table II). We make the following key observations: (1) Al-
gorithms 1–4 ( , , , ) show ambivalent performance for the different test cases
with no clear loser nor winner among them. For instance, Algorithm 1 performs well
on test cases 2, 3, 4, 19, and 20 but it greatly lags behind any of the Algorithms 2–4
for test cases 1, 5–17. (2) While Algorithm 3 ( ) is implemented in terms of Algo-
rithm 1 ( ), they exhibit significantly different performance results for all test cases
for which the FLOP-count can be reduced (i.e. 1, 5–17). For instance, the difference
between the attained bandwidth of Algorithms 1 and 3 is largest for test case 5; this
matches our expectations since this test case corresponds to the spin summation for
which the FLOP-count can be reduced the most. (3) Algorithm 5 ( ) is consistently
faster than any of its predecessors; its superior performance can be contributed mostly
to its improved temporal locality. (4) Algorithm 6 ( ), with all its optimizations (e.g. vec-
torization, regularization), yields a significant improvement over its blocked, but non-
vectorized, predecessor ( ) across all spin summations. (5) While test cases 6–17 re-
quire the same amount of FLOPS and memory accesses, they attain noticeably differ-
ent bandwidth. This is a similar observation that we have already made in a previous
publication [Springer et al. 2016], indicating that some transpositions are inherently
more difficult than others. For instance, test cases 4, 8, 11, 14–17, and 21 require an ad-
ditional regularization step (see Section 4.6.1) to enable vectorization. (6) The attained
effective floating-point performance (see Fig. 4b) greatly varies across the benchmark,
reflecting that some spin summations benefit from the FLOP-count optimization more
9The thread affinity is set via the environment variable KMP AFFINITY = compact,1.
ACM Transactions on Mathematical Software, Vol. V, No. N, Article A, Publication date: 2017.
Spin Summations: A High-Performance Perspective A:17
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
#test case
0
10
20
30
40
50
G
iB
/s
3D 4D
Algo1 Algo2 Algo3 Algo4 Algo5 Algo6
(a) Bandwidth.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
#test case
0
50
100
150
200
250
300
350
400
G
FL
O
P
S
/s
3D 4D
Algo1 Algo2 Algo3 Algo4 Algo5 Algo6
(b) Effective floating-point performance.
Fig. 4: Performance of all algorithms across the benchmark.
than others. More precisely, the theoretical FLOP-count reduction for test cases 1, 2–4,
5, 6–17, and 18–21 respectively is 2×, 1×, 4×, 2×, and 1×.
6.1. Optimizations
Figure 5 highlights the effect of various optimizations (see Sections 4.6.3–4.6.5) on
the performance of Algorithm 6. The reference for the speedups is Algorithm 6 with
all optimizations deactivated; similarly, the measurements labeled NTS, cache-opt ( )
correspond to the results in Fig. 4.
It is evident from Fig. 5 that: First, the cache optimization ( , see Section 4.6.4) only
has an effect on the performance for 4D spin summations; this is expected because 3D
spin summations only require a small portion of the available last level cache, making
cache conflicts less likely. Moreover, the positive effect of the cache optimization is
evident both in the presence or absence of non-temporal stores. Second, activating
non-temporal stores ( , see Section 4.6.3) is critical for performance. The speedup due
to non-temporal stores, in some case, can be as high as 1.5×; such a speedup is perfectly
in line with the fact that the amount of data transfered from the caches is reduced by
ACM Transactions on Mathematical Software, Vol. V, No. N, Article A, Publication date: 2017.
A:18 Paul Springer et al.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
#test case
0.8
1.0
1.2
1.4
1.6
1.8
2.0
S
p
e
e
d
u
p
3D 4D
cache-opt NTS NTS, cache-opt NTS, cache-opt, in-place
Fig. 5: Impact of various optimization techniques on the performance of Algorithm 6.
Reference: Algorithm 6 without any optimizations. cache-optimization; non-
temporal stores; non-temporal stores and cache-optimization; all optimizations and
in-place.
33% (see Section 4.6.3).10 Finally, in-place spin summations ( , see Section 4.6.5) do
not only require less memory, they also increase the performance over its out-of-place
counter part ( ) by up to 1.2×.
6.2. Summary
Figure 6 summarizes the performance of Algorithm 6 for different problem sizes. Since
Algorithm 2 constitutes the current implementation of spin summation in the NCC
quantum chemistry software package [Matthews and Stanton 2015], Fig. 6b reports
the speedup of Algorithm 6 with respect to Algorithm 2.
The average speedups for the small ( ), medium ( ), and large ( ) problems respec-
tively vary between 2.4− 4.3×, 2.4− 3.8×, and 3.3− 5.5×. Thus, our high-performance
implementation substantially outperforms the reference for all spin summations and
tested problem sizes. While the speedups for the large problems are the highest, we
point out that these speedups are caused by the combination of (1) slightly increased
performance of Algorithm 6 (see Fig. 6a), and (2) decreased performance of Algorithm 2
(i.e. the reference).
7. CONCLUSION
We tackled the problem of making the computation of 3D and 4D spin summations
as efficient as possible. We presented a systematic way to restructure the memory ac-
cesses so that both the temporal and spatial locality, inherent to the operation, are
exploited; the resulting algorithm also takes advantage of the factorizablity of spin
summations, thus significantly reducing the required FLOP-count. The lesson learned
is that it is not one optimization in isolation that makes the difference, but rather
the combination of all of them. For instance, the main techniques introduced in Sec-
tion 4.1–4.4 (e.g. spatial locality, temporal locality, and reduced FLOP-count) show
ambiguous performance results if applied in isolation. However, the integration of all
these ideas into Algorithm 6, yields noticeable speedups, between 2.4× and 5.5× over
the current reference implementation. Furthermore, our algorithm allows for spin
summations to be performed in-place; this desirable property not only reduces the
10The maximum theoretical speedup of 1.5× assumes that the accessed data to both tensors is entirely
cached and that the write-allocate traffic is avoided.
ACM Transactions on Mathematical Software, Vol. V, No. N, Article A, Publication date: 2017.
Spin Summations: A High-Performance Perspective A:19
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
#test case
0
10
20
30
40
50
G
iB
/s
small medium large
(a) Bandwidth.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
#test case
0
1
2
3
4
5
6
S
p
e
e
d
u
p
small medium large
(b) Speedup over Algorithm 2.
Fig. 6: Performance of Algorithm 6 for three different problem sizes: small (70MiB),
medium (320MiB), large (1200MiB).
memory footprint by a factor of two, but also reflects positively on the overall perfor-
mance.
We plan to incorporate our high-performance implementation of spin summation
based on Algorithm 6 into the NCC quantum chemistry software package. In concert
with other ongoing work in optimizing tensor contraction [Matthews 2016; Springer
and Bientinesi 2016a], the speedups achieved in this work should allow for large-scale
CCSDT, CCSDTQ, and CCSDT(Q) calculations to run at near-peak efficiency on mod-
ern multi- and many-core systems. This is exciting, given that previous experience has
been that these calculations only achieved a small fraction of peak performance.
As already mentioned, the sums of tensor transpositions appears in other operations
beyond spin summations. We believe the techniques detailed in this work are also
applicable for tensor unpacking, and for the anti-symmetrization operations required
in open-shell calculations using an unrestricted Hartree-Fock reference.
ACKNOWLEDGMENT
Financial support from the Deutsche Forschungsgemeinschaft (DFG) through grant GSC 111, and from the
Intel Corporation through Parallel Computing Center grants to RWTH Aachen and UT Austin is gratefully
acknowledged. Devin A. Matthews is an Arnold O. Beckman Postdoctoral Fellow and gratefully acknowl-
ACM Transactions on Mathematical Software, Vol. V, No. N, Article A, Publication date: 2017.
A:20 Paul Springer et al.
edges support from the Arnold and Mabel Beckman Foundation. Furthermore, we thank our colleagues in
the HPAC research group for many fruitful discussions and valuable feedback.
REFERENCES
B. G. Adams and J. Paldus. 1979. Orthogonally spin-adapted coupled-cluster theory for closed-shel systems
including triexcited clusters. Phys. Rev. A 20, 1 (1979), 1.
Roberto Ansaloni, Gian Luigi Bendazzoli, Stefano Evangelisti, and Elda Rossi. 2000. A
parallel Full-CI algorithm. Computer Physics Communications 128, 1 (2000), 496–515.
DOI:http://dx.doi.org/10.1016/S0010-4655(99)00542-1
Rodney J. Bartlett and Monika Musiał. 2007. Coupled-cluster theory in quantum chemistry. Reviews in
Modern Physics 79, 1 (2007), 291–352.
G. Baumgartner, A. Auer, D. E. Bernholdt, A. Bibireata, V. Choppella, D. Cociorva, Xiaoyang Gao, R. J.
Harrison, S. Hirata, S. Krishnamoorthy, S. Krishnan, Chi-chung Lam, Qingda Lu, M. Nooijen, R. M.
Pitzer, J. Ramanujam, P. Sadayappan, and A. Sibiryakov. 2005. Synthesis of High-Performance Paral-
lel Programs for a Class of ab Initio Quantum Chemistry Models. Proc. IEEE 93, 2 (2005), 276–292.
DOI:http://dx.doi.org/10.1109/JPROC.2004.840311
Yannick J. Bomble, John F. Stanton, Miha´ly Ka´llay, and Ju¨rgen Gauss. 2005. Coupled-cluster methods
including noniterative corrections for quadruple excitations. Journal of Chemical Physics 123 (Aug.
2005), 4101. DOI:http://dx.doi.org/10.1063/1.1950567
Randal Bryant, O’Hallaron David Richard, and O’Hallaron David Richard. 2003. Computer systems: a pro-
grammer’s perspective. Vol. 2. Prentice Hall Upper Saddle River.
RJ Buenker and S Krebs. 1999. The configuration-driven approach for multireference configuration interac-
tion calculations. Recent advances in multireference methods (1999), 1–29.
Bryan Catanzaro, Alexander Keller, and Michael Garland. 2014. A decomposition for in-place matrix trans-
position. ACM SIGPLAN Notices 49, 8 (2014), 193–206.
Siddhartha Chatterjee and S. Sen. 2000. Cache-efficient matrix transposition. (2000), 195–205.
DOI:http://dx.doi.org/10.1109/HPCA.2000.824350
J. Cˇı´zˇek. 1966. On the Correlation Problem in Atomic and Molecular Systems. Calculation of Wavefunction
Components in Ursell-Type Expansion Using Quantum-Field Theoretical Methods. J. Chem. Phys. 45,
11 (1966), 4256–4266. DOI:http://dx.doi.org/doi:10.1063/1.1727484
Kaushik Datta, Mark Murphy, Vasily Volkov, Samuel Williams, Jonathan Carter, Leonid Oliker, David Pat-
terson, John Shalf, and Katherine Yelick. 2008. Stencil Computation Optimization and Auto-tuning
on State-of-the-art Multicore Architectures. In Proceedings of the 2008 ACM/IEEE Conference on Su-
percomputing (SC ’08). IEEE Press, Piscataway, NJ, USA, 4:1–4:12. http://dl.acm.org/citation.cfm?id=
1413370.1413375
C. H. Q. Ding. 2001. An optimal index reshuffle algorithm for multidimensional arrays and
its applications for parallel architectures. IEEE T. Parall. Distr. 12, 3 (2001), 306–315.
DOI:http://dx.doi.org/10.1109/71.914776
Zhengting Gan, Yuri Alexeev, Mark S. Gordon, and Ricky A. Kendall. 2003. The parallel implementation
of a full configuration interaction program. The Journal of Chemical Physics 119, 1 (2003), 47–59.
DOI:http://dx.doi.org/10.1063/1.1575193
Geoffrey C Goldbogen. 1981. PRIM: A fast matrix transpose method. IEEE Trans. Software Eng. 7, 2 (1981),
255–257.
Kazushige Goto and Robert A Geijn. 2008. Anatomy of high-performance matrix multiplication. ACM Trans-
actions on Mathematical Software (TOMS) 34, 3 (2008), 12.
John A Gunnels, Greg M Henry, and Robert A Van De Geijn. 2001. A family of high-performance matrix
multiplication algorithms. In Computational Science–ICCS 2001. Springer, 51–60.
M. Hanrath and A. Engels-Putzka. 2010. An efficient matrix-matrix multiplication based antisymmet-
ric tensor contraction engine for general order coupled cluster. J. Chem. Phys. 133, 6 (2010), 064108.
DOI:http://dx.doi.org/10.1063/1.3467878
A. Hartono, Q. Lu, T. Henretty, S. Krishnamoorthy, H. Zhang, G. Baumgartner, D. E. Bernholdt, M. Nooijen,
R. Pitzer, J. Ramanujam, and P. Sadayappan. 2009. Performance optimization of tensor contraction
expressions for many-body methods in quantum chemistry. J. Phys. Chem. A 113, 45 (2009), 12715–
12723. DOI:http://dx.doi.org/10.1021/jp9051215
Y. He and C H. Q. Ding. 2002. MPI and OpenMP paradigms on cluster of SMP architectures: The vacancy
tracking algorithm for multi-dimensional array transposition. In Supercomputing, ACM/IEEE 2002
Conference. 6–6. DOI:http://dx.doi.org/10.1109/SC.2002.10065
ACM Transactions on Mathematical Software, Vol. V, No. N, Article A, Publication date: 2017.
Spin Summations: A High-Performance Perspective A:21
T. Helgaker, P. Jørgensen, and J. Olsen. 2013. Molecular Electronic-Structure Theory (1 edition ed.). Wiley,
Chichester ; New York.
Intel Corporation. 2015. Intel 64 and IA-32 architectures optimization reference man-
ual. (2015). http://www.intel.com/content/dam/www/public/us/en/documents/manuals/
64-ia-32-architectures-optimization-manual.pdf
Wanyi Jiang, Yuriy G. Khait, and Mark R. Hoffmann. 2009. Configuration-Driven Unitary Group Approach
for Generalized Van Vleck Variant Multireference Perturbation Theory. J. Phys. Chem. A 113, 16 (2009),
4374–4380. DOI:http://dx.doi.org/10.1021/jp811082p
Jose L Jodra, Ibai Gurrutxaga, and Javier Muguerza. 2015. Efficient 3D Transpositions in Graphics Pro-
cessing Units. International Journal of Parallel Programming (2015), 1–16.
Michael Klene and Michael A. Robb. 2000. Parallel implementation of the CI-vector evalu-
ation in full CI/CAS-SCF. The Journal of Chemical Physics 113, 14 (2000), 5653–5665.
DOI:http://dx.doi.org/10.1063/1.1290014
Stanislaw A Kucharski and Rodney J Bartlett. 1992. The coupled-cluster single, double, triple, and quadru-
ple excitation method. The Journal of chemical physics 97, 6 (1992), 4282–4288.
Pai-Wei Lai, Huaijian Zhang, Samyam Rajbhandari, Edward Valeev, Karol Kowalski, and
P. Sadayappan. 2012. Effective Utilization of Tensor Symmetry in Operation Optimiza-
tion of Tensor Contraction Expressions. Procedia Computer Science 9 (2012), 412–421.
DOI:http://dx.doi.org/10.1016/j.procs.2012.04.044
Qingda Lu, Sriram Krishnamoorthy, and P Sadayappan. 2006. Combining analytical and empirical ap-
proaches in tuning matrix transposition. In Proceedings of the 15th international conference on Parallel
architectures and compilation techniques. ACM, 233–242.
Dmitry I Lyakh. 2015. An efficient tensor transpose algorithm for multicore CPU, Intel Xeon Phi, and NVidia
Tesla GPU. Computer Physics Communications 189 (2015), 84–91.
Gabriel Mateescu, Gregory H Bauer, and Robert A Fiedler. 2012. Optimizing matrix transposes using a
POWER7 cache model and explicit prefetching. ACM SIGMETRICS Performance Evaluation Review
40, 2 (2012), 68–73.
Devin A. Matthews. 2016. High-Performance Tensor Contraction without BLAS. CoRR abs/1607.00291
(2016). http://arxiv.org/abs/1607.00291
Devin A. Matthews, Ju¨rgen Gauss, and John F. Stanton. 2013. Revisitation of Nonorthogonal
Spin Adaptation in Coupled Cluster Theory. J. Chem. Theory Comput. 9, 6 (2013), 2567–2572.
DOI:http://dx.doi.org/10.1021/ct301024v
Devin A. Matthews and John F. Stanton. 2015. Non-orthogonal spin-adaptation of coupled cluster methods:
A new implementation of methods including quadruple excitations. The Journal of Chemical Physics
142, 6 (Feb. 2015), 064108. DOI:http://dx.doi.org/10.1063/1.4907278
John McCalpin and Mark Smotherman. 1995. Automatic benchmark generation for cache optimization of
matrix operations. In Proceedings of the 33rd annual on Southeast regional conference. ACM, 195–204.
John D. McCalpin. 1995. Memory Bandwidth and Machine Balance in Current High Performance Comput-
ers. IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter (Dec.
1995), 19–25.
A. Nguyen, N. Satish, J. Chhugani, C. Kim, and P. Dubey. 2010. 3.5-D Blocking Optimization for Stencil
Computations on Modern CPUs and GPUs. In 2010 ACM/IEEE International Conference for High Per-
formance Computing, Networking, Storage and Analysis. 1–13. DOI:http://dx.doi.org/10.1109/SC.2010.2
Jozef Noga and Rodney J. Bartlett. 1987. The full CCSDT model for molecular electronic structure. The
Journal of Chemical Physics 86, 12 (June 1987), 7041–7050. DOI:http://dx.doi.org/10.1063/1.452353
David A Patterson and John L Hennessy. 2007. Computer organization and design. Morgan Kaufmann
(2007), 474–476.
V. R. Saunders and J. H. van Lenthe. 1983. The direct CI method. Molecular Physics 48, 5 (1983), 923–954.
DOI:http://dx.doi.org/10.1080/00268978300100661
Paul Springer and Paolo Bientinesi. 2016a. Design of a high-performance GEMM-like Tensor-Tensor Multi-
plication. CoRR (2016). http://arxiv.org/abs/1607.00145
Paul Springer and Paolo Bientinesi. 2016b. TTC: A Tensor Transposition Compiler for Multiple Architec-
tures. Array 16 (2016).
Paul Springer, Jeff R. Hammond, and Paolo Bientinesi. 2016. TTC: TTC: A high-performance Compiler for
Tensor Transpositions. CoRR (2016). http://arxiv.org/abs/1603.02297
Paul Springer, Tong Su, and Paolo Bientinesi. 2017. HPTT: A High-Performance Tensor Transposition C++
Library. CoRR (2017). https://arxiv.org/abs/1704.04374
ACM Transactions on Mathematical Software, Vol. V, No. N, Article A, Publication date: 2017.
A:22 Paul Springer et al.
J. F. Stanton, J. Gauss, M. E. Harding, P. G. Szalay, A. A. Auer, R. J. Bartlett, U. Benedikt, C. Berger,
D. E. Bernholdt, Y. J. Bomble, L. Cheng, O. Christiansen, M. Heckert, O. Heun, C. Huber, T.-C. Jagau,
D. Jonsson, J. Jusełius, K. Klein, W. J. Lauderdale, D. A. Matthews, T. Metzroth, L. A. Mu¨ck, D. P.
O’Neill, D. R. Price, E. Prochnow, C. Puzzarini, K. Ruud, F. Schiffmann, W. Schwalbach, S. Stopkowicz,
A. Tajti, J. Va´zquez, F. Wang, and J. D. Watts. 2014. CFOUR, Coupled-Cluster techniques for Com-
putational Chemistry, a quantum-chemical program package with the integral packages MOLECULE
(J. Almlo¨f and P. R. Taylor), PROPS (P. R. Taylor), ABACUS (T. Helgaker, H. J. Jensen, P. Jørgensen
and J. Olsen), and ECP routines by A. V. Mitin and C. van Wu¨llen. (2014). For the current version, see
http://www.cfour.de.
Marin van Heel. 1991. A fast algorithm for transposing large multidimensional image data sets. Ultrami-
croscopy 38, 1 (1991), 75–83.
Field G. Van Zee and Robert A. van de Geijn. 2015. BLIS: A Framework for Rapidly Instantiating BLAS
Functionality. ACM Trans. Math. Software 41, 3 (2015), 14:1–14:33. http://doi.acm.org/10.1145/2764454
Andrey Vladimirov. 2013. Multithreaded transposition of square matrices with common code
for Intel Xeon processors and Intel Xeon Phi coprocessors. (2013). https://www.researchgate.
net/profile/Andrey Vladimirov/publication/258048110 Multithreaded Transposition of Square
Matrices with Common Code for Intel Xeon Processors and Intel Xeon Phi Coprocessors/links/
00463526c2fa40a1f3000000.pdf
Lai Wei and John Mellor-Crummey. 2014. Autotuning Tensor Transposition. In Parallel & Distributed Pro-
cessing Symposium Workshops (IPDPSW), IEEE International. IEEE, 342–351.
ACM Transactions on Mathematical Software, Vol. V, No. N, Article A, Publication date: 2017.
