Quaternion symmetry is ubiquitous in the physical sciences. As such, much work has been afforded over the years to the development of efficient schemes to exploit this symmetry using real and complex linear algebra. Recent years have also seen many advances in the formal theoretical development of explicitly quaternion linear algebra with promising applications in image processing and machine learning. Despite these advances, there do not currently exist optimized software implementations of quaternion linear algebra. The leverage of optimized linear algebra software is crucial in the achievement of high levels of performance on modern computing architectures, and thus provides a central tool in the development of high-performance scientific software. In this work, a case will be made for the efficacy of high-performance quaternion linear algebra software for appropriate problems. In this pursuit, an optimized software implementation of quaternion matrix multiplication will be presented and will be shown to outperform a vendor tuned implementation for the analogous complex matrix operation. The results of this work pave the path for further development of high-performance quaternion linear algebra software which will improve the performance of the next generation of applicable scientific applications.
INTRODUCTION
In the ever evolving ecosystem of high-performance computing (HPC), the full exploitation of contemporary computational resources must constitute a central research effort in computationally intensive fields such as scientific computing. However, it is often the case that simply applying conventional algorithms and data structures to existing problems will yield sub-optimal time-tosolution and resource management on modern HPC systems. By exploiting the symmetry of a particular problem and explicitly considering the structure and resources of these computational architectures, one can often develop more optimal computational research pathways. Such development has the potential to enable routine inquiry and simulation of systems which were inaccessible or impractical by existing computational methods.
In this work, we consider the computational benefits of exploiting the scalar and linear algebras generated by the quaternion numbers. The quaternions, also known as Hamilton's quaternions, are a hyper-complex number system which extends the complex numbers and are isomorphic with the special unitary group, SU(2) [Hamilton 1866] . Perhaps the most notable feature of the quaternion numbers is the loss of scalar commutivity under multiplication. As such, the algebra which is generated by the quaternions is peculiar in that it more closely resembles that of matrix algebra than that of its real or complex counterpart. Since their inception, the quaternions have seen extensive application both in pure [Cayley 1845; Chevalley 1996; Frobenius 1878; Hopf 1931; Hurwitz 1922 ] and applied mathematics [Arnol'd 1995; Deavours 1973; Grubin 1970; Mocoroa et al. 2006; Sudbery 1979] . Historically, the quaternions have been applied most successfully in the treatment of rigid body mechanics due to their relationship with SU(2) (and thus the group of spacial rotations, SO(3)) [Grubin 1970; Mocoroa et al. 2006] . This application has been widespread in the field of computer graphics to accelerate video animation via algorithms such as Slerp [Shoemake 1985] . From a computational perspective, quaternion arithmetic offers an attractive alternative to complex arithmetic in that it admits a higher arithmetic intensity and smaller memory footprint than that of the the complex matrix algebra it represents. A demonstration of this state of affairs will be given in the body of this work.
In the context of linear algebra, quaternions and quaternion linear algebra naturally manifest in many scientific applications such as quantum chemistry and nuclear physics [Armbruster 2017; Ekström et al. 2006; Henriksson et al. 2005; Konecny et al. 2018; Nakano et al. 2017; Peng et al. 2009; Saue et al. 1997; Saue and Helgaker 2002; Visscher and Saue 2000] . Traditionally, quantum mechanics and quantum field theories have been formulated in terms of the complex numbers. However, since the earliest days of the development of quantum mechanics, it has been known that the spinor nature of the fermionic wave function (i.e. electrons, neutrinos, etc) admits a quaternion representation due to its relationship with SU(2) [Adler 1995; Horwitz and Biedenharn 1984] . Typically, this representation manifests as a result of time-reversal being a global symmetry of the Hamiltonian [Dongarra et al. 1984; Saue et al. 1997; Stuber and Paldus 2003] . As such, the quaternion algebra has been exploited in the development of efficient real [Dongarra et al. 1984] and complex [Shiozaki 2017 ] eigensolvers for time-reversal symmetric Hamiltonians. Despite the success and power of these methods, their exploitation of the quaternion algebra is implicit in that the final computer implementation of these methods is done in either real or complex matrix arithmetic; thus they cannot leverage the full computational potential of the quaternion arithmetic. In this work, a case will be made for explicit exploitation of quaternion arithmetic in high-performance software.
In general, high-performance methods for scientific applications rely on highly tuned numerical linear algebra software libraries which implement the BLAS [Blackford et al. 2002; Dongarra et al. 1990 Dongarra et al. , 1988 Lawson et al. 1979] and LAPACK [Anderson et al. 1999 ] standards for performance on modern HPC systems. Historically, numerical linear algebra has been the archetypal example and motivation for the careful consideration of a computer's architecture in the development of high-performance software [Agarwal et al. 1994; Goto and van de Geijn 2008; Gunnels et al. 2001; Whaley and Dongarra 1998; Whaley et al. 2001] . It was realized early on that straightforward implementations of operations such as matrix multiplication will yield sub-optimal performance results and that achieving peak performance on modern architectures requires a drastic departure from conventional implementations. We refer the reader to the work of [Goto and van de Geijn 2008] for a reasonably contemporary discussion on the optimization of BLAS functions, specifically matrix-matrix multiplication, on modern architectures. BLAS and LAPACK optimization still constitutes a major research effort in the field of numerical linear algebra, and this has led to a number of different approaches which are available in open source Van Zee and Smith 2017; Van Zee et al. 2016; Van Zee and van de Geijn 2015; Wang et al. 2013; Whaley and Dongarra 1998; Xianyi et al. 2012] and vendor tuned software such as the Intel Math Kernel Library (MKL) and the IBM Engineering Scientific Subroutine Library (ESSL).
We note that over the years, there have been many important theoretical developments in the field of quaternion linear algebra [Rodman 2014; Zhang 1997] . Many of necessary algorithmic building blocks for ubiquitous operations such as eigenvalue and singular value decompositions, and matrix factorizations have been developed [Baker 1999; Bunse-Gerstner et al. 1989; Janovská and Opfer 2003; Jia et al. 2018; Li et al. 2019; Loring 2012; Zhang 1997] . Such explorations have been key in the development of recent methods for efficient signal and image processing by exploiting the quaternion algebra [Ell et al. 2014; Le Bihan and Mars 2004; Le Bihan and Sangwine 2003; Zeng et al. 2016] . Despite these successes, the field of quaternion linear algebra is still in its infancy in terms of software adoption by scientists and engineers. This is primarily due to the fact that, relative to its complex counterpart, there do not currently exist highly optimized software implementations of quaternion linear algebra. In order for quaternion linear algebra to become a viable alternative to complex linear algebra, such software must be developed. In this work, it will be demonstrated that optimized implementations of quaternion linear algebra operations hold the potential for leveraging drastic performance improvements relative to analogous complex operations in problems which they may be applied.
A key building block of optimized linear algebra algorithms is that of an optimized implementation of matrix multiplication, which we will refer to as GEMM in this work. Having an optimized GEMM implementation is a necessary (but not necessarily sufficient) condition for optimization more involved linear algebra operations such as eigenvalue decomposition and matrix factorization. As such, this work will focus on the performance and implementation of a quaternion GEMM to demonstrate the efficacy of high-performance quaternion linear algebra.
The remainder of this work will be organized as follows. Section 2 will review the necessary theory for the development of quaternion linear algebra and how one might leverage this algebra as an alternative to complex linear algebra for a special class of complex matrices. Section 3 will briefly review the nature and abstract structure of high-performance GEMM implementations. Section 4 details an implementation of a high-performance GEMM operation using explicitly quaternion arithmetic on a contemporary computing architecture. Finally, performance results for the implementation quaternion GEMM relative to a vendor tuned complex implementation will be presented in Sec. 5, and Sec. 6 will provide an examination of the future research which will be enabled as a result of our findings.
Notation
Throughout the remainder of this work, we will adopt the following notation conventions:
(1) Scalars of a ring F will be denoted with lower case letters a ∈ F.
(2) M M, N (F) will denote the ring of M-by-N matrices over F. Further, M N (F) ≡ M N , N (F) for brevity. Matrices will be denoted with capital letters, A ∈ M M, N (F). (3) For A ∈ M M, N (F), A will denote the conjugate of A, A T its transpose, and A * = A T will denote its conjugate transpose. For a ∈ F, we note that a ≡ a * . (4) For N = 1, we denote the set of vectors over a ring as V M (F), and denote its elements as lower-case letters,
represents the sub-matrix consisting of the µ 1 -st to µ 2 -nd rows of A. If µ 2 − µ 1 is to be understood from the context, we use the abbreviated notation
represents the sub-matrix consisting of the ν 1 -st to ν 2 -nd columns of A. If ν 2 − ν 1 is to be understood from the context, we use the abbreviated notation
The limiting cases of single row or column vectors will be denoted ì a (µ) ≡ A(µ, :) and ì a (ν ) ≡ A(:, ν ). (8) If the indices µ 3 and µ 4 are to be understood from the context, we denote the sub-matrix of a sub-matrix as A (µ 1 , µ 3 ) = A (µ 1 ) ([µ 3 , µ 4 ], :), and so on for row sub-matrices.
THE QUATERNION ALGEBRA

Scalar Operations
Fundamental to the development of quaternion linear algebra is the nature of the scalar algebra generated by the quaternions. In this work, the set of quaternion numbers will be denoted H and is defined as the set of all q that [Hamilton 1866]
where q 0 is referred to as the scalar component of q and {q 1 , q 2 , q 3 } is referred to as its vector component. {e 0 , e 1 , e 2 , e 3 } are the quaternion basis elements and they generate the skew-symmetric algebra defined by,
e 0 e j = e j e 0 = e j , j ∈ {0, 1, 2, 3},
where δ is the Kronecker delta and ε is the totally anti-symmetric Levi-Civita tensor. As such, the following must hold true e i e j = −e j e i , i, j ∈ {1, 2, 3}, i j, (3a) e 1 e 2 e 3 = −e 0 .
Given Eq. (2), the product of quaternion scalars p, q ∈ H is given by the Hamilton product
and is thus non-commutative
There are a number remarkable results that arise from the algebra defined by Eqs.
(1) and (2) which are important to the construction of quaternion linear algebra. The first is that H constitutes a normed division algebra, and is in fact the largest normed division algebra for which multiplication is associative [Frobenius 1878] . As such, we may define a quaternion norm and inverse for every nonzero element of H,
High-Performance Quaternion Matrix Multiplication 0:5 where we have defined the quaternion conjugate
We note here that quaternions of unit norm (∥q∥ = 1) are known in the literature as versors [Hamilton 1866]. In examining Eqs. (6) and (7), the expressions for norm, inverse and conjugate closely resemble those of the complex numbers, C. In fact, C is a sub-algebra embedded in H, and this relationship is crucial for the development of the relationship between complex and quaternion linear algebra. To examine the relationship between C and H, we introduce a common, simplified notation
where
Consider the subset C ⊂ H defined by
Note that q 0 , q 1 ∈ C. The algebra defined by C is exactly that of C (this may be easily verified through expansion of Eq. (4)). Thus, C C via the map
with q ∈ C and z ∈ C. It is important to note here that Eq. (11) does not imply e 0 = 1 nor e 1 = i, simply that there exists a bijection between C and C. Keeping this in mind, however, it will typically be the case that one can use them interchangeably without ambiguity. As such, whether scalars of the form given in Eq. (10) are treated as complex or quaternion scalars should will be implied from their context in the following discussion. While one tends to describe quaternions in terms of scalars (and rightly so), the algebra which they generate more closely that of a matrix algebra, specifically that of a Lie group [Hall 2015] . In the development of quaternion linear algebra, it is instructive to examine the isomorphism between the versors and the special unitary group SU(2) through the mapping of basis elements
where the Pauli matrices are given as
By expanding in terms of the Pauli basis, it may be demonstrated that the algebra of H is isomorphic to the algebra generated by ⟨SU(2)⟩ ⊂ M 2 (C) via the map
Here we have denoted the complex matrix representation of a quaternion scalar with a subscript C. While the representation given in Eq. (14) may seem inconsequential, it demonstrates the great potential for improving computational performance by exploiting quaternion arithmetic. As their 
Although the result of both the quaternion and complex arithmetic may be thought of as to represent the same mathematical object (up to an isomorphism), the computational work required for these operations is different for the two arithmetics, respectively. Table 1 summarizes the number of real floating point operations (FLOPs) required for the operations given in Eq. (15). In this work we adopt the convention that the operation
constitutes a single FLOP. From Tab. 1, we can see that there is a 2x reduction in FLOPs for Harithmetic over generic M 2 (C)-arithmetic for the same mathematical operation. We may further note that there is also a 2x reduction in memory operations (MOPs) and computational storage requirements between H and M 2 (C) data structures. This fact will prove important in the following developments of high-performance quaternion linear algebra. As H forms a normed, associative division algebra, it is natural to consider quaternion linear algebra, i.e. the algebra generated by matrices and vectors of quaternion elements, as an extension of the discussion presented in this subsection. In the following subsection, we briefly review the relevant theory to motivate the usage of explicitly quaternion linear algebra for software implementation.
Matrices of Quaternions and Quaternion Linear Algebra
Consider the space of M × N matrices with quaternion elements denoted M M, N (H) and given by the generic element Q ∈ M M, N (H) [Zhang 1997 ],
Note the similarity of Eq. (17) with Eq. (1). In analogy with M M, N (C), we may define conjugate and conjugate transpose operations on
In the same manner as M N (C), we define quaternion hermiticity as Q = Q * . Further, we may define a scalar and matrix product operations for Zhang 1997 ].
However, unlike real and complex matrices, the loss of scalar commutivity in H dictates that we must also consider operations of the form
where, in general, qQ Qq and qPQ PqQ PQq. In addition, generally PQ QP, however this is in perfect analogy with real and complex linear algebra. As one may intuitively guess, this loss of scalar commutivity greatly complicates proofs and algorithm development in quaternion linear algebra [Rodman 2014; Zhang 1997] , often requiring researchers to resort to rather complex and abstract mathematical paradigms, such as algebraic topology [Baker 1999; Zhang 1997] , to obtain the desired outcomes. Despite these complications, it is possible to extend operations which are important to scientific application, such as matrix inversion [Loring 2012; Zhang 1997] and eigenvalue decomposition [Baker 1999; Bunse-Gerstner et al. 1989; Jia et al. 2018; Li et al. 2019; Zhang 1997] , to M N (H). In particular, the set of all invertable quaternion matrices forms a group under the matrix product [Zhang 1997 ].
Just as H admits a close relationship with C and M 2 (C), analogous relationships may be developed
We may define an analogous expression to Eq.
To construct its relationship to M 2M,2N (C), we examine the M 2 (C) representation of a quaternion matrix element,
Thus Eq. (24) may be written in terms of a Kronecker product:
where we have denoted the complex matrix representation of the quaternion matrix with a subscript C in analogy with Eq. (14). In analogy to Eq. (15), the isomorphism between M M, N (H) and M 2M,2N (C) admits the following relationships
As in Eq. (15), the amount of computational work required to perform the operations in Eq. (26) in quaternion and complex arithmetic are different. As an extension of Tab. 1, Tab. 2 summarizes 
differences in the the number of FLOPs required for the same algebraic operation in the two arithmetics, respectively. For simplicity and brevity, the summary in Tab. 2 only accounts for M N (H) and M 2N (C), though completely analogous results hold for the general rectangular case. Just as in Tab. 1, there is a 2x reduction in both FLOPs and MOPs in utilizing explicitly quaternion arithmetic and data structures over the analogous complex operations. However, this comparison is in terms of a ratio of computational work requirements. In terms of raw differences between the two arithmetics, the potential computational savings scale to some power of the dimension the matrix in question. For example, the difference in the number of FLOPs required for quaternion / complex matrix multiplication is 16N 3 . For small N this would not make a drastic difference, but for large N , this difference becomes significant. Due to the fundamental and central importance of matrix multiplication in numerical linear algebra, similar comparisons could be made for any matrix operation, such as eigenvalue decomposition or matrix factorization, between complex and quaternion arithmetic.
HIGH-PERFORMANCE MATRIX-MULTIPLICATION
The cornerstone of high-performance numerical linear algebra software is the optimized implementation of general matrix-matrix multiplication (GEMM). Without an optimized GEMM implementation, operations such as eigenvalue decomposition and matrix inversion become impractical for large matrices. Thus, the first step in the development of high-performance quaternion linear algebra is the development of an optimized quaternion GEMM. Over the past several decades, an enormous amount of research effort in the fields of numerical linear algebra and HPC has been directed towards the development of optimized GEMM operations on various computing architectures. As a result, many different strategies have been developed for high-performance GEMM implementations [Goto and van de Geijn 2008; Gunnels et al. 2001; Van Zee and van de Geijn 2015; Wang et al. 2013; Whaley and Dongarra 1998; Xianyi et al. 2012] . Despite their differences, the common motif among these methods is the rejection of a "one-size-fits-all" development strategy for all computing platforms, i.e. one must explicitly consider and optimize for the underlying features of the computer architecture in question to reach optimal performance. In modern HPC, there are effectively three fundamental aspects of computing architectures which must be considered in the development of optimized GEMM operations [Goto and van de Geijn 2008] :
(1) Efficient and effective utilization of various levels of the computational data and instruction caches. (2) Utilization of microarchitechture specific features such as single instruction multiple data (SIMD) and fused multiply-add (FMA) operations. (3) Achieving efficient parallelism on modern multi-core and many-core computing architectures. To demonstrate the efficacy of quaternion GEMM, we will only consider the former two of these features; leaving the treatment of parallelism for future work. Due to its relative simplicity and portability to general architectures, the development of high-performance quaternion GEMM operations in this work will extend the strategy adopted by the BLIS library for real and complex GEMM operations Van Zee and Smith 2017; Van Zee et al. 2016; Van Zee and van de Geijn 2015] . In the BLIS strategy, the aspects of the GEMM operation which must be explicitly optimized for a specific architecture are factored into a manageably small set of auxiliary procedures, referred to as kernels, while the general scaffold for the GEMM remains consistent between architectures. Further, the structure and function of the kernels yielded by this strategy are designed in such a way that they may be used in the implementation of other BLAS-3 functionality such as rank-k updates (XSYRK), triangular matrix multiplication (XTRMM), etc. In this section, we examine the salient aspects of this strategy which are agnostic to the data representation and arithmetic operations relating to the matrices being multiplied.
The General Algorithm
Consider the GEMM of two matrices
where α, β ∈ F and C ∈ M M, N (F). Computationally, A, B and C are stored as linear, contiguous data structures of lengths MK, KN and MN , respectively. In this work, we will consider column-major storage of matrices, i.e. A µκ and A (µ+1)κ , and A Mκ and A 1(κ+1) are stored contiguously in memory. For comparison in the following, Algorithm 1 outlines the simplest implementation of the GEMM operation which was suggested in the earliest developments of the BLAS standard [Dongarra et al. 1990] . This method will be referred to as the reference GEMM algorithm. To fully understand the drawbacks of Algorithm 1 and to motivate the development of a more optimal algorithm, we must examine that nature of the memory hierarchy on modern computers. Figure 1 illustrates a simplified model of a representative memory hierarchy on a modern computer [Goto and van de Geijn 2008] . At the top of the hierarchy, the fastest and least abundant memory resource are the registers which physically reside on the processor. It is on the data which resides in the registers that the processor may issue instructions such as arithmetic operations, etc. On architectures which support SIMD instructions, i.e vector processors with instruction sets such as SSE, AVX, AVX2 and AVX-512, each floating point register can hold a small number of floating point numbers at a time, typically between 2 and 16. However, their abundance is very limited: 16 registers on SSE, AVX and AVX2, and 32 on AVX-512. Thus, to fully exploit the speed of the registers, care must be taken to carefully populate the data which resides there to minimize the data movement between the registers and other levels of the hierarchy.
At the bottom of the hierarchy is the slowest and largest memory resource: the random access memory (RAM). It is in the RAM that the matrices which participate in the GEMM operation are typically stored. The RAM is the memory resource that resides furthest from the processor, which allows it to be orders of magnitude larger than any of the other memory resources (on the order of 1GB-1TB). However, the penalty for its size and distance from the processor is a high access latency. The speed at which data can be moved to and from RAM varies drastically between different architectures and manufacturers, and also depends on factors such as how the data being moved is laid out in memory (contiguous, strided, cache aligned, etc). Generally, reading to and from RAM amounts to hundreds of clock cycles on modern processors. Due to its very high latency, data movement in and out of RAM must be kept to a minimum to achieve optimal performance. As such data is rarely read directly from RAM to the registers or visa versa. Instead, it is typically the case that data is read to and from the RAM to a low latency intermediary storage, known as the data cache, which resides closer to the processor and is thus capable of moving data to and from the registers much faster than would be possible from the RAM.
Due to the slow rate at which RAM may be accessed, typical memory access patterns dictate that the RAM should be read in large chunks of contiguous data into the cache. Whenever a read instruction is issued from the processor for a particular memory address in RAM, it first checks if that data resides in cache. If the data resides in cache, what is referred to as a cache hit, it may be read directly from cache and avoid the RAM completely. However, if the data does not reside in cache when the instruction is issued, the data must be moved from RAM to cache and then read into the registers. This process is referred to as a cache miss. Due to the large latency differential between RAM and cache, the penalty for a cache miss can often be quite large. Further, the restricted size of the cache only allows a limited amount of data can be stored there at any point in time. When the cache reaches its capacity, data which resides in the cache must be replaced when new data is read in from the RAM. The process by which this replacement happens is referred to as the cache's replacement policy. If data is to be often reused in an algorithm, it is important to ensure that it resides in the cache as often as possible to minimize the probability of a cache miss. Thus, knowledge of the replacement policy is paramount in the development of a strategy for cache
Store ì c (ν ) end population to maximally reuse the data the resides there while not ejecting reusable data with data which is to be used less often.
On contemporary architectures, the cache is divided into cascading "levels": the L1, L2, L3 caches, etc. The capacity and access latencies for the cache levels vary considerably between processor generations and manufacturers; however, the general trend is to lose an order of magnitude on access latency and gain an order of magnitude in capacity between successive cache levels. For example, the Intel(R) Xeon(R) CPU E5-2660 (Sandy Bridge) processor yields cache capacities of 32 kB, 256 kB, 20 MB and access latencies of 4, 12 and 29 clock cycles for the L1, L2 and L3 caches, respectively [Fog 2012] . It is typically the case that the population of the different levels of cache cannot be explicitly programmed; one typically relies on heuristics issued by the CPU, such as data prefetching and cache replacement policies, to perform this population. However, with knowledge of the sizes of the cache levels and replacement policies, one may develop algorithms which aim to populate these caches optimally for data reuse.
From the perspective of effective utilization of the memory hierarchy and the other aforementioned features of computing architectures, there are a number of drawbacks in Algorithm 1:
• All of A is loaded into cache for each column of C, • For large M, K, loading A potentially ejects ì c (ν ) from cache, triggering a cache miss on each update of ì c (ν ) , • There is no useful caching of B, • In a high-level programming language, this algorithm relies on an optimizing compiler to utilize SIMD, FMA, etc.
• Scalable parallelism is non-trivial.
Algorithm 1 is referred to as a memory bound algorithm, i.e. its performance is completely determined by the latency at which data may be moved to and from the RAM. As such, even for relatively small GEMM operations, performance will be sub-optimal [Goto and van de Geijn 2008] . A demonstration of this state of affairs in the context of quaternion GEMM will be given in Sec. 5.
In order to overcome the memory bottle neck, one must develop an algorithm which populates the levels of cache and registers with sub-matrices of A, B and C according how their data may be reused throughout the GEMM operation. For a detailed explanation of the extent to which one may reuse different sub-matrices of A, B and C, we refer the reader to the work of [Goto and van de Geijn 2008] . In general, the mechanism by which one achieves optimal cache utilization is through a layered approach to the GEMM operation [Goto and van de Geijn 2008; Gunnels et al. 2001; Van Zee and van de Geijn 2015; Whaley and Dongarra 1998 ]. An optimized layered GEMM algorithm may be constructed through the specification of three caching parameters: M c , N c , K c ∈ Z + , two register blocking parameters: N r , M r ∈ Z + , two packing kernels: PACK1, PACK2, and a microkernel, KERN. A representative example of such an algorithm, specifically the algorithm which has been proposed in the development of the BLIS framework [Van Zee and Smith 2017; Van Zee et al. 2016; Van Zee and van de Geijn 2015] , is outlined in Algorithm 2. For simplicity in Algorithm 2, we have assumed
However, extension of Algorithm 2 without these constraints is straightforward through zero padding in the packing kernels [Van Zee and van de Geijn 2015] . We note for clarity that the scaling by α in Line 10 of Algorithm 2 may instead be performed in Line 7 for rings F which admit scalar commutivity in the sense of Eq. (20b) (i.e. R and C). Each of these parameters and kernels must be carefully chosen and optimized for each computer architecture of interest. In the following subsection, we examine the nature of each of these moieties and the factors one must consider in their selection.
0:12 David B. Williams-Young and Xiaosong Li
Algorithm 2: General Layered GEMM Algorithm
) (L3 cache)
) (L2 cache) 
Output : Partially updated C r 1 Load C r into registers.
for κ = 1 :
Load ì a r (κ) and ì b r (κ) from ì a p and ì b p into registers.
Register Blocking and The Microkernel
Consider the expression of a specific sub-matrix 
In other words, C r may be expressed as a sum of rank-1 updates over rows and columns of B r and A r , respectively. As this is the fundamental arithmetic operation of the GEMM operation to be performed by the CPU, ì a r (κ) , ì b r (κ) and C r must all reside in the registers for the operation to take place. C r is referred to as the register block of C, with dimensions N r = i 2 − i 1 and M r = j 2 − j 1 . To achieve optimal memory performance, N r and M r must be chosen such that ì a r (κ) , ì b r (κ) and C r may reside in the registers simultaneously in order to avoid data movement between the registers and other levels of the memory hierarchy [Goto and van de Geijn 2008] .
In Algorithm 2, the full product, C, is constructed by successively updating each of its (disjoint) M M r , N r (F) sub-matrices via partial summation (over K c elements) of Eq. (28) with the microkernel performing arithmetic operations which amount to the sum over rank-1 updates. As the arithmetic kernel of the GEMM operation, the microkernel is the fundamental operation which is most sensitive to the underlying computer architecture and is a key factor in the performance of the GEMM implementation. It is in the microkernel that one must explicitly consider microarchitechture specific operations such as SIMD and FMA. As such, optimized GEMM implementations typically do not express the microkernel in a high-level language; it is typically expressed directly in assembly language [Goto and van de Geijn 2008 ; Van Zee and van de Geijn 2015; Whaley and Dongarra 1998] or with use of low-level access paradigms such as vector intrinsics in C++. An abstract template for a generic microkernel implementation is given in Algorithm 3.
There is a subtle, yet crucial aspect of the loop expressed in Algorithm 3 in relationship to Algorithm 2: as all of the arithmetic intensity is folded into the rank-1 updates performed from within the microkernel inner-loop, optimality of the GEMM operation is directly related to the amount of time spent in this loop. In other words, the number of operations performed inside of this loop, whether they be FLOPs or MOPs, must be kept to a minimum to achieve optimal performance. The number of FLOPs required to perform the rank-1 update is fixed based on M r , N r and F, thus optimality is generally achieved through minimizing the number of MOPs performed inside this inner loop. To this end, the microkernel utilizes packed representations,Ã p andB p , of sub-matrices, A r and B r , produced by the packing kernels, PACK1 and PACK2, respectively. The remainder of this section is dedicated to the design and optimization of the packing kernels and caching parameters to achieve optimal data movement between levels of the memory hierarchy and to minimize the number of MOPs required to be performed from within the microkernel.
Sub-matrix Packing for Optimal Data Layout and Cache Utilization
Perhaps the most ingenious aspect of the layered GEMM algorithm outlined in Algorithm 2 is the utilization of auxiliary memory and packing kernels to amortize the cost of data manipulation over the movement of data between the levels of the memory hierarchy [Goto and van de Geijn 2008] . This packing strategy has two primary objectives:
(1) To populate the various levels of the cache with sub-matrices of A and B according to the extent which they will be reused in the GEMM operation as to minimize probability of triggering cache misses, (2) To ensure optimal, contiguous data layouts of the packed sub-matrices to minimize the number of operations (FLOPs and MOPs) which must be performed from within the inner loop of the microkernel.
In the following, we will examine both of these objectives in turn.
To optimize data movement for cache utilization, one must obtain optimal choices for the caching parameters M c , N c and K c for the architecture of interest. Typically, these parameters are chosen such that [Goto and van de Geijn 2008; Van Zee and van de Geijn 2015] :
• Contiguous storage of size N c K c may reside in and be addressed from the L3 cache once the data is loaded from RAM (e.g.
) until it is no longer needed.
• Contiguous storage of size M c K c may reside in and be addressed from the L2 cache once the data is loaded from RAM (e.g.
• Contiguous storage of size K c N r may be moved from the L3 to the L1 cache without triggering a cache miss or cache invalidation (e.g.
Clearly, the choice of these parameters are integrally tied to the sizes of the L1, L2 and L3 caches and the size of the data structure which represents F. Several methods exist for determining optimal choices for the caching parameters. There has been work in the development of analytical models and formulas which take into account the specifics of F and the architecture in question and return optimal values for the caching parameters . Other approaches utilize guided or black-box optimization [Wang et al. 2013; Whaley et al. 2001; Xianyi et al. 2012] , to obtain these parameters. Once these parameters have been determined, the task then becomes to develop efficient packing utilities which optimize the data layout for use with the microkernel. There are a number of desirable features one wishes to express in the data layout of packed matrices,Ã p andB p , to optimize the data movement between the levels of cache and the registers from within the microkernel:
• The elements of ì a r (κ) and ì b r (κ) should be contiguous, respectively. As vectors, this amounts to ensuring ì a r, µ+1 are contiguous in memory, and similarly for ì b r (κ) .
• The elements ofÃ p andB p which contribute to adjacent register blocks of C should be contiguous in memory, i.e. ì a p (i r ) and ì a p (i r +1) should be contiguous in memory.
• For F which is represented by a compound datatype of primitive data, e.g. C and H, the primitive data for contiguous datastrutures which contain elements of type F should be Algorithm 4: Abstract Template for the PACK1 Kernel
arranged into a data layout which allows for a minimum number of MOPs to be performed from within the microkernel, as long as map between the standard and new data layout is space preserving.
To demonstrate what is meant by a space preserving map in this context, consider an complex element, z = a + bi ∈ C, which is represented by two primitive real numbers a, b ∈ R which are contiguous in memory, denoted [a; b]. For a datastructure which contains two contiguous elements z 1 , z 2 ∈ C, the data layouts [a 1 ; b 1 ; a 2 ; b 2 ] and [a 1 ; a 2 ; b 1 ; b 2 ] occupy the same space in memory. Thus a map between these two data layouts would be considered space preserving. While the first two aspects of data packing are well explored in the literature, the latter has not to the best of authors' knowledge. As will be demonstrated in Sec. 4, optimizing the primitive data layout of contiguous quaternion datastructures will prove important in the development of an optimized quaternion GEMM.
The fact that the rank-1 updates required by Eq. (28) and Algorithm 3 involve both row and column vectors, a single packing strategy would not be sufficient to achieve optimal data layout for bothÃ p andB p . Thus, the packing kernels PACK1 and PACK2 must be designed separately to optimize the layouts ofÃ p andB p , respectively. An abstract templates for these packing kernels are given in Algorithms 4 and 5, respectively. We refer the reader to the work of Van Zee, et al [Van Zee and van de Geijn 2015] for an intuitive graphical illustration of the optimal packing procedure. To account for the rearrangement of primitive data in the packing procedure, we have introduced two additional operations, PACKOP1 and PACKOP2, to perform this operation for the kernels PACK1 Algorithm 5: Abstract Template for the PACK2 Kernel
) end end and PACK2, respectively. Note that typical implementations for real and complex GEMM would yield both PACKOP1 and PACKOP2 as either the identity or linear scaling operation.
Due to the large access latency difference between RAM and the other levels of the memory hierarchy, operations performed within PACKOP1 and PACKOP2 have little to no impact on the performance of the GEMM implementation. This is due to the fact that that these operations are to be done in the registers, and are thus amortized over the time it takes to access the data from the RAM. For example, the construction of the packed sub-matrixÃ p in Line 10 of Algorithm 2 requires the scaling of the sub-matrix A
. As there is a two orders of magnitude latency ratio between RAM access (O(100s) of clock cycles) and the FLOP required to scale an element of the matrix (O(4-5) clock cycles), the cost of the scaling operation may be thought of as negligible. The same logic holds true for data rearrangement operations, such as register transpose, which will be explored in the following section.
QUATERNION MATRIX MULTIPLICATION: HGEMM
In this section, we develop the details of a high-performance implementation of quaternion GEMM for the AVX microarchitechture. The primary focus of this section is the development of AVXoptimized versions of the kernels described in the previous section for use with quaternion arithmetic and data structures. In practice, there are two primary features of the AVX microarchitechture that one must consider in the development of optimized GEMM kernels:
(1) processors with support for AVX instructions have (at least) 16 256-bit floating point (YMM) registers, and (2) AVX dictates support for SIMD (but not FMA) arithmetic instructions on these YMM registers.
For the purposes of this work, we will restrict the discussion of kernel development to double precision floating point storage, i.e. each floating point primitive will occupy 64-bits. As such, each YMM register on AVX can hold and perform arithmetic operations on up to 4 double precision floats, simultaneously. In analogy to the DGEMM and ZGEMM naming conventions of real and complex GEMM operations, we will refer to the double precision quaternion GEMM as HGEMM. As an extension of the standard construction of complex datatypes as two contiguous floats, the following developments will describe quaternion datatypes as four contiguous floats, [q 0 ; q 1 ; q 2 ; q 3 ] using the notation of Eq. (1). As such, each AVX YMM register can hold one double precision quaternion (or equivalent) at any point in time.
Batch SIMD Quaternion Multiplication
Critical to the development of an AVX-optimized quaternion microkernel is an efficient strategy for quaternion product using SIMD arithmetic operations. The the product of quaternions given by the Hamilton product in Eq. (4) requires a minimum of 16 FLOPs to complete. As each YMM register in AVX is capable of storing and manipulating 4 floats at once, one could in principle perform some of these FLOPs concurrently if the task is simply to perform a single quaternion product. However, if the task is to perform many quaternion products in a structured manner, as is the case for the rank-1 updates required by Eq. (28), implementations which optimize for a single quaternion product will yield sub-optimal throughput. To leverage the full power of SIMD instructions in this case, one needs to develop a strategy which aims to perform multiple quaternion products simultaneously at the highest throughput possible. As each YMM register is able to manipulate 4 floats, the simplest manner to reach optimal throughput is to perform 4 quaternion products simultaneously.
Consider the batch quaternion product which takes two sets of four quaternions, {p i } 4 i=1 and {q i } 4 i=1 , and returns a set of four quaternion products, {(pq) i } 4 i=1 . For simplicity in the following, we Fig. 2 . Graphical illustration of a 4x4 register transpose of 4 YMM registers containing general contiguous data structures. In the general case, this operation may be completed using 4x VPERM2F128 and 4x VSHUFPD vector instructions and 4 additional YMM registers which may be used as scratch space.
will augment the product operation to perform an update of the result as opposed to an assignment,
where • is the Hadamard (entry-wise) product such that (pq) 1 = (pq) 1 + p 1 q 1 and so on. Each quaternion product (Eq. (4)) may be separated into 4 sets of 4 FLOPs which update each component of the result, respectively. In the following, we examine the update of the scalar component of the product, (pq) 0 i , as a representative example. Note that extension and generalization to vector components of (pq) i is straightforward. The scalar components of each updated quaternion product may be obtained simultaneously by
In the SIMD paradigm, each of these vectors may be represented by a single YMM register. As such, each of these Hadamard products may be performed by the VMULPD vector instruction and each vector addition (subtraction) by the VADDPD (VSUBPD) vector instruction. In this form, the entire batch quaternion multiplication may be completed using 32 vector instructions. The structure of Eq. (30) requires that each of the sets {p i }, {q i } and {(pq) i } occupy 4 YMM registers, with each register containing a particular quaternion component of each element in the set, respectively. In other words, one YMM register contains all of the scalar components for each element of {p i }, one for the scalar components of {q i }, and so on for the vector components of these sets and for the components of {(pq) i }. For clarity in the following, we will denote the YMM register containing the scalar components of {p i }, {q i } and {(pq) i } as P 0 , Q 0 and PQ 0 , respectively, and so on for the vector parts of these sets with indices 1, 2 and 3. Using this notation, we will define the SIMD implementation of Eq. (29) as
For quaternion data structures which store a single quaternion contiguously, such as the one considered in this work, the vector load instruction (VMOVAPD) would populate each register with the 4 components of a single quaternion. As such, one would need to rearrange the quaternion data once it is read into registers in order to utilize Eq. (31). In general, this rearrangement may be achieved by a 4x4 register transpose on each of the quaternion sets. This register transpose will be denoted MM_4x4_TRANSPOSE_PD in the following and is illustrated graphically in Fig. 2 . For clarity, we endow MM_4x4_TRANSPOSE_PD with the function signature
where P 1 is a YMM register containing the components of p 1 , P 2 the components of p 2 , and so on. Remark that the result of MM_4x4_TRANSPOSE_PD is not invariant to the permutation of its parameters. Further, we note that MM_4x4_TRANSPOSE_PD is an involution. In general, register transpose is a relatively expensive operation due to the high aggregate latency of the vector instructions (VPERM2F128 and VSHUFPD) involved in its implementation. However, it will be shown in the following subsection that the special structure of the rank-1 update will simplify and cheapen the general register transpose through the use of optimal packing layouts in the GEMM operation.
The Quaternion Microkernel and Amortization of Register Transpose
Given that AVX only supports 16 YMM registers, the largest register block (Eq. (28)) which allows for C r , ì a r (κ) and ì b r (κ) to all reside in registers simultaneously is given by N r = M r = 2. As such, the quaternion microkernel must perform a sum over 2x2 rank-1 updates to update a register block of C. A single 2x2 rank-1 update requires 4 product evaluations given by
where we have dropped the (κ) super-and subscripts for brevity. Per the discussion of the previous subsection, these product evaluations may be performed simultaneously using SIMD vector instructions given that the register data arrangement adheres to the structure Eq. (31) via Eq. (32). On top of the 32 vector instructions requires to perform the product accumulations, the general scheme for register transpose depicted in Fig. 2 requires an additional 16 register operations: 8 for transposing the components of ì a r and ì b r , respectively. The operation overhead is further compounded by the fact that the microkernel performs many (K c ) rank-1 updates successively, thus this scheme costs 16K c additional operations over the execution of the microkernel. However, such a general approach for register transpose would only be required for 4 unique quaternions, whereas the 4 (unique) products required for the evaluation of Eq. (33) only involve 2 sets of 2 unique quaternions. As such, simplifications to the general register transpose scheme of Fig. 2 may be made in this case.
There are two special cases for register transpose which we must consider for Eq. (33), namely those which represent the data ordering of the elements of ì a r and ì b r , respectively:
Here, A 1 and A 2 hold the components of ì a r,1 and ì a r,2 , respectively, and similarly for B 1 and B 2 for the elements of ì b r . The YMM registers {A i } and {B i } represent the components of ì a r and ì b r in the order which they were passed, i.e. A 0 has the layout [ì a 0 r,1 ; ì a 0 r,1 ; ì a 0 r,2 ; ì a 0 r,2 ] while B 0 has the layout [ ì b 0 r,1 ; ì b 0 r,2 ; ì b 0 r,1 ; ì b 0 r,2 ] and so on. The presence of redundancies in the register transpose allows for factorization of MM_4x4_TRANSPOSE_PD into the convolution of two simpler operations:
(a) (b) Fig. 3 . Alternative register transpose schemes to efficiently handle redundancies in the general 4x4 scheme depicted in Fig. 2 . Figure 3a handles the transpose case for the elements of ì a r (κ) , and Fig. 3b for the elements of ì b r (κ) . Both of these schemes decompose the general register transpose operation into two operations; the first of which is space preserving. The labels beneath the arrows indicate the latency / reciprocal throughput for the instruction on the Intel(R) Sandy-Bridge microarchitechture.
An illustration of this state of affairs is given in Fig. 3 with Figs. 3a and 3b depicting the transpose of elements of ì a r and ì b r , respectively. The first step of Fig. 3a demonstrates the effect of ATRANS1 and the second the effect of ATRANS2, and similarly for Fig. 3b . The most important aspect of the alternative transpose schemes depicted in Fig. 3 is that both ATRANS1 and BTRANS1 are space preserving. As such, they may be factored into the packing scheme as discussed in Sec. 3.3, leading to an amortization of register operations over data movement from RAM. Further, in the case of ATRANS1, not only does this procedure reduce the number of instructions which must be issued from inside the microkernel loop, it does so in a way that the most expensive (highest latency) register operations required for the transpose are amortized in the packing procedure. In the context of Algorithms 4 and 5, this factorization may be accounted for by setting PACKOP1 = α * ATRANS1 and PACKOP2 = BTRANS1. Utilizing this packing strategy, the operation overhead for performing register transpose from within the microkernel is reduced by a factor of 3/4 (16K c to 4K c ). Algorithm 6 outlines the general structure for the AVX optimized HGEMM microkernel. The following section demonstrates its performance.
IMPLEMENTATION AND PERFORMANCE RESULTS
HGEMM, as described in the previous section, has been implemented in the quaternion BLAS (HBLAS) component of the HAXX library. HAXX (Hamilton's Quaternion Algebra for CXX) [Willams-Young 2019 ] is a modern C++ software infrastructure developed to enable efficient scalar and linear algebra operations using quaternion and mixed-type (quaternion-complex, quaternionreal) arithmetic. As of this work, HAXX provides reference and optimized serial implementations for a representative subset of BLAS-1,2,3 functionality. For the optimized implementation of HGEMM in HAXX, the arithmetic microkernel has been implemented using C++ vector-intrinsics rather than the assembly implementations which have become ubiquitous in high-performance implementation of DGEMM and ZGEMM. This has been done primarily for the fact that vector intrinsics offer a reasonable balance between transparency in the code-base and potential performance from low-level access to assembly instructions, even if this transparency comes at a slight performance degradation. In this section, we provide performance results for the reference and optimized HGEMM implementations in HAXX for the AVX microarchitechture. All timing results were obtained using an Intel(R) Xeon(R) CPU E5-2660 (Sandy Bridge) @ 2.20 GHz (max 3.0 GHz). The E5-2660 processor admits cache sizes of 32 kB, 256 kB and 20 MB for the L1, L2, and L3 caches respectively. The L3 cache is shared among all cores on the CPU. Theoretical (serial) peak matrices with α = 1 and β = 0. Timings for the HGEMM implementations are for the matrix product operation on M N (H) while those for ZGEMM are for the analogous product operation on M 2N (C) (see Eq. (26b)). The comparison between complex and quaternion operations are presented in this manner to demonstrate the efficacy of the quaternion operation over the complex operation for the same arithmetic operation, i.e. the results of these operations represent the same mathematical object (up to an isomorphism). There two primary results which are to be taken from these numerical experiments:
(1) The timing comparisons depicted in Fig. 4a illustrate that quaternion arithmetic alone is not sufficient to obtain performance leverage over tuned complex matrix multiplication. The reference HGEMM implementation is significantly less performant than the ZGEMM implementation found in MKL, while the AVX optimized HGEMM implementation outperforms the ZGEMM operation by roughly a factor of 2 (as would be expected from Tab. 2). Further, as was described in Sec. 3, the reference HGEMM implementation performs significantly under the theoretical peak performance (~7 GFLOP/s vs 24 GFLOP/s) due to the algorithm being memory bound. (2) Despite a slight difference in the FLOP rate in the GEMM implementations depicted in Fig. 4b (~22 GFLOP/s for ZGEMM-MKL and~21 GFLOP/s for HGEMM-OptAVX), the optimized HGEMM implementation consistently outperforms the optimized ZGEMM implementation even for large matrices (N > 3000).
CONCLUSIONS
In this work, we have demonstrated the efficacy and potential of high-performance quaternion linear algebra to leverage performance increases over complex linear algebra for special class of matrices through the efficient implementation of the quaternion matrix product. The software development proposed in this work extends the existing theory of high-performance serial matrix multiplication for use with explicitly quaternion arithmetic, as outlined in Secs. 3 and 4. A series of numerical experiments given in Sec. 5 have illustrated performance comparisons between reference quaternion, optimized quaternion, and vendor tuned complex GEMM implementations. It was shown that exploitation of quaternion arithmetic alone is not sufficient to outperform high-performance implementations of complex GEMM and that analogous implementations of high-performance quaternion GEMM are necessary to leverage such improvements. Further, it was shown that even in the presence of slight difference the FLOP rate comparisons, the optimized implementation of quaternion GEMM outperforms the optimized implementation of complex GEMM for the analogous arithmetic operation. We note for completeness that while Intel(R) Sandy Bridge and the AVX instruction set are not contemporary in and of themselves, they representative of more contemporary architectures such as the Intel(R) Haswell and AMD(R) Excavator architectures which support the AVX2 instruction set. In the context of the GEMM operation, the primary feature introduced in these architectures is FMA arithmetic instructions. With the exception of architectures which support the AVX-512 instruction set (such as Intel(R) Skylake-X and Intel(R) Knight's Landing), the SIMD vector units on architectures which support either the AVX or AVX2 instruction sets are 256 bits in length. Thus with the exception of the arithmetic kernel (Eq. (31)) and the optimal values of the caching parameters, the remainder of the findings in this work would would be invariant between AVX and AVX2. In summary, the optimized implementation of quaternion GEMM provided by the HAXX library was shown to outperform its MKL optimized complex analogue by roughly a factor of 2 (as would be expected from the discussion in Sec. 2.2). As the architecture on which the numerical experiments were performed is a representative example of modern HPC architectures in general, the results presented in this work would translate to other architectures given that one provides optimized versions of the GEMM kernels for the architecture in question.
As the matrix product is the fundamental building block for the development of important operations such as eigendecomposition and matrix factorization, its efficient implementation is a necessary condition for high-performance linear algebra software. In order for quaternion linear algebra to be a viable alternative to complex linear algebra in problems which it may be applied, optimized implementations of quaternion operations which outperform their complex counterparts must be developed. Although the power of the theory of quaternion algebra in the context of scientific theory and computation has been known for some time, prior to this work, no performant implementation of quaternion linear algebra has been available. It is our hope that the software developments presented in this work will aid and spark interest in the future development of high-performance quaternion linear algebra such that the full power of quaternion arithmetic may be leveraged in computationally intensive fields such as scientific computing and image processing.
