A parallel, blocked, one-sided Hari-Zimmermann algorithm for the generalized singular value decomposition (GSVD) of a real or a complex matrix pair (F, G) is here proposed, where F and G have the same number of columns, and are both of the full column rank. The algorithm targets either a single graphics processing unit (GPU), or a cluster of those, performs all non-trivial computation exclusively on the GPUs, utilizes their resources to almost the full extent with data large enough, requires the minimal amount of memory to be reasonably expected, scales satisfactorily with the increase of the number of GPUs available, and guarantees the reproducible, bitwise identical output of the runs repeated over the same input and with the same number of GPUs.
). The recent work [17] has shown that such method can be successfully blocked and parallelized for the CPUs with the shared memory, and for the clusters of those, albeit only the real matrix pairs were considered therein. Even the sequential but blocked version outperformed the GSVD algorithm in LA-PACK [1] , and the parallel ones exhibited a decent scalability.
On the other hand, an efficient, parallel and blocked onesided Jacobi-type algorithm for the "ordinary" and the hyperbolic SVD [14, 15] of a single real matrix has been developed for the GPUs, that utilizes the GPUs almost fully, with the CPU serving only the controlling purpose in the single-GPU case.
This work aims to merge the experience of those two approaches, and present a parallel and blocked one-sided (also called "implicit") Hari-Zimmermann algorithm for the GSVD on the GPU(s) as an extension of the latter, but for the complex matrix pairs as well as for the real ones.
Even though the research in parallelization of the GSVD has a long history [11, 2] , three novel and major differences from the earlier, Kogbetliantz-based procedures aim to ensure both the high performance and the high relative accuracy of this one: using the implicit Hari-Zimmermann algorithm as the basic method, that is blocked to exploit the GPU memory hierarchy, and the massive parallelism of the GPUs that suits the algorithm (and vice versa) perfectly. This paper continues with section 2, where the complex and the real one-sided Hari-Zimmermann algorithms are introduced, together with the general, architecturally agnostic principles of their blocking and parallelization. In section 3 the single-GPU implementation are described in detail, while in section 4 the most straightforward multi-GPU implementation approach is suggested. The numerical testing results are summarized in section 5, and the paper concludes with some directions for future research in section 6. In Appendix A a nonessential but computationally cheap way for enhancing the accuracy of the real and the complex dot-products on the GPUs is proposed. The full testing results and some details of the chosen Jacobi strategies are provided as the supplementary material.
The complex and the real one-sided Hari-Zimmermann algorithms
In this section the complex and the real one-sided HariZimmermann algorithms are briefly described. Please see [7, 8] for a more thorough overview of the two-sided algorithms, and [17] for a detailed explanation of the real implicit HariZimmermann algorithm. This paper closely follows the terminology and the implementation decisions of [20] , where the complex generalized hyperbolic SVD based on the implicit Hari-Zimmermann approach has been introduced, but without the hyperbolic scalar products (i.e., the signature matrix J is taken to be identity here) and without forming the right generalized singular vectors X.
Let the matrices F and G be of size m F × n and m G × n, respectively, with min{m F , m G } ≥ n. Then, Z is square of order n, and assume that n ≥ 2. Otherwise, for n = 1, the GSVD of (F, G) is obtained by taking
Even though the algorithm works on the rectangular matrices, it might be beneficial performance-wise to avoid transforming very tall and skinny (block)columns by working on the square matrices instead. To shorten F and G, the problem is transformed by computing the QR factorization of F with the column pivoting, FP 1 = Q F R F , and then G, with its columns prepermuted by P 1 , is shortened by the column-pivoted QR factorization, (GP 1 )P 2 = Q G R G . The square matrices F := R F P 2 and G := R G , both of order n, take the place of F and G in the algorithm, respectively. With Σ = Σ in the decompositions of (F, G) and of (F , G ), the matrix Z from the former, soughtfor decomposition can be recovered by using P := P 1 P 2 and the computed Z from the latter as Z := P Z .
It is assumed that diag(B) = I, i.e., the column norms of G are unity. Should it not be the case, F and G are then prescaled by a nonsingular, diagonal matrix Z 0 , where (Z 0 ) j j := 1/ g j F , g j is the jth column of G and 1 ≤ j ≤ n; otherwise, Z 0 := I.
The iterative transformation phase starts with the matrix pair 
Simultaneous diagonalization of a pair of pivot matrices
An iteration (or "step") k ≥ 0 of the sequential non-blocked Hari-Zimmermann algorithm consists of selecting a pair of indices (i k , j k ), 1 ≤ i k < j k ≤ n, and thus two 2 × 2 pivot submatrices, one of A k := F * k F k , If Z k is embedded into an n × n matrix Z k such that Z i k i k ;k := Z 11;k , Z i k j k ;k := Z 12;k , Z j k i k ;k := Z 21;k , Z j k j k ;k := Z 22;k , while letting Z k be the identity matrix elsewhere, then looking two-sidedly the congruence with Z k transforms the pair (A k , B k ) into a pair (A k+1 , B k+1 ), where A k+1 := Z * k A k Z k and B k+1 := Z * k B k Z k . Onesidedly, the transformation by Z k orthogonalizes the i k th and the j k th pivot columns of F k and G k to obtain F k+1 := F k Z k and G k+1 := G k Z k . Also, Z k is accumulated into the product Z k+1 := Z k Z k . In a one-sided sequential step only the i k th and the j k th columns of F k , G k , and Z k are effectively transformed, in-place (i.e., overwritten), postmultiplying them by the 2 × 2 matrix Z k , while the other columns of these matrices remain intact:
f i k ;k+1 f j k ;k+1 = f i k ;k f j k ;k · Z k , g i k ;k+1 g j k ;k+1 = g i k ;k g j k ;k · Z k , z i k ;k+1 z j k ;k+1 = z i k ;k z j k ;k · Z k .
As diag( B k+1 ) = diag( B k ) = I 2 , it follows that diag(B k+1 ) = diag(B k ) = I n . However, due to the floating-point rounding errors, these equations might not hold. To prevent diag( B k ) to drift too far away from diag(I 2 ) as the algorithm progresses, the version is equivalent to the standard (previously described) one, for which it can be formally set A k := A k and B k := B k .
Suppose that Z k has been computed (by either version) such that it diagonalizes A k and B k , but a i k i k ;k+1 < a j k j k ;k+1 . To keep diag( A k ) sorted descendingly, swap the columns of Z k by a permutation P k := 0 1 1 0 to obtain Z k := Z k P k . Such Z k will swap the i k th and the j k th columns of F k and G k as it orthogonalizes them. Sorting in each step is a heuristic that speeds up the algorithm notably in practice (see section 5), but it makes reasoning about the convergence harder and is not strictly necessary.
Computing Z k from A k and B k is more involved in the complex case than in the real one. However, in both cases, first it is established whether the i k th and the j k th columns of F k and G k are numerically relatively orthogonal,
where ε is the precision of the chosen floating-point datatype 1 .
If they are, no non-trivial transformation is to take place, and 1 The relation relies on the expected (as opposed to the worst case) rounding Z k := P k , since still the column swap may be warranted. Rescaling by D k is thus not performed even for D k I 2 , since it might perturb the columns sufficiently enough for them to cease to be numerically orthogonal.
The complex case
The transformation matrix Z k is sought in a form [7, 20] 
To that end, let 
and, noting that t k > 0 since B k is positive definite, with these quantities compute
In these ranges of the angles, for θ ∈ {2ϑ k , γ k } the trigonometric identities cos θ = 1/(1+tan 2 θ) and sin θ = tan θ cos θ hold when θ < π/2. Otherwise, tan θ = ∞, cos θ = 0, and sin θ = 1. Then, compute cos(2ϑ k ), sin(2ϑ k ), cos γ k , and sin γ k , and with them finally obtain
where 0 ≤ ϕ k < π/2 and 0 ≤ ψ k < π/2.
error for the dot-products [4] that form the elements of A k and B k , and while sensible in the real case, it is probably too tight in the complex case, where a more careful analysis of the complex dot-products might be employed in the future work and a handful of transformations subsequently might be skipped.
and a i k i k ;k = a j k j k ;k , then tan γ k is undefined, and tan(2ϑ k ) might also be. In that case, it can be shown that A k and B k are diagonalized by
The real case
The transformation matrix Z k is sought in a form [7, 17] 
To that end, let x k := b i k j k ;k and
and compute
Note that cot(2ϑ k ) and cot ϑ k (and the corresponding tangents) have the same sign in the range of ϑ k . Assuming that the floating-point arithmetic unit does not trap on ±1/0 and 1/∞, obtain tan ϑ k as
, and from it cos ϑ k and sin ϑ k using the same trigonometric identities as in the complex case. Finally, compute
An exception. Since the real case is in fact a simplification of the complex case, when cot(2ϑ k ) is undefined, being 0/0, i.e., when a i k i k ;k = a j k j k ;k and a i k j k ;k = a i k i k ;k b i k j k ;k (or, in other words, when A k and B k are proportional), define
Parallelization of the one-sided algorithm
The sequential one-sided algorithm in each step chooses a single pivot index pair, according to some criterion that is called a sequential Jacobi strategy. However, at most n/2 pivot column pairs of each matrix can be transformed concurrently if the indices in all index pairs are distinct.
In a parallel step k ≥ 0 a sequence (i ( )
k ) of pivot index pairs, where 1 ≤ ≤ n/2 , such that each index in the range from 1 to n appears at most (and for even n, exactly) once in the sequence, addresses n/2 pivot column pairs of A k and B k to be transformed-each pair by a separate, concurrent task.
All permutations of a given (i ( ) k , j ( ) k ) are equivalent from the numerical point of view, since the resulting A k+1 and B k+1 are the same for every reordering of the sequence, and therefore any reordering represents the entire equivalence class.
For simplicity, a barrier is assumed between the successive parallel steps, i.e., all tasks of a step have to be completed before those of the following step are started.
A criterion to choose a pivot index pair sequence for each parallel step is called a parallel Jacobi strategy. Among the strategies that are simplest to compute are the ones that prescribe a pivot sequence for each step, until all n(n − 1)/2 possible index pairs (i, j) are selected at least once. Then, the choice of the steps is periodically repeated. Let s be the shortest such period. The first s steps constitute the first sweep, the following s steps the second sweep, and so on.
If in any sweep exactly n(n − 1)/2 different index pairs are chosen, such a strategy is called cyclic; otherwise, some index pairs are repeated in a sweep, and the strategy is called quasicyclic. For even n, s ≥ n − 1, and the equality holds if and only if the strategy is cyclic.
A strategy is defined for a fixed n; however, by a slight abuse of the usual terminology, a single principle by which the particular strategies are generated for some given matrix orders will simply be called a strategy kind, or even a strategy for short.
Based on the previous experience with the one-sided Jacobilike algorithms, two parallel Jacobi strategy kinds have been selected for testing: the modified modulus (mm; see, e.g., [16, 17] ), quasi-cyclic with s = n, and the generalized MantharamEberlein (me; see [12, 14] ) cyclic one. Please see Figures 1 and   2 in the supplementary material, where a sweep of me and of mm, respectively, is shown two-sidedly on a matrix of order 32.
Blocking of the one-sided algorithm
Parallelization alone is not sufficient for achieving a decent performance of the algorithm on the modern architectures with multiple levels of the memory hierarchy.
The pointwise algorithm just described is therefore modified to work on the block columns of the matrices, instead of the columns proper. Each block column comprises an arbitrary but fixed number w, 1 < w < n/2 , of consecutive matrix columns.
Instead of 2 × 2 pivot submatrices of A k and B k , in the blocked algorithm 2w × 2w pivot submatrices A ( ) k and B ( ) k are formed in the kth (parallel or sequential) step by matrix multiplications,
, and Z k , which leads to the full block (fb) algorithm [10] , as called in the context of the Jacobi methods, or their off-diagonal norms can be reduced by a sequence of congruences accumulated into Z ( ) k , which is called the block-oriented (bo) algorithm [9] . The idea behind blocking is that A 
The computation of Z ( ) k in either fb or bo can be done by any convergent method; a two-sided method can be applied straightforwardly, but for the one-sided approach A ( ) k and B ( ) k have to be factorized first by, e.g., the Cholesky factorization
and then the same implicit Hari-Zimmermann method, either pointwise or blocked, and in both cases, either parallel or sequential, can be recursively applied to F 
In both cases, let However, the QR factorization (even without the column pivoting) of a pair of the tall and skinny block columns comes with a significant performance penalty on a GPU compared to the Cholesky factorization of a small, square pivot submatrix [14] , and the pivoted Cholesky factorization does not avoid a possibility of getting a severely ill-conditioned A ( )
by multiplying an ill-conditioned pair of block columns by itself. Therefore, both of these enhancements are only briefly described here but have not been tested.
In the following, the blocked algorithm is assumed to form the pivot submatrices as
i.e., each one by a ZHERK (DSYRK in the real case) like operation in the BLAS terminology, and the non-pivoted Cholesky factorization is then used to obtain F ( )
Rescalings
Observe that Z This is exactly the same scaling as it would be performed in the last, post-iterative phase of the algorithm,
do not have to be rescaled and the norms of their columns do not have to be kept as they are all discarded immediately after Θ ( ) k has been computed. 6
k , is applied to the pivot block column pair of F k , G k , and Z k instead of Z ( ) k , and is considered embedded into Z k in a similar way as Z ( ) k would be in the pointwise case, i.e., starting from Z k being I n , for each let
where the subscripting is to be interpreted as in Fortran.
To reduce the norm of the entire Z k , a similar rescaling can be applied on Z k , using the column norms of F k and G k , after each but the last block sweep. After the last block sweep, a rescaling of all three matrices (F N , G N , and Z N ) is performed to obtain U, V, and Z, with the extraction of Σ F , Σ G , and Σ.
Convergence
The inner level of the algorithm stops when there were no transformations, apart from the sorting permutations, applied in a sweep, or when the prescribed maximal number of sweeps has been reached. Then, the pivot block column pairs of F k , G k , and Z k are updated concurrently for all by Z ( ) k , which can be skipped for those where Z ( )
The same criterion could be used for the outer level, where the count of transformations applied in an outer sweep is a sum of all transformations applied in the inner level in all steps of the outer sweep. However, this criterion has to be relaxed [14, 17] , since the rounding errors in forming and factorizing the block pivot submatrices could spoil the attained numerical orthogonality of the original columns, and introduce a small number of unwarranted transformations that prevent the algorithm from detecting convergence even if it has in fact been reached. Therefore, the transformations are divided in two classes:
"big" and "small". The latter are all Z ( ) k where either:
i.e., where Z ( ) k is close to (a multiple of) identity, and the former are all other transformations. Note that neither definition of the small transformations implies that sin ϕ k or sin ψ k are numerically equal to zero (and are usually not). Also, since x k tends to zero and therefore t k to one in the last sweeps of the algorithm, the first and the second definition should not differ significantly.
There are separate counters of the big transformations, and of all transformations applied in the inner level of the algorithm.
The inner level still halts when there were no transformations of any class in a sweep, but the outer level stops when there were no big transformations applied in an outer sweeps (i.e., small transformations are allowed to occur but do not spoil the overall convergence). Such a heuristic criterion prevents in practice a long sequence of outer sweeps at the end of the algorithm, with only a few transformations close to identity in each.
Variants of the algorithm
To summarize the variants of the algorithm, see Table 1 .
The first column, ID, sets a shorthand for the corresponding ID convergence transformations dot-products
variant. The second column specifies a convergence criterion used. The third column distinguished between assuming the column norms of the second matrix to be unity, and rescaling of both matrices with each transformation. The fourth column relates to computing the dot-products the usual way, or by an enhanced, possibly more accurate procedure from Appendix A. Unless specified otherwise, the column sorting is employed in all cases. Thus, e.g., DHZ3-(mm-fb-me) refers to the double-precision real implicit blocked Hari-Zimmermann algorithm with ID equal to 3, using mm at the outer and me at the inner level of blocking of type fb. Similarly, ZHZ0-(me-bome) stands for the double-precision complex implicit blocked
Hari-Zimmerman algorithm with ID equal to 0 (the "standard" variant), using me at both levels of blocking of type bo.
From now on, when a numeric variant ID is mentioned in the text, it is assumed that it should be looked up in Table 1 .
The single-GPU implementation
In this section the single-GPU implementation of the complex and the real one-sided Hari-Zimmermann algorithms are described. The focus is on the complex algorithm, and the real one is commented on when substantially different. The target framework is CUDA C++ [18] for the NVIDIA GPUs (Kepler series and newer), but also another general-purpose GPU programming environment with the analogous concepts and constructs could probably be used.
Data layout and transfer
Due to blocking employed by the algorithm, each matrix is viewed as column-striped, with the block columns containing w = 16 consecutive columns each. To simplify the implementation, assume that n is a multiple of 32, and let n := n/w (so n is even). If the assumption does not hold for the input, the matrices should then be bordered by appending 32 − (n mod 32) columns to the right, and as many rows to the bottom. The ele-
. . , in the columns newly added to the matrix Y ∈ {F, G} should be set to unity, to avoid introducing zero columns, since it is essential for Y to be of full column rank. Other bordering elements should be set to zero, to prevent any transformation not implied by the original matrices from happening (see a bordering example in [16] ).
Another assumption, to simplify the loop unrolling in various parts of the code, is to have m F and m G as a multiple of 64.
If it is not the case with the input, then, after a possible bordering as described above, 64 − (m Y mod 64) rows of zeros should be appended to the bottom of the matrix Y ∈ {F, G}.
The CPU and the GPU RAM layout and transfer
Data is laid out in the GPU RAM (also called "global memory" in the GPU context) in the following sequence:
The vectors (Σ, Σ F , and Σ G with double-precision elements, each of length n, and C holding unsigned 8-byte integers, of length n), are output only, while the rest of data is used both for input and output, i.e., the six double-precision matrices are constantly being read and overwritten within the GPU as the algorithm progresses. The matrices are loaded to the GPU at the beginning of the algorithm's execution, if they are not already in place as a result of another computation, and optionally copied to the CPU at its end, as well as Σ, Σ F , and Σ G .
In the pre-and post-processing stages on the CPU, input (F, G) and output data (U in place of F; V in place of G; and Z), respectively, is repacked from (or to) the standard representation of complex matrices, in which the successive elements are complex numbers z = (Re(z), Im(z)). Each double-precision matrix can therefore be loaded to, or copied from, the GPU with a single CUDA call.
The purpose of splitting the real and the imaginary matrix parts is twofold. First, as there is no direct support for the standard C (with Complex types) or C++ (with std::complex types) complex arithmetic in CUDA, apart from the ones provided by the datatypes and the routines from the cuComplex.h header file or the thrust library, a decision was made to keep all data in real-typed variables. Second, reading or writing only one component of the successive matrix elements is straightforward, and can be performed by a contiguous memory access.
In the auxiliary vector C there are two counters, C (0) and
, where is the index of a thread block in a grid of the main computational kernel. In C (0) the count of the "big" transformations, and in C (1) the count of all transformations applied in all 8 kernel launches within a single block sweep are accumulated.
At the beginning of each block sweep C is zeroed out on the GPU, and is copied to a CPU vector C at the end of the sweep.
The shared memory layout
For each thread block, the non-constant, non-register data (comprising three complex matrices: F, G, and Z) for the main computational kernel is laid out in the shared memory as:
Each double-precision matrix is square, of order 32, with the elements stored in Fortran array order, for a total shared mem- 
The constant memory layout
The constant memory on the GPU holds the pointers to the matrices and the vectors described above, with their dimensions, to avoid sending them as parameters in each kernel call.
The Jacobi strategy table for the first, pointwise level of the algorithm is also stored in the constant memory, since it does not depend on the actual input data.
The strategy table contains 31 (or 32) rows, same as the number of steps of a chosen (quasi-)cyclic parallel strategy.
Each row is an array of 16 index pairs (p, q), with p < q, where no two indices in a row are the same. A pair of such indices addresses a pair of columns of the matrices F and G to be transformed concurrently with all other column pairs in the step.
Constants in the global memory
The Jacobi strategy table for the second, block level of the algorithm might not fit in the constant memory for the large n, so it has to be stored in the global memory in such a case. It is similarly formatted as the table for the pointwise level, but with n − 1 (or n) rows, each with n/2 index pairs. Here, a pair (p, q), Both tables are precomputed on and preloaded from the CPU [15, 20] before any computation starts on the GPU.
Arithmetic operations
Since the data is held in the real-valued arrays only, the complex arithmetic is performed manually, computing the real and the imaginary parts of the result separately, rather than assembling the complex operands in the CUDA format each time an operation has to be performed, and disassembling the result when it has to be stored back in memory.
Complex arithmetic
The arithmetic operations on complex numbers needed by the algorithm are addition, subtraction, negation, complex conjugation, multiplication by a complex or a real number (or an inverse of the latter), and taking the absolute value. Only |z|, a·b, and an FMA-like operation a · b + c (i.e., a complex multiplication and an addition "fused") require special attention, while the rest are trivial to express using the real arithmetic directly in the code.
The absolute value is obtained as |z| := hypot(Re(z), Im(z)), without undue overflow. Still, it is possible that |z| overflows when at least one component of z is close enough by magnitude to the largest representable finite double-precision number, but such a problem can be mitigated by a joint downscaling of two matrices under transformation. For example, a scaling by 1/2 would suffice, and would also keep the significand intact for all normalized (i.e., finite non-subnormal) numbers. Such rescaling has not been implemented, though it would not be overwhelmingly hard to apply the rescaling and restart the computation if any thread detects that its |z| operation has overflowed, and makes that known to other threads in a block by a subsequent syncthreads count CUDA primitive invoked with a
Boolean value indicating the presence of an overflow. For c = 0 it holds zfma(a, b, c) = zmul(a, b) for all a and b.
Real arithmetic
The real arithmetic uses operations with the accuracy guarantees mandated by the IEEE 754 standard for floating-point arithmetic in rounding to nearest (ties to even) mode, except in the optional enhanced dot-product computation, where rounding to −∞ is also employed, as described in Appendix A.
A correctly rounded (i.e., with the error of no more than half ulp) double-precision rsqrt(x) := 1/ √ x device function, provided by Norbert Juffa in private communication, that improves the accuracy of the CUDA library routine of the same name, is called wherever such an expression has to be computed.
Integer arithmetic
To keep the memory requirements low, the pointwise level indices in the strategy table are stored as unsigned 1-byte integers, while the block level indices occupy 2 bytes each (i.e., n ≤ 65536, what is enough to exceed the RAM sizes of the present-day GPUs).
For dimensioning and indexing purposes the unsigned 4-byte integers (after a possible promotion) are used, since their range allows for addressing up to 32 GiB of double-precision floating-point data, which is twice the quantity of GPU RAM available on the testing hardware. However, 8-byte integers should be used instead if the future GPUs provide more memory than this limit.
Although Fortran array order is assumed throughout the paper and the code, the indices on a GPU are zero-based. The CUDA thread (block) indices blockIdx.x, threadIdx.x, and threadIdx.y are shortened as b x , t x , and t y , respectively.
Initialization of Z (with optional rescaling of F and G)
Here, the first of three computational kernels, initFGZ 2 , is described. Its purpose is to initialize the matrix Z, having been zeroed out after allocation, to Z 0 , a diagonal matrix such that (Z 0 ) j j := 1/ g j F , and to rescale F and G to F 0 and G 0 , by multiplying the elements of each column j of the matrices by (Z 0 ) j j , in the variants 0, 1, 4, and 5. Else, Z 0 = I n .
The kernel is launched once, before the iterative phase of the algorithm starts, with a one-dimensional grid of n/2 thread blocks, each of which is also one-dimensional, with 64 threads (i.e., 2 warps of 32 consecutive-numbered threads).
Each warp is in charge of one column of F, G, and Z, i.e., its threads access only the elements i of that column j, where
A warp reads 32 consecutive elements of Re(G) j and Im(G) j at a time. Each thread updates its register-stored partial sumŝ
using one FMA operation for each update, and this is repeated by going to rows i : Either way, z j [t x ] := 1/ g j 2 F is then computed, and the jth columns of F and G are scaled by z j [t x ] in a loop similar to the one described above, i.e., for i in steps of 32 while i < m F ,
and then the same scaling is performed on G, with i < m G .
Finally, z j [t x ] is written to Re(Z) l j by the lowest-numbered thread in a warp, i.e., t x ≡ 0 (mod 32), where l is an index making a physical column j treated as a logical column l. In the single-GPU case, l = j. In the variants 2, 3, 6, and 7, Re(Z) l j is set to 1 and no other processing occurs. 
F is computed, and if non-unity and f, g j is scaled by 1/ g j 2 F , as well as
Finally, if f, Σ j , Σ F; j , and Σ G; j are written to the GPU RAM by a thread t x ≡ 0 (mod 32). All variables indexed by t x above are per-thread and register-stored, unless a register spill occurs.
The main computational kernel
The main kernel comes in bstep1s and bstep1n versions, where the former is the default one, with the column sorting, while the latter is a non-sorting version.
The kernel is called once per a block step, and that call, followed by an explicit device-wide synchronization (cudaDeviceSynchronize) constitutes the entire block step. The synchronization occurs also after overwriting the memory with zeroes, initFGZ, and rescale calls. From the CPU perspective, the algorithm is therefore purely sequential.
The kernel's grid is one-dimensional, with n/2 two-dimensional thread blocks, each of them having 32 × w = 512 threads.
A thread block := b x in the block step k := k mod n is in charge of one pivot block column pair, (p ( )
, of F, G, and Z, where n is n − 1 for the me or n for the mm strategy kind.
The computational subphases of bstep1(s/n)(k), Since no two pivot block index pairs share an index, all thread blocks can be executed concurrently without any interdependencies or data races. Due to the shared memory requirement, a high thread count and a heavy register pressure, it is not expected that more than two (or, in the real case, three) thread blocks could share a single GPU multiprocessor. Therefore, for the matrices large enough, only a fraction of all thread blocks in the grid can execute at the same time on a GPU. It is a presumption (but not a requirement) that the CUDA runtime shall schedule a thread block for execution at an early opportunity after a running one terminates, thereby keeping the GPU fully occupied despite of the possible execution time variations (i.e., the number of the inner sweeps and the transformations required) among the thread blocks, especially in the fb case.
Note that t y addresses a warp, 0 ≤ t y < w, and t x , 0 ≤ t x < 32, denotes a lane (a thread) within the warp. Throughout a thread block, each warp is in charge of two "ordinary" (i.e., not block) columns, in the global or in the shared memory, but of which two varies between and within the subphases. 
Each thread holds four register-stored variables,
initially set to zero, that hold the real (first two) and the imaginary (last two) parts of two (partial) dot-products of the columns of F k and, in the second instance, of G k , where t y := t y + w. For each j, let t x := (t x + j) mod 64, and
Then, two fused multiply-add operations perform the updates,
The first updates constitute a computation of the dot-product of The first loop iterates over 0 ≤ j < w. First, the jth diagonal element of Re(A), a j j , is read (the imaginary part is assumed to be zero) if t y = j and t x ≥ j (i.e., in the threads of the jth warp which correspond to the lower triangle, called the "active" threads), and the thread block is then synchronized.
The active threads then scale the jth column below the diagonal, each thread the real and the imaginary part of its element in the t x th row, by 1/ a j j , while the diagonal is set to ( a j j , 0), and the thread block is then synchronized.
Next, the columns to the right of the jth have to be updated, with all warps (but not all their threads) participating in the up-
and the thread block is synchronized. However, this only updates the columns from j + 1 to j + w. The same update has to be performed with j := j + w instead of j , i.e., if t x ≥ j (which also ensures that j < 32),
and another thread synchronization occurs.
The second loop over w ≤ j < 32 is identical to the first one, except that t y is used instead of t y and the second updates (of the j th columns) are not needed since j ≥ 32. The pointwise implicit Hari-Zimmermann algorithm is described in section 2, subsections 2.1, 2.2, and the relevant parts of subsections 2.4 and 2.5, and implemented as follows.
13
The t y th warp transforms the pairs of columns of F, G, and Z in each inner step l ≥ 0. Let l := l mod 31, since the me strategy is used exclusively at the inner level in the tests. Each of the three pivot pairs comprise the columns indexed by p y;l and q y;l , where the indices are read from the lth row of the inner strategy table at the position t y . Within a warp, the t x th thread is responsible for the elements in the t x th row of those columns.
First, Z is initialized similarly to the procedure described in subsection 3.3, but on the shared memory level. In the variants 2, 3, 6, and 7, the diagonal of Re(Z) is set to unity, and the rest to zero, by the threads in charge of those elements. In the variants 0 and 4, the sum of squares of the magnitudes of the elements of the columns g j , i.e., g j In the step l and the warp t y , let i := p y;l and j := q y;l . The elements of the three pivot column pairs are loaded into the registers by each thread reading its row from the shared memory, after which the thread block is synchronized. For each original element, there are two variables for its real and imaginary parts, and two more variables to hold the value of the new element after transformation, since the old value is used twice in computing the new one and thus cannot be overwritten. 
where Y ∈ {F, G, Z}. If one or both scaled cosines lying on the diagonal of Z l are equal to one, the transformation can be (and is) simplified by removing the corresponding multiplications without numerically affecting the result.
In bstep1s, to determine if the column swap is required, the squares of the norms of the transformed columns of F are computed as the sum-reduced sums of squares of the magnitudes of the new (F ) elements, depending on the variant. Those two values are however not stored for the next step, because that would require an additional shared memory workspace that might not be available on all supported architectures.
In the real case it is easy to compute instead the transformed diagonal elements of the first pivot submatrix directly [17] : The initial skews are defined as ı := (t y + t x ) mod 32 and ı := (t y + t x ) mod 32. Then, in each iteration of the unrolled inner loop over 0 ≤ j < 32 the local elements of Π are updated,
after which ı and ı are cyclically shifted as ı := (ı + 1) mod 32 and ı := (ı + 1) mod 32. When the inner loop terminates, the thread block is synchronized. This procedure is called thrice to update F k , G k and Z k , after which the kernel execution (i.e., the kth outer step) terminates.
Dataflow and the shared memory perspective
In Figure 1 Apart from the copy-ins, copy-outs, and zeroings of data, there is no scope for using any non-default CUDA streams.
Also, as no GPU computation, except the fast rescale, can be overlapped with any CPU task, the execution time of the algorithm depends almost solely on the GPU performance and the time required to set up a kernel call. No part of the algorithm might benefit from being executed multi-threadedly on the CPU, except the reductions ofs andb for very large matrices, which have not been so implemented.
Note that the algorithm stops whenb = 0, i.e., when no big transformations occurred in an outer sweep. The "global" counts S and B of all and of big transformations applied during the execution of Algorithm 1 are only informative here, but they are consulted in the multi-GPU algorithm's stopping criterion.
The multi-GPU implementation
When the input data is larger than the available RAM on a single GPU, it is necessary to either split the workload among the several GPUs, or resort to some out-of-core technique for swapping the parts of data in and out of the GPU as the computation progresses. Here, only the former approach is followed, since it is simpler, more efficient and widely applicable now when the multi-GPU installations are becoming ubiquitous.
There is no single, best and straightforward way of generalizing a single-GPU algorithm to multiple GPUs. For the (ordinary and hyperbolic) SVD, the approach in [14] was to distribute the matrix over the GPUs, shorten the assigned part of the matrix, run the single-GPU algorithm on the shortened part, update the original (non-shortened) columns, and redistribute the parts. Despite its decent performance, such a three-level algorithm suffered from the increased memory usage and some numerical difficulties, both with the stopping criteria and with the relative accuracy obtained.
A different approach is taken here, with a goal of achieving the optimal GPU and CPU memory usage (without any work arrays) and a better accuracy, but with a possible performance penalty induced by transforming the tall and skinny parts of the matrices directly, without any shortening. As no floatingpoint computation is performed on the CPU, while the GPU computation still does not rely on any numerical libraries, it is guaranteed that the results stay bitwise identical in the repeated runs over the same input data with the same parameters and the same number of GPUs.
Algorithm setup
In this subsection the multi-GPU computational environment, the input and the output data distribution across the CPUs and the GPUs, the communication-aware Jacobi strategies, and the algorithm's initialization are explained.
MPI environment
Unlike in [14] , where the multiple GPUs were assumed to belong to the same node, and thus a separate CPU thread of a single process could be dedicated for driving the computation on each GPU, here a more flexible solution has been chosen, by assigning to a GPU a single-threaded MPI [13] process. The
GPUs can thus be selected from any number of nodes, with a different number of them on each node. Also, the GPUs are not required to be architecturally identical or even similar across the processes in an MPI job, as long as they all have enough RAM available.
The count of GPUs, and thus the governing MPI processes (s), for the multi-GPU algorithm is not constrained in principle, save for being at least two (otherwise, the single-GPU algorithm is sufficient), and small enough so that at least two (but for the reasons of performance, a multiple of 32) columns of each matrix are available to each process, when the matrices are divided among them columnwise, as described below. The upper bound on the number of GPUs is a tunable parameter in the code, while the MPI implementation might have its own limit on the number of processes in a job.
The MPI processes need not be arranged in any special topology, i.e., only the predefined MPI COMM WORLD communicator is used. A GPU and its governing process are jointly referred to by the process' rank (r) in that communicator.
It is assumed that the MPI distribution is CUDA-aware in a sense that sending data from the GPU RAM of one process and receiving it in the CPU RAM of another (including the same) process is possible with the standard MPI point-to-point communication routines (i.e., no manual CPU buffering of the GPU data is necessary).
Also, the number of elements of each local submatrix has to be at most INT MAX, which at present is the upper limit on the count of elements that can be transferred in a single MPI operation [6] . That limit is easily circumvented by transferring the data in several smaller chunks, but such chunking has not been implemented since it was not needed for the amount of RAM (16 GiB) of the GPUs used for testing. That issue will have to be addressed for the future GPUs.
Data distribution
The matrices F, G, and Z, and the vectors Σ F , Σ G , and Σ, are assumed to always stay distributed among the MPI processes,
i.e., at no moment they are required to be present in entirety in any subset of the processes. The amount of the CPU and the GPU memory required is identical (i.e., not depending on r), constant throughout the computation, and derivable in advance from m F , m G , m Z := n, and s for all processes.
If n mod s 0, or (n/s) mod 32 0, or m F mod 64 0, or m G mod 64 0, the matrices F, G, and Z are bordered as described in subsection 3.1, but requiring that the enlarged n satisfy n mod (32 · s) = 0.
The columns of the bordered matrices can be distributed evenly among the processes, such that each process is assigned ) is the second stripe in Im(G [r] ).
A similar distribution is in place for Σ F , Σ G , and Σ, where each process holds Σ (r)
G , and Σ (r) in the CPU RAM, and
G , and Σ [r] in the GPU RAM, where each allocation is of length n and is unused in the algorithm until after the last step.
Each process also has its convergence vectors C (r) and
of length n/w, in the CPU and in the GPU RAM, respectively.
Communication-aware Jacobi strategies
The parallel Jacobi strategies, as defined in subsection 2.2, do not contain any explicit information on how to progress from Therefore, given either me or mm strategy table for the order n := n/w (with the stripes seen as the block columns), each process independently computes and encodes the strategy's communication pattern before the start of the algorithm. Such a computation requires O(n 3 ) comparisons, but since n is a small number an unacceptable overhead is not incurred. The computation can be (but it has not been) parallelized on a CPU, e.g., by turning the outer loop over k into a parallel one. The multi-GPU algorithm then references the following encoded mapping when progressing from one step to the next.
After each outermost (multi-GPU) step k of the algorithm, the first of each two joined stripes on the rth GPU has to be transferred to either the first or the second stripe on the t {0} th CPU, for some t {0} . Similarly, the second stripe on the rth GPU has to be transferred to either the first or the second stripe on the t {1} th CPU, for some t {1} , establishing a mapping
where p k,r and q k,r are indices of the first and the second stripe in the entire matrix, 0 ≤ p k,r < q k,r < n, while t {0} k,r t {1} k,r are the MPI ranks. If the p k,r th stripe globally (i.e., the first locally) has to be transferred to the first stripe in the t {0} k,r th process, that is encoded as −(t {0} k,r + 1), else the second stripe of the target is encoded as (t {0} k,r + 1). Similarly, if the q k,r th stripe globally (i.e., the second locally) has to be transferred to the first stripe in the t {1} k,r th process, that is encoded as −(t {1} k,r + 1), else the second stripe of the target is encoded as (t {1} k,r + 1). Adding 1 to the rank ensures that the rank 0 can be encoded as either 1 or −1.
The number of steps in a sweep, denoted by n , is n − 1 for me, and n for mm. The strategy mapping, once computed, can be reused for multiple runs of the algorithm, as long as the strategy kind, n (after bordering), and s do not change between the runs.
Algorithm initialization
First, the CPU memory is allocated in each process, and the data is loaded (e.g., from a file), assuming k = 0, i.e., the rth process contains the p 0,r th and q 0,r th stripes of F, G, and Z.
Then, the device memory is allocated, if it is not already available, and an MPI barrier is reached. Timing of the algorithm includes everything that occurs from this barrier on, except the final, optional deallocation of the device memory.
The constant memory on each GPU is then set up, and the stripes are copied to the device (global) memory, all of which could be, but has not been, done asynchronously. The device is then synchronized, becoming ready for the computation.
It remains to be decided how many sweeps S in Algorithm 1 to allow. As with the pointwise level, there are two obvious choices: either some reasonably large number, e.g., 30 (as in fb), or 1 (as in bo). Now a variant of the multi-GPU algorithm is specified by the selected variant of the single-GPU algorithm, with the outermost strategy and the choice of S added; e.g., ZHZ0-(me-bo, me-fb-me) for me and S = 1, respectively, using ZHZ0-(me-fb-me) at the single-GPU level.
As shown in subsection 5.3.2, the imbalance of the computational time each GPU requires with fb (i.e., one GPU may need more sweeps in Algorithm 1 than another to reach convergence within an outermost step) is significantly detrimental to the overall performance-contrary to the single-GPU case (see subsection 3.5). Unlike there, where such imbalance between the thread blocks' sweep counts is offset by a large number of thread blocks to be scheduled on a small number of multiprocessors, here in the multi-GPU case there is a one-to-one correspondence between the number of tasks to perform and the number of execution units (GPUs) to perform them, so the time required for an outermost step depends on the slowest run of Algorithm 1 within it. Thus, bo is recommended here instead.
The main part of the algorithm
In the pre-iterative part of the algorithm, initFGZ kernel is called (see subsection 3.3), once in each process. Specifically, it is not called again in the context of Algorithm 1. Here, the row offset l in initFGZ is calculated according to the logical (not physical) index of a column, i.e., l := j + p 0,r · w if j < w, and l := j − w + q 0,r · w otherwise, with p 0,r and q 0,r sent to the kernel as parameters. The device is then synchronized, and the stripes of F 0 , G 0 , and Z 0 are ready on each GPU (copying them to the CPU is not needed) for the iterative part of the algorithm. The second principle is to facilitate, as much as practicable, hiding the communication overhead behind the GPU computation. Before a call of Algorithm 1 occurs within an outermost step of a given process, the non-blocking receives to the CPU stripes are started in anticipation of an early finish of the GPU work of the step in the processes that are to send their transformed stripes to the process in question. That way, while the given GPU still computes, its CPU can in theory start or even complete receiving one or both transformed stripes required in the following step. There remains an issue with several slowly progressing processes that might keep the rest of them idle, but at least the point-to-point data transfers can happen soon after the data is ready, not waiting for a massive data exchange with all processes communicating at the same time.
Point-to-point communication and reductions
The third principle is to minimize the memory requirements of both the CPUs and the GPUs by sending the transformed data from the GPU RAM of one process to the CPU RAM of another two. That way, no separate, "shadow" GPU buffers are required to receive the data. The CPU stripes have to be present anyhow, Algorithm 2: The non-blocking receives in the kth step of the rth process. tag := 1; i := 0; // tag tells which stripe from a sender has to be received; i indexes requests foreach o ∈ {0, 1} do // first or second destination's stripe // variable i is assumed to hold the last value assigned to it in Algorithm 2 in the kth step foreach o ∈ {0, 1} do // first or second source's stripe
o , Z // variable i is assumed to hold the last value assigned to it in Algorithm 3 in the kth step MPI Waitall(i, r, statuses); // wait for all pending MPI Irecv and MPI Isend requests to complete
foreach V ∈ {Re, Im} do // real or imaginary part copy V(W) to V(Y) using cudaMemcpy2D; // can be done asynchronously using cudaMemcpy2DAsync to load the inputs and to collect the outputs, so they are reused as the communication buffers, with a penalty of the additional CPU-to-GPU data transfers after the main data exchange.
Matching a stripe to be sent from one process to a stripe that has to be received in another process is accomplished by MPI tags annotating the messages. In the complex case there are twelve stripes in total (six in the real case, since there are no imaginary stripes) to be received by a process in a single outermost step (see Algorithm 2 for their tag numbers).
When a message comes to a process, from any sender, it is only accepted if it bears a valid tag (between 1 and 12, inclusive) and the message data is stored in the corresponding stripe, as in Algorithm 2. Likewise, when a stripe has to be sent, the strategy mapping is consulted to get the destination process' rank, and decide if the stripe should become the first or the second one at the destination. According to that information, the message's tag is calculated as in Algorithm 3.
The CPU part of the algorithm
The pre-iterative, iterative, and post-iterative parts of the al- 
Numerical testing
The purpose of the numerical testing of the single-GPU and the multi-GPU algorithms is twofold. In both the real and the complex case, by accuracy is meant the relative normwise errors of the decompositions of F and G,
computed the same way as in [20] . Namely, X had to be explicitly obtained as Z −1 by solving the linear system ZX = I.
First, the LU factorization of Z with complete pivoting was performed by the LAPACK routine DGETC2 (or ZGETC2), followed by the system solving using the routine DGESC2 (or ZGESC2).
The ensuing matrix multiplications and the Frobenius norm computations were using Intel 80-bit hardware-supported extended precision (REAL(KIND=10) in GNU Fortran), to reduce the effects of the rounding errors on the final result while avoiding the expensive, emulated quadruple (128-bit) precision.
Testing environment and data
The testing environment was the same for all tests, as de- optionally, copy The testing data is synthetic (not from any application domain) and is described in subsection 5.1.2. More information on its availability can be found in the supplementary material.
Testing environment
The testing environment comprises two Intel Xeon Silver 
Testing data
Two datasets have been generated: a "small" and a "large" The real matrix pairs in the small dataset were generated in quadruple datatype (REAL(KIND=16) in Intel Fortran) and rounded to double precision datatype. The same test generation method was employed as in [17] . The required BLAS and LAPACK routines had been adapted as required. The core of the generation are two quadruple-adapted LAPACK testing routines: xLAGSY, that generates a pseudorandom symmetric matrix, here of the full bandwidth, from a given diagonal prescribing the eigenvalues of the matrix; and xLAROR, that here multiplies a given matrix from the left by a pseudorandom orthogonal matrix. The diagonals of Σ F , Σ G , and Λ X were generated by DLARND, a standard LAPACK's pseudorandom num-ber generator, here with the uniform probability distribution on (0, 1), such that only those values returned by it that had been greater than 10 −10 were accepted. Then, UΣ F was generated from Σ F , and VΣ G from Σ G , both using xLAROR, while X was obtained from Λ X using xLAGSY. After that, F := UΣ F · X and G := VΣ G · X (in quadruple). The generator finishes with a call to the preprocessing part, DGGSVP3, of the LAPACK's GSVD method (see [1] and the routine's comments), to make the data usable for comparison with DTGSJA Kogbetliantz-type GSVD routine from LAPACK. Such a comparison, however, has never been made since the DTGSJA computation on any available CPU would be prohibitively expensive, as it was indicated in [17] and recently confirmed (albeit only in the complex case) in [20] .
The complex matrix pairs in both datasets were generated by a much simpler procedure, described in [20] . Namely, each matrix in a pair was generated by a call to ZLATMS LAPACK testing routine as Hermitian and positive definite, with its pseudorandom eigenvalues uniformly distributed in (0, 1).
Testing results for the single-GPU algorithm
When presenting the performance of several variants, a de- 
Performance in the real case
In Figure 2 , the wall execution time of four subvariants of DHZ0 (the fastest of eight variants from Table 1 on the largest matrix pair), the number of outer sweeps for two fastest subvariants on the small real dataset, and the detailed execution time table for DHZ0-(me-fb-me), which is almost always the fastest of the four subvariants, are shown.
In Table 2 the intervals of relative slowdown of other real single-GPU (me-fb-me) variants compared to DHZ0-(me-fb-me) (the reference variant) are given. In Figure 3 , the wall execution time of four subvariants of ZHZ4 (the fastest of eight variants from Table 1 on the largest matrix pair), the number of outer sweeps for two fastest subvariants on the small complex dataset, and the detailed execution time table for ZHZ4-(me-fb-me), which is almost always the fastest of the four subvariants, are shown.
In Table 3 the intervals of relative slowdown of other complex single-GPU (me-fb-me) variants compared to ZHZ4-(mefb-me) (the reference variant) are given. Since ZHZ0-(me-fbme) is the fastest variant on average, differing from ZHZ4 only subtly (in the convergence criterion), ZHZ0 is used instead of ZHZ4 in the multi-GPU algorithm and in subsection 5.2.4.
Accuracy in the real case
In Figure a justification for the shape of the error graphs.
In Table 4 the spectral condition numbers κ 2 (G) in the small real dataset are given, but without a corresponding error figure, since it is almost indistinguishable from Figure 4 . the enhanced dot-products offer a small advantage in accuracy, but not so significant that it would not be offset by a drop in performance when the focus is on the latter.
Testing results for the multi-GPU algorithm
Due to a limited availability of multiple GPUs in the testing environment, only the complex case in the variant 0 was tested on the large dataset and compared with a single-GPU baseline.
Here, the full accuracy testing was skipped, due to the huge computational demands of the matrix inversions and of the error calculation in extended precision. For n = 18432, the relative errors in the corresponding outputs with one and with two GPUs (in both cases in all variants that were timed) were compared and all of them, for both F and G, were found to be less than 1.7 · 10 −12 . Also, for a given matrix, the relative errors in all cases differed less than 10 −13 , indicating that the multi-GPU algorithm did not introduce any instability in the computation.
A single-GPU baseline
In Table 7 the wall time in seconds and the number of the outermost sweeps are shown for the me-fb-me and the me-bome variants, with and without the column sorting, of the ZHZ0 single-GPU algorithm, as a baseline for the comparison with the multi-GPU algorithm. As the similar benefits of the column sorting were obvious in other trial tests runs, the non-sorting version was not considered for subjecting it to the full testing. 
The multi-GPU performance
In Tables 8, 9 
Conclusions
It is clear from Tables 8, 9 , 10, and 11, that the multi-GPU variant B is to be recommended when performance matters.
Dividing the shortest single-GPU baseline wall time from Table 7 for n = 18432 with the wall times from Table 9 for the same n but with a different number of GPUs, it can be derived that the speedup with two GPUs is 1.28×, with four GPUs is 1.90×, and with eight GPUs is 2.66×. These speedups could be even lower on a slower network or if there were fewer than four GPUs per node in the testing environment, but nevertheless they are satisfactorily increasing with the number of GPUs.
It would be interesting to see for what number of GPUs the speedup for this and other test cases peaks or levels out, but that is unfortunately beyond reach of the testing environment.
Future work
Several generalization of the algorithms' design are possible for the other implicit Jacobi-type methods that are to be ported to the GPU(s). One such method is a computation of the generalized hyperbolic SVD (GHSVD) by a modification of the implicit Hari-Zimmermann algorithm, as described in [20] .
Appendix A. Enhanced dot-product computation
In CUDA, the rounding mode can be specified explicitly for each arithmetic operation using the intrinsic functions. That makes an ideal setting for employing a trick from [5] to cheaply compute possibly more accurate real and complex dot-products. 
