Small-to medium-sized polynomial eigenvalue problems can be solved by linearizing the matrix polynomial and solving the resulting generalized eigenvalue problem using the QZ algorithm. The QZ algorithm, in turn, requires an initial reduction of a matrix pair to Hessenbergtriangular form. In this paper, we discuss the design and evaluation of high-performance parallel algorithms and software for Hessenberg-triangular reduction of a specific linearization of matrix polynomials of arbitrary degree. The proposed algorithm exploits the sparsity structure of the linearization to reduce the number of operations and improve the cache reuse compared to existing algorithms for unstructured inputs. Experiments on both a workstation and an HPC system demonstrate that our structure-exploiting parallel implementation can outperform both the general LAPACK routine DGGHRD and the prototype implementation DGGHR3 of a general blocked algorithm.
There are infinitely many linearizations to choose from and they tend to be structured and relatively sparse [1, 4, 14] . In this paper, we exploit the structure of a particular Fiedler linearization to devise an efficient algorithm for HT reduction that has lower arithmetic and communication costs than general HT reduction algorithms. The Fiedler linearization of interest is illustrated below for the case d = 5, with a straightforward generalization to arbitrary degrees (see also Appendix A). The sparsity structure is illustrated in pictorial form in Figure 1 .1. This particular linearization is used by the eigensolver quadeig for dense quadratic (d = 2) eigenvalue problems [5] . It has several desirable properties. First of all, L(λ) is always a lin-+ λ earization of P (λ) in the sense that E(λ)L(λ)F (λ) = diag P (λ), I (d−1)n for some matrix polynomials E(λ) and F (λ) with constant nonzero determinant. As a result, det P (z) = γ · det L(λ) for some nonzero constant γ, so that P and L have the same finite eigenvalues. In fact L(λ) is a strong linearization: it has the same finite and infinite elementary divisors as P [13, Thm. 3] . The pencil L(λ) in (1.2) is easy to construct from the coefficients P i of P (λ), without performing any arithmetic operation. The right and left eigenvectors x and y of P (λ) are easily recovered from those of L(λ). Indeed, if z and w are right and left eigenvectors of L(λ) then
where e j denotes the jth column of the n × n identity matrix and ⊗ denotes the Kronecker product. Also, when P i 2 ≈ 1 for i = 0 : d then L(λ) has good conditioning and backward stability properties [3, 6, 7] . Moreover the pair (A, B) is almost already in HT form since A is in upper block Hessenberg form with 2n − 1 subdiagonals and B is in block upper triangular form. It is straightforward to transform (A, B) using unitary equivalence such that A becomes upper block Hessenberg with only n subdiagonals and B becomes upper triangular. Furthermore we show in Appendix A that prior to the HT reduction, deflation of zero and infinite eigenvalues contributed by singular trailing and leading coefficient matrices can be performed without altering the block structure of (A, B) in (1.2) . This is in part due to the position of P 0 and P d in A and B.
The rest of the paper is organized as follows. The preprocessing step is described in Section 2. A known algorithm for reduction from block HT form to proper HT form is recalled in Section 3 and its effect on the sparsity structure when applied to inputs of the form (1.2) is analyzed. This is combined with recent work on cacheefficient algorithms for the unstructured case in Section 4, and the construction of cache-efficient algorithms tailored for the structured case is described thereafter. In Section 5 we discuss a parallelization scheme suitable for multicore-based systems. Numerical results are presented in Section 6 and Section 7 concludes.
2.
Preprocessing. If P 0 is singular then P (λ) has zero eigenvalues since P (0)x = P 0 x = 0 for x ∈ null(P 0 ). Similarly, if P d is singular then P (λ) has eigenvalues at infinity: these are the zero eigenvalues of the reversal polynomial λ d P (1/λ). Although the QZ algorithm handles eigenvalues at infinity well [15] , numerical experiments by Kågström and Kressner [8] show that if infinite eigenvalues are not extracted before starting the QZ steps then they may never be detected due to the effect of rounding errors in floating point arithmetic. It is then desirable to deflate these eigenvalues before the HT reduction step. Starting from the linearization (1.2) we show in Appendix A how to deflate zero and infinite eigenvalues of P (λ) contributed by singular leading and trailing coefficients, resulting in a pencil A + λB of dimension (d − 2)n + r 0 + r d with the same block structure as in (1.2) but with blocks of different dimensions. Here r i = rank(P i ) for i = 0, d.
To avoid clutter in the notation and for simplicity we assume in what follows that no deflation occurred so that all the blocks in (1.2) are n × n. The block upper triangular matrix B can immediately be reduced to upper triangular form by a QR factorization of its top left block (i.e., P d ). The lower bandwidth of the block upper Hessenberg matrix A can be reduced from 2n − 1 to n by a QR factorization of its bottom right subdiagonal block (i.e., P 0 ). More specifically, let P d = Q d R d and P 0 = Q 0 R 0 be QR factorizations of P d and P 0 , respectively. By defining the transformation matrices 
The pictorial form of (2.2) is illustrated in Figure 2 .2.
Hessenberg-Triangular Reduction via
Bulge-Chasing. Assume that the pair (A, B) has already been preprocessed as explained in Section 2 and that Q and Z are the resulting transformation matrices. The aim of this section is to recall a known algorithm that takes a matrix pair in block HT form and reduces it to proper HT form via a bulge-chasing procedure. This algorithm appears as the second stage of a two-stage HT reduction algorithm for general dense matrix pairs [2, 9] . Algorithm 1 is a reformulation of the bulge-chasing algorithm using our notation. The inputs to Algorithm 1 include the degree, d, and size, n, of the polynomial matrix as well as the preprocessed pair (A, B) and the resulting transformation matrices Q and Z. The algorithm reduces (A, B) from block HT form with n nonzero subdiagonals to proper HT form by systematically applying a carefully chosen set of plane rotations. The product of the rotations correspond to a unitary equivalence transformation of (A, B). The algorithm returns the reduced matrix pair (A, B) in proper HT form along with the updated transformation matrices Q and Z.
In Section 3.1 we define the computational kernels that appear in Algorithm 1. The algorithm itself and its effects on unstructured inputs are described in Section 3.2. In Section 3.3 we present a block partitioning of A and B that comes naturally from the algorithm. The definition of the block partitioning depends on the iteration index c in the sense that the block boundaries shift one step south-east from one iteration to the next. To avoid ambiguities we introduce notation for referring to a specific block in a specific iteration. Finally, the effect of the algorithm on the sparsity structure of a preprocessed pair (A, B) from Section 2 is analyzed in Section 3.4. The result of this analysis forms the basis for our further developments.
Computational kernels introduced in Algorithm 1.
The four computational kernels that appear in Algorithm 1 are described below.
• G ← Zero(A, r, c) Constructs a row rotation G that zeros out A(r, c) using A(r −1, c) as a pivot.
Constructs a column rotation G that zeros out A(r, c) using A(r, c + 1) as a pivot.
Applies a row rotation G to rows r 1 and r 2 of A. The effect is limited to the 
Applies a column rotation G to columns c 1 and c 2 of A. The effect is limited to the row range r 1 : r 2 .
The bulge-chasing algorithm and its effect on unstructured inputs.
Algorithm 1 systematically zeros out the undesired nonzero entries below the first subdiagonal of A. The entries are zeroed one column at a time from left to right (loop over c on line 1) and from the bottom up within each column (loop over r on line 2). The entry A(r, c) is zeroed out by a row rotation involving rows r − 1 and r constructed on line 6.
A row rotation on rows r − 1 and r introduces an undesired nonzero entry (a fill or bulge) in the subdiagonal entry B(r, r − 1). Unless the bulge is removed, the lower triangular part of B would rapidly fill in and its sparsity structure would be destroyed. Therefore, the bulge is removed using a column rotation that acts on columns r−1 and r. The rotation is constructed on line 11 and applied to A, B, and Z on the following lines. This creates, however, a second bulge at A(r + n, r − 1) unless r + n > dn in which case no new bulge appears. At most one bulge is present at any given time in either A or in B. It helps to interpret the algorithm as if it moves a single bulge back and forth between A and B until the bulge finally disappears near the bottom right corner. The precise movement of the bulge is illustrated in Figure 3 of the bulge. In each iteration of the while-loop on line 5, an entry is zeroed out in A, then a bulge is generated and zeroed out in B, and finally the bulge reappears in A. The loop continues to chase the bulge until it disappears. It follows that the blocks on the subdiagonal generally have size n × n, except possibly the first and the last blocks, which may be rectangular. As the partitioning is symmetric, the number of blocks in either dimension is
Block partitioning induced by
The dependence on c manifests itself as a shift by one down the diagonal from one iteration to the next, as illustrated by the red dashed lines in Figure 3 .2 for the second iteration (c = 2).
To unambiguously refer to a specific block relative to a given partitioning/iteration, we introduce the following notation. The block (i, j) X c refers to block (i, j) of matrix X (either X = A or X = B) with respect to the block partitioning in iteration c.
The effect of Algorithm 1 on structured inputs.
We now study the effect of Algorithm 1 on the sparsity structure when the input is a preprocessed matrix pair as defined in Section 2. Figure 3 .3 illustrates the generic sparsity structure of A and B for the case d = 5 and n = 4. Even though Algorithm 1 eventually destroys the structure of all four of the involved matrices (A, B, Q, and Z), the process turns out to be gradual and the evolving sparsity structure can be exploited, as we will soon explain, to reduce both the arithmetic and communication costs.
Informally, the initial sparsity structure shown in Figure 3 .3 evolves during the execution of Algorithm 1 as follows. The dense upper triangular block of B grows slowly by one column to the right for each iteration of the outer for-loop. The dense first block row of A also grows slowly by one row for each iteration of the outer forloop. The upper triangular block of A, however, grows rapidly by one row for each iteration of the inner for-loop. Moreover, this block also grows slowly by one column for each iteration of the outer for-loop.
We refer to one iteration of the inner for-loop on line 2 of Algorithm 1 as a sweep, and Figure 3 .4 illustrates the sparsity structure obtained after the first sweep. Note the appearance of a new column on the triangular block of B, a new row on the dense first block row of A, and a new row and column on the upper triangular block of A. The two remaining sweeps in the current iteration of the outer for-loop each add one row to the triangular block of A, but they have no other effect on the sparsity structures of either A or B. After completing the first iteration of the outer for-loop, the sparsity structure takes on the form illustrated in Figure 3 .5. The effect on the sparsity structure is similar for all iterations. It follows that after d − 2 iterations the two dense blocks of A will fuse together, and after n iterations the triangular block of A will reach the east boundary of the matrix.
These observations suggest that some of the arithmetic operations required to update the matrix B can be eliminated since they operate on zeros and thus have no effect. A fraction of the arithmetic operations required to update the matrix A can be eliminated as well, but in general this cost reduction is more modest since the structure of A is rapidly destroyed.
Cache-Efficient
Algorithms. Algorithm 1 accesses memory with a large stride and performs only a few arithmetic operations for every accessed matrix entry. Following previous work on HT reduction for general inputs [9] , we apply in this section a set of standard techniques that reduce the number of cache and TLB misses as well as the volume of communication between cache and main memory.
In Section 4.1 we explain how to rearrange the steps of Algorithm 1 into a more cache-friendly pipelined algorithm. In addition, the pipelined algorithm exploits the structure of B. In Section 4.2 we further improve on the pipelined algorithm by blocking for the memory hierarchy and avoiding a redundancy that exists in the accumulation of Q and Z.
Pipelined algorithm.
The steps of Algorithm 1 can be reordered such that all bulges stemming from iteration c are chased together in a pipelined fashion. This is made possible thanks to the associativity of matrix multiplication. In floating point arithmetic the different ordering leads to differences in the computed results but both the basic and the pipelined variants are backwards stable. We refer to Section 4.1.2 below for descriptions of the computational kernels that appear in the pipelined algorithm. Left(G, A, r 1 , r 2 , c + 1, dn); The pipelined Algorithm 2 uses the concept of a sequence of rotations, that is, a sequence G = G 1 , G 2 , . . . , G m consisting of m rotations acting on adjacent pairs of rows or columns. More specifically, for an increasing sequence G, the rotation G k acts on rows/columns i + k − 1 and i + k for some constant base index i. In other words, the rotation sequence G affects the m + 1 rows or columns i : i + m. For a decreasing sequence G, rotation G k acts on rows or columns i + m − k and i + m − k + 1. Again, the sequence affects the rows or columns i : i + m. We only encounter decreasing sequences in what follows.
The pipelined algorithm exploits the structure of B and the mostly zero columns at the east side of A. The variables r 1 , r 2 , c 1 , and c 2 keep track of the submatrix (of A or B, depending on the context) that currently contains (or is about to contain) the set of bulges. The outer loop iterates over the columns from left to right. Column c of A is zeroed out on line 3, and the resulting sequence of rotations is immediately applied to A and accumulated into Q. The resulting bulges in B are generated and immediately zeroed out on line 7 followed by further updates of B and accumulation into Z. Entering the loop on line 10, the column rotations are applied to A on line 12. A sequence of row rotations is constructed on line 14 to zero out the resulting bulges in A followed by updates of A and accumulation into Q. The affected portion of B is an identity matrix and therefore any row rotation generates a bulge that is zeroed out by the same rotation applied to the columns. Hence, the net effect on B is nil and there is no need to explicitly generate the bulges or even reference B. The only thing that remains to do is to accumulate the row rotations (turned column rotations) into Z.
Column striping.
Assuming that the matrices are represented in the column-major storage format, a column rotation can be implemented such that it accesses two contiguous streams of memory, but a row rotation will necessarily access memory with a large stride. Since the pipelined algorithm consistently applies sequences of rotations, row rotations can be more efficiently applied by partitioning the columns into stripes and apply a sequence fully to each stripe in turn. In this way, any matrix entries that are brought into the cache by the application of one rotation is likely to still be present in the cache by the time the next rotation is applied. Such a column striping technique does not only improve the cache reuse but also reduces the number of TLB misses [9] .
Computational kernels introduced in Algorithm 2. Algorithm 2
introduces five computational kernels, which we describe below.
• G ← LeftZero(A, r 1 , r 2 , c) Constructs a decreasing sequence G consisting of r 2 − r 1 row rotations that zeroes out all entries except the first one in the column vector A(r 1 : r 2 , c).
The vector is updated in the process.
Applies a decreasing sequence consisting of r 2 − r 1 row rotations G to the submatrix A(r 1 :
Applies a decreasing sequence consisting of c 2 − c 1 column rotations G to the submatrix A(r 1 :
Applies a decreasing sequence consisting of r 2 − r 1 row rotations G to the upper triangular and square submatrix B(r 1 : r 2 , c 1 : c 2 ) while at the same time constructing and applying a decreasing sequence consisting of c 2 − c 1 column rotationsĜ that zeroes out the bulges created by G. The updated submatrix is again upper triangular.
Applies a decreasing sequence consisting of c 2 − c 1 column rotations G to the upper trapezoidal submatrix A(r 1 : r 2 , c 1 : c 2 ) while at the same time constructing and applying a decreasing sequence consisting of r 2 − r 1 row rotationsĜ that zeroes out the bulges created by G. The updated submatrix is again upper trapezoidal.
Blocked algorithm.
The pipelined algorithm has better cache reuse than the basic algorithm yet it still applies only a constant number of floating point op- 
/* * * Phase II: Delayed updates * * * */
21
Accumulate erations (approximately 3 to 6) per accessed matrix entry and iteration of the outer loop. Since each iteration accesses most of the matrices, it follows that the arithmetic intensity will be small for sufficiently large matrices. Significant improvements of the cache reuse therefore necessarily require the reordering of operations spanning several iterations of the outer loop. The main difficulty is to handle the large number of data dependencies that are caused by the intrinsic interleaving of row and column rotations.
Fortunately, the accumulation of rotations into Q and Z is performed exclusively using column rotations and can be effectively reorganized for improved cache reuse. However, much of the updates of A and some of the updates of B involve interleaved row and column updates that cannot be easily separated. Some improvements of the cache reuse are nonetheless possible.
Related works such as [9, 11, 12] describe how to apply a blocking technique to the accumulation of Q and Z as well as to those parts of A and B that are only affected by column rotations. Recent work by one of the authors demonstrates how to block also the updates of A (and B) [10] . For the latter technique to be effective, however, the lower bandwidth should be much smaller than the size of the matrices, which is typically not the case here since the degree tends to be small relative to n and hence the bandwidth, n, is large relative to the size, dn, of the matrices.
Let us now isolate the parts of the matrices which are updated only by column rotations from those parts which are updated by both row and column rotations. This will help us understand the blocked algorithm that follows. As mentioned above, all the accumulations into Q and Z are performed using column rotations. The processing of the ( The blocked Algorithm 3 improves on the pipelined algorithm from Section 4.1 in part by introducing blocking and in part by avoiding a redundancy that occurs in the accumulation of Q and Z, the latter of which is explained in Section 4.2.1. The blocked algorithm zeroes out columns of A from left to right one column panelč :ĉ at a time. Each iteration of the outer loop on line 2 reduces one panel. Each block iteration consists of two phases. In Phase I ("panel factorization"), the columnsč :ĉ are zeroed out and the resulting bulges are chased using what amounts to the pipelined algorithm. This is accomplished by the outer for-loop on line 4. Only the submatrices A(č + 1 : dn,č + 1 : dn) and B(č + 1 :č + n,č + 1 :č + n) are updated in Phase I since the other updates are not required to construct the rotations. In Phase II ("delayed updates"), the rotations are regrouped and accumulated into small unitary matrices, which are then applied using matrix multiplication to Q, Z, A(1 :č,č + 1 : dn) and B(1 :č,č + 1 :ĉ + n).
Elimination of redundancy.
The transformation matrices Q and Z obtained from the preprocessing step (Section 2) are actually identical except for their top left n×n block. Due to the identities on the main block diagonal of B, many of the row and column rotations are identical too (see Section 4.1). As it turns out, while Q and Z gradually diverge over the course of the pipelined and blocked algorithms, they maintain a shrinking number of identical columns. Specifically, the number of columns shared by Q and Z shrinks by one column for every column reduced. Immediately before reducing column c, the matrices Q and Z can be partitioned into column blocks
such that both Q 1 and Z 1 have min{dn, n + c − 1} columns and Q 1 = Z 1 (in general) and Q 2 = Z 2 (always). During the reduction of column c, a decreasing sequence of rotations will engage the right-most parts of Q 1 as well as the left-most column of Q 2 . Similarly, another rotation sequence will engage the right-most parts of Z 1 and the left-most column of Z 2 . All other sequences affect entries exclusively in Q 2 and Z 2 and, crucially, do not touch the left-most columns in neither Q 2 nor Z 2 . Moreover, the same rotations are applied to both Q 2 and Z 2 . It follows that at the end of iteration c, the left-most columns of Q 2 and Z 2 will differ but the other columns will remain identical. This allows us to avoid the redundant accumulation of rotations into both Q 2 and Z 2 by only accumulating into Q 2 . Prior to iteration c, we copy the left-most column of Q 2 into the corresponding column of Z 2 . See 
Regrouping rotations for increased cache reuse.
We increase the cache reuse in Phase II by adopting a known technique for applying multiple overlapping sequences of rotations [12] . This technique exploits the commutativity of rotations acting on disjoint sets of rows or columns and rearranges the rotations into diamond-shaped groups, each of which can be applied cache-efficiently.
However, we cannot apply this technique directly since the rotations obtained from one bulge-chasing sweep do not form one but several non-overlapping sequences. Fortunately, these shorter sequences can be merged by inserting identity rotations to artificially create one long sequence.
Consider the first three iterations of the outer for-loop over c for the case d = 4, n = 5. Let G c k denote the row rotation constructed in iteration c to annihilate a k,c using a k−1,c as a pivot. This rotation acts on rows k − 1 and k. Let I k denote an identity rotation acting on rows k −1 and k. Figure 4.2(a) illustrates the row rotations and their dependencies for the first three column reductions. Each vertex represents a rotation and each arc represents a dependence caused by non-commutativity. Any topological ordering of the graph corresponds to a valid reordering of the rotations (and vice versa). The region with a dashed outline is an example of how to group rotations into a diamond-like shape. Such a group has the property that there is no directed path leaving and re-entering the region. This implies that the rotations in the group can be applied in sequence without any other rotation in between. In particular, the rotations in a group can be explicitly accumulated into a unitary matrix and applied using matrix multiplication. Fig. 4.2 . Row rotations and their dependencies for the first three iterations of the case d = 4, n = 5. In (a), the rotations G c k for c = 1, 2, 3 are shown together with their dependencies. The region with a dashed outline exemplifies how to create a group of rotations that can be applied separately from the others. In (b), nine identity rotations have been added to make the graph structure more regular. Any ordering of the (proper) rotations allowed by (b) is also allowed by (a). Hence, replacing (a) by (b) does not compromise correctness.
The graph in Figure 4 .2(a) has "gaps" (e.g., rotation G 1 7 does not exist). In Figure 4 .2(b), the gaps have been filled with identity rotations that have no effect other than to make the structure of the graph more regular. The graph in (b) is a restriction of the graph in (a) in the sense that any topological ordering of (b) contains a topological ordering of (a). The graph structure in Figure 4.2(b) is identical to the one in [12] and hence the technique described therein applies directly. Another viewpoint is illustrated in Figure 4 
Accumulating transformations.
The sparsity structure of all the transformation matrices (Q, Z, and the small matrices U i in Algorithm 3) are thickly banded matrices with some additional structure within the band. More precisely, they each take the form of a skyline matrix where the top and bottom skylines are non-decreasing. Formally, let t j and b j denote the row index of the first respectively last (structural) nonzero in column j. Then t j+1 ≥ t j and b j+1 ≥ b j . By explicitly keeping track of the t j and b j during the accumulation procedure, it is straightforward and inexpensive to determine the minimal range of rows to which a given rotation must be applied and thereby optimally reduce the number of arithmetic operations.
Parallel Algorithm.
Even though our proposed blocked algorithm requires less computation and communication through the memory hierarchy than the conventional algorithm it is still quite costly in terms of the number of flops with an arithmetic complexity of O(d 3 n 3 ). We have therefore constructed a parallel algorithm for multicore-based systems implemented using threads.
The parallel algorithm is structured similarly to Algorithm 3 and consists of iterations over panels. Each iteration consists of two phases, each of which is executed in parallel. Successive phases are separated with a barrier synchronization. The parallelization of Phase II is described in Section 5.1 and the (more technical) parallelization of Phase I is described in Section 5.2.
Parallel Phase II.
The parallel algorithm for Phase II consists of an accumulation step followed by an update step with a barrier synchronization in between. In the accumulation step, the rotations are regrouped and each rotation group is accumulated into a small unitary matrix. The accumulations of different groups are independent and the accumulation step is therefore perfectly parallel.
In the update step, the groups containing shared rotations are first applied to Q. This is then followed by the copying of a set of columns from Q to Z as part of the elimination of redundancy. The update step ends by applying the rotation groups that contain non-shared rotations to Q and Z. Also the update step is perfectly parallel in the sense that the rows can be independently updated. However, the load is not uniformly distributed over the rows due to the sparsity structure of Q and Z. We address this problem by calculating ahead of time the number of flops per row and partition the rows of Q and Z such that the flops are approximately balanced over the threads.
Parallel Phase I. The parallel algorithm for Phase I consists of multiple iterations, each of which interleaves row and column rotations with the zeroing out of matrix entries.
The steps in each iteration of the loop in Phase I (see Algorithm 3) can be rewritten in terms of coarse operations acting on the blocks of the natural block partitioning. In doing so we also take the opportunity to further increase the cache reuse by exploiting the fact that all the blocks in the upper triangle are subject to both a row sequence and a column sequence. By fusing the application of these two sequences together into a single kernel we can reduce its communication volume by up to 50%. The new kernel is defined as follows:
• RightLeft(G, G, A, r 1 , r 2 , c 1 , c 2 ) Applies a decreasing sequence of row rotations G and a decreasing sequence of column rotations G to the submatrix A(r 1 : r 2 , c 1 : c 2 ). The number of row rotations is r 2 − r 1 and the number of column rotations is c 2 − c 1 . The new version of Phase I is given in Algorithm 4. With each operation in Algorithm 4 executed by a single thread, the number of tasks would be limited to O(d 2 ) and the average degree of concurrency would be only O(d). Therefore, a large degree would be necessary to saturate a typical multicore-based system. Since the degrees are predominantly small in practice we further decompose each operation into finer grained tasks. This in turn requires an additional set of computational kernels, which we define in Section 5.2.1. The task decomposition, dependencies, and scheduling of the tasks is described in Section 5.2.2.
Task decomposition of Phase I.
The task decomposition that we use is to a great extent induced by a partitioning of each block into square sub-blocks defined by a sub-block size b and aligned with the top left corner of the block. This in turn induces a partitioning of the sequences of rotations into sub-sequences of length b. The reason for square sub-blocks is to ensure that a sub-block of the row rotations that become column rotations align with the sub-block partitioning of the columns. The LeftZero operation involves little work and is not decomposed into tasks in order to avoid the associated scheduling overhead. (Moreover, the operation is inherently sequential.) Instead it is merged with the LeftRightZero operation that follows.
The RightLeft operation takes two sequences as input: one row sequence and one column sequence. Figure 5.1(c) illustrates the kernel that forms the basis of the implementation. Shown is one of the sub-blocks and its immediate surroundings. The kernel applies to the sub-block first the column sequence and then the row sequence. Note that the column (row) rotations affect one additional column (row) to the east (south). The region affected by the row rotations has also been shifted one column to the east since the left-most column in the sub-block is not fully up-to-date with respect to the column rotations. Special cases occur at the south and east boundaries of the block since the updates do not extend beyond the block. The task graph and the relationships between the tasks and the sub-blocks are illustrated in Figure 5.2(a) . With n b = n/b sub-blocks in either dimension, the task graph contains n 2 b tasks and has depth 2n b − 1. The LeftRightZero operation is built upon the two computational kernels illustrated in Figures 5.1(a-b) and its task graph is illustrated in Figure 5.2(b) . The SubLeftRightZero kernel applies to the sub-blocks on the diagonal and the SubLeftRight kernel to the sub-blocks above the diagonal. Again, the regions affected by row, respectively, column rotations are not perfectly aligned to the sub-blocks, as is evident from Figure 5 .1. With n b sub-blocks in either dimension, the task graph contains n b (n b + 1)/2 tasks and has depth 2n b − 1.
The RightLeftZero operation is built upon the three computational kernels illustrated in Figures 5.1(d-f) . Interestingly, even though the name and functionality of this operation is similar to LeftRightZero, they are not symmetric and as a consequence their task decompositions are rather different. The task graph of RightLeftZero is illustrated in Figure 5 .2(c) and the tasks on the diagonal use the kernel SubRightLeftZero. Each sub-block in the strictly upper triangular part contains one SubRight task (closer to the bottom right corner) and one SubLeft task (closer to the upper left corner). With n b sub-blocks in either dimension, the task graph contains n 2 b tasks and has depth n b + 1.
Scheduling Phase I to promote cache reuse.
In the scheduling of Phase I we also attempt to increase the cache reuse. Specifically, the parallel implementation of Phase I begins by explicitly constructing the task graph for the first iteration. The tasks are then statically scheduled using the critical path heuristic (largest height first). We then reuse the same schedule for all subsequent iterations in the phase. One consequence of this approach is that each thread is statically assigned a subset of the tasks and by extension also a subset of the sub-blocks. This is what increases the cache reuse. Figure 5 .3 illustrates the task graph associated with one iteration of the first instance of Phase I for a polynomial of degree four. The sub-block size is set to one fourth of the block size. The labels on the nodes indicate the kernel used and are the same as the labels used in Figure 5 of 5 for a 2.2 average degree of concurrency. (Since consecutive iterations partially overlap in practice, the average degree of concurrency is actually slightly larger in both cases.) 6. Experiments. We developed two separate implementations for different types of systems: a sequential implementation for workstations following Algorithm 3 and a thread-based implementation based on the parallel algorithm described in Section 5. The sequential algorithm is intended for use on workstations via a MATLAB MEX interface and takes advantage of potential implicit parallelism by calling MATLAB's BLAS library. The parallel implementation is intended for use on multicore-based HPC systems.
We measured the performance of our codes on two systems (one workstation and one HPC system) with different characteristics:
• Workstation. The workstation has an Intel Core i7-2600 processor with four cores and a nominal clock frequency of 3.4 GHz. • HPC system. The HPC system is a cluster consisting of interconnected shared-memory nodes. Each node consists of four AMD Opteron 6238 processors each with twelve cores split into two NUMA domains. The nominal clock frequency is 2.6 GHz. Both systems run a Linux-based operating system. All experiments were performed in double precision (64-bit) real floating point arithmetic. All execution times are given in seconds.
Results from the workstation.
On the workstation we measured the execution time of our MEX code with implicit parallelism via the parallel BLAS used in MATLAB and compared it against the HT reduction function hess in MATLAB, our parallel code with p ∈ {1, 2, 3, 4} threads, and a prototype implementation, DGGHR3, of the blocked one-stage HT reduction algorithm described in [9] . The results were obtained using MATLAB version R2011b for Linux.
The results are summarized in Table 6 .1 for n ∈ {500, 1000} and d ∈ {2, 3, 4}. The times are given in seconds. The column labeled hess/mex shows the ratio of the execution time of the hess function to the execution time of our MEX code. The Table 6 .2 Execution time of the prototype implementation DGGHR3 of the blocked one-stage HT reduction algorithm described in [9] linked with multi-threaded BLAS on the workstation. column labeled hess/p = 4 shows the ratio of the execution time of the hess function to the execution time of our parallel code with p = 4 threads. The size of the improvement depends on both n and d. The observed improvements of the MEX code ranges from 4.0 (for n = 1000, d = 2) to 6.8 (for n = 500, d = 4). The observed improvements of the parallel code with four threads range from 6.3 (for n = 500, d = 2) to 12.3 (for n = 1000, d = 4). As expected, the increasing sparsity of the linearization for large degrees leads to increased improvements over the generic hess function. A larger size of the coefficient matrices does seem to slightly decrease the improvement of the MEX code while on the other hand increasing slightly the improvement of the parallel code.
We obtained a prototype implementation, DGGHR3, of the blocked one-stage algorithm presented in [9] , which represents the state of the art for generic HT reduction. The execution times for a varying number of threads are shown in Table 6 .2, which can be directly compared column by column to Table 6 .1 (parallel results). DGGHR3 significantly outperforms the hess function in MATLAB and performs only slightly worse than our structured code. The ratios of the execution times of DGGHR3 to our parallel code range from 0.9 (n = 1000, d = 2, p = 1) to 2.8 (n = 500, d = 4, p = 4) with a mean of 1.8.
6.2.
Results from the HPC system. On the HPC system we measured the performance of our parallel code and compared it against the LAPACK DGGHRD generic HT reduction routine as well as DGGHR3. The matrix B must be upper triangular on input, so we performed a structure-exploiting pre-processing step that reduces the top left n × n dense block of B to upper triangular form before calling DGGHRD or DGGHR3.
Total execution time.
The DGGHRD routine in LAPACK is based on Level 1 BLAS operations without explicit parallelism and thereby does not scale on multicore-based systems. These measurements should be compared to our parallel code solving problems of the same size with 1 ≤ p ≤ 24 threads. The time measurements include both the pre-processing step and the bulge-chasing part of the reduction (only the latter of which is parallel) and are summarized in Table 6 .4. The row with p = 1 can be directly compared to the timings in Table 6 .3 and shows the effects of blocking and exploiting structure. The ratios of the execution times range from 3.4 (for n = 2000, d = 4) to 6.5 (for n = 1000, d = 4).
The subsequent rows in Table 6 .4 show the parallel execution times for an increasingly large number of threads (up to two full processors) in our parallel code. In all cases except for n ≥ 1500 and d = 2, the parallel code reaches its scalability limit in the sense that the optimal number of threads is less than the maximum of 24 used in the experiments.
The largest observed improvement of the parallel code over DGGHRD is 28 (for n = 1000, d = 4, p = 16). The largest relative speedup of the parallel code (i.e., relative improvement over p = 1) ranges from 2.4 (n = 500, d = 2, p = 12) to 6.7 (n = 2000, d = 4, p = 16). Table 6 .5 shows the parallel execution time of the DGGHR3 prototype implementation with up to 12 threads (one full processor) in the multi-threaded version of the BLAS library. In all cases except for small values of n, the code reaches its scalability limit. The code scales modestly, which is expected considering that the code is only implicitly parallelized via the BLAS.
Comparing the minimum execution time of DGGHR3 with the minimum execution time of our parallel code, we find that the relative improvement of our code ranges from 2.6 (n = 2000, d = 2, with p = 8 threads in DGGHR3 and p = 18 threads in our parallel code) to 5.1 (n = 500, d = 4, with p = 4 threads in DGGHR3 and p = 12 threads in our parallel code). Table 6 .5 The parallel execution time on the HPC system of the DGGHR3 prototype implementation of the blocked one-stage algorithm described in [9] (including both the pre-processing step and the actual HT reduction). Despite most of the flops being accounted for by Phase II, it is typically Phase I that consumes the majority of the execution time due to its low arithmetic intensity and greater complexity. Figure 6 .2 shows the performance in Gflops, i.e., the ratio of the floating point operation counts from Figure 6 .1 and the measured execution times for a varying number of threads. The graphs on the left side correspond to Phase I and the graphs on the right correspond to Phase II. It is observed from the latter that Phase II scales up to p = 18. The graphs for Phase I also indicate good scaling up to p = 18 but there is also a sharp decrease in the performance towards the second half of the iterations.
We investigated the source of the performance degradations observed in Phase I by measuring the per-thread time spent in computational kernels per instance of Phase I. More specifically, consider any instance of Phase I and let T k denote the time that thread k spent computing in that instance. Then the total cost of computation is given by p k=1 T k . If we assume that the work could have been perfectly distributed over the p threads then 1 p p k=1 T k is a lower bound for the parallel execution time. The most heavily loaded thread computes for a total of max k T k time units. It follows that the ratio
in the range (0, 1] is a measure of the load imbalance with L = 1 indicating perfect load balance. In general, the most heavily loaded thread has 1/L times the optimal load. The amount of load imbalance for the case n = 2000, d = 4 and a varying number of threads is shown in Figure 6 .3. From the strong correlation between the increasing load imbalance and the decreasing performance of Phase I shown in Figure 6 .2(left), we conclude that the load imbalance is a primary explaining factor and runtime autotuning of the sub-block size might improve the performance further.
Conclusion and future work.
We have identified a special Fiedler linearization of matrix polynomials of arbitrary degree, which shares many of the properties of standard companion linearizations but also allows (a) the deflation of some of the zero and infinite eigenvalues without affecting the block structure of the linearization and (b) the exploitation of the sparsity structure in the HT reduction. We have presented a backward stable high-performance parallel algorithm for the HT reduction of such linearization. Our algorithm exploits the sparsity structure of the linearization and the redundancy in the computation of the transformation matrices, and thereby requires fewer flops than existing algorithms for unstructured pairs of matrices, including the state of the art HT reduction in [9] . Special care has been taken to improve cache reuse. Experiments on both a workstation and an HPC system have demonstrated that our structure-exploiting parallel implementation can outperform both the general LAPACK routine DGGHRD and the prototype implementation DGGHR3 of a general blocked algorithm.
Future work include runtime auto-tuning of the sub-block size to better balance the load, alternative parallelization schemes that address the trade-off between cache reuse and concurrency differently, and a hybrid approach that switches to a (parallel variant of) DGGHR3 during the last iterations where the structure is less pronounced. of the DGGHR3 prototype implementation. This research was conducted using the resources of the High-Performance Computing Center North (HPC2N).
Appendix A: Deflation of infinite and zero eigenvalues on special Fiedler linearization. If either one or both of P d and P 0 are rank deficient, then it is possible to deflate structurally induced zero and infinite eigenvalues prior to the HT reduction. This is desirable since these eigenvalues might otherwise go partially undetected due to roundoff errors accumulated during the QZ algorithm. We show in this appendix how to transform the dn × dn Fiedler pencil L(λ) in (1.2) into block upper triangular form
where the dn × dn matrices A and B are partitioned conformably. Here r 0 denote the rank of P 0 and r d the rank of P d . Note that if A 11 (or B 33 ) is singular then det P (λ) = γ det L(λ) ≡ 0 and hence P (λ) is singular. When A 11 and B 33 are nonsingular, (A.1) reveals n − r 0 zero eigenvalues and n − r d infinite eigenvalues. 
where X m,n denotes a dense matrix of size m × n and I n is an identity matrix of size n × n. We illustrate the proposed deflation procedure for the case d = 4, which is large enough to clearly exhibit the general pattern. The procedure can be readily generalized to any degree d ≥ 3. However, the extreme case d = 2 requires a little bit of special treatment, which we defer to the end of this appendix.
The appendix is outlined as follows. Fiedler linearizations and our special choice are defined in Section A.1. The deflation procedure consists of two steps:
1. Deflation of zero eigenvalues induced by P 0 described in Section A.2.
2. Deflation of infinite eigenvalues induced by P d described in Section A.3. The special treatment of the case d = 2 is described in Section A.4.
A.1. Fiedler linearizations. With the notation
and any permutation σ = (i 0 , i 1 , . . . , i d−1 ) of the indices (0, 1, . . . , d − 1) we define the associated Fiedler pencil by [1] 
The pencils L σ (λ) are all linearizations of P (λ) [1] . The Fiedler pencils we are interested in corresponds to the permutation σ = (d − 1, . . . , 3, 2, 0, 1), that is,
A.2. Deflation of zero eigenvalues induced by P 0 . If P 0 has full rank then we have nothing to do in this step. Therefore, assume that P 0 has rank r 0 < n and compress its rows upwards by pre-multiplication with a carefully crafted unitary matrix U * , i.e., choose a unitary matrix U such that
where P 0 has r 0 rows and full rank. (This can be accomplished for instance by using a singular value decomposition of P 0 or a QR factorization with column pivoting.) Transform (A, B) , where A + λB is the where the last block row and column has n − r 0 rows and columns, respectively. This effectively deflates n − r 0 zero eigenvalues at the bottom right corner. After removing the last n − r 0 rows and columns (colored red above) we obtain a reduced problem that takes the form
where each black block has size n × n, the red block has size r 0 × r 0 , and each blue block has size n × r 0 or r 0 × n. This completes our description of the first step of the deflation procedure.
A.3. Deflation of infinite eigenvalues induced by P d . The input to this step takes the form (A.4) for d = 4 and the aim is to deflate infinite eigenvalues induced by a rank deficiency in P d . If P d has full rank then there is nothing to do. Therefore, assume that P d has rank r d < n and compress the columns of P d towards the right by post-multiplication with a carefully crafted unitary matrix U , i.e., choose a unitary matrix U such that
where P d has r d columns and full rank. Transform (A, B) by post-multiplying the first block column with U . This changes the (2, 1) block of A from −I to − U . Restore the structure of A by pre-multiplying the second block row with U * . This simultaneously changes the (2, 2) block of B from I to U * . Continue restoring identities until finally the Then transform the first three block rows of A and B by a pre-multiplication with Q * . This changes the structure of (A, B) to
R X X X X X X X 0 X X X X X X X 0 X X X X X X X The sizes of the blocks in (A.6) are the same as the blocks in (A.4), except for the colored blocks which have shrunk. The red blocks have size r d × r d and one dimension of each blue block has reduced from n to r d . It is straightforward to verify that the block sizes agree with the output specification (A.2). This completes our description of the second step of the deflation procedure.
A.4. The extreme case d = 2. The extreme case d = 2 requires a little bit of additional care. The first step (Section A.2) of the deflation procedure can be applied the same as before. The output from the first step takes for d = 2 the form A = P 1 −U 1 P 0 0 , B = P 2 0 0 I , where the last block row and column have r 0 rows and columns. What concerns us is what transpires in the second step after having compressed the columns of P 2 . According to Section A.3 we should next compute a QR factorization of the top left 3 × 1 block submatrix of A. But for d = 2 the third block in that submatrix is not −I as in (A.5) but instead a dense block P 0,1 of size r 0 × (n − r 2 ). For d > 2 that block would have been equal to −I n−r d . What is important for the effect of the QR factorization on the sparsity structure is the number of leading nonzero rows in that 3×1 block submatrix. Special care for d = 2 is therefore needed if r 0 > n−r 2 . In that case the number of leading nonzero rows is greater than for the case d > 2 and thus a direct application of the second step would lead to more fill. To avoid this problem we first reduce the dense (tall-and-skinny) P 0,1 block to upper triangular form, which in itself has no effect on the generic sparsity structure. But it crucially reduces the number of nonzero leading rows in P 0,1 from r 0 to the desired n − r 2 and thereby overcomes the problem.
A.5. Algorithm. Algorithm 5 summarizes the discussion above by formalizing the deflation procedure. 
