Abstract-This paper introduces the first asynchronous, task-based formulation of the polar decomposition and its corresponding implementation on manycore architectures. Based on a new formulation of the iterative QR dynamically-weighted Halley algorithm (QDWH) for the calculation of the polar decomposition, the proposed implementation replaces the original and hostile LU factorization for the condition number estimator by the more adequate QR factorization to enable software portability across various architectures. Relying on fine-grained computations, the novel task-based implementation is also capable of taking advantage of the identity structure of the matrix involved during the QDWH iterations, which decreases the overall algorithmic complexity. Furthermore, the artifactual synchronization points have been weakened compared to previous implementations, unveiling look-ahead opportunities for better hardware occupancy. The overall QDWH-based polar decomposition can then be represented as a directed acyclic graph (DAG), where nodes represent computational tasks and edges define the inter-task data dependencies. The StarPU dynamic runtime system is employed to traverse the DAG, to track the various data dependencies and to asynchronously schedule the computational tasks on the underlying hardware resources, resulting in an out-of-order task scheduling. Benchmarking experiments show significant improvements against existing state-of-the-art high performance implementations (i.e., Intel MKL and Elemental) for the polar decomposition on latest shared-memory vendors' systems (i.e., Intel Haswell/Broadwell/Knights Landing, NVIDIA K80/P100 GPUs and IBM Power8), while maintaining high numerical accuracy.
INTRODUCTION
T ODAY'S most powerful supercomputers are composed of fat computational nodes over-provisioned by floatingpoint units [1] , which may distort the balance of characteristics systems with respect to other hardware resources, such as memory per core, aggregated bandwidth, I/O nodes, interconnect, etc. Although scientific applications are often memory-bound with low arithmetic intensity kernels, and therefore limited by the bus bandwidth, we revisit the polar decomposition, an important dense linear algebra (DLA) algorithm, which can make an effective use of the predominant floating-point units provided by the current state-of-the-art hardware commercial chips (for instance, Intel Knights Landing and NVIDIA Pascal P100). The polar decomposition consists in decomposing a dense matrix A ¼ U p H, where U p is the orthogonal polar factor and H is the positive semi-definite Hermitian polar factor. One way to achieve this decomposition is the QR-based dynamically weighted Halley (QDWH) iteration introduced by Nakatsukasa et al. [2] . The polar decomposition is a key algorithm for various scientific applications, e.g.; in continuum mechanics to decompose stress tensors and to simulate the deformation of an object; in aerospace computations [3] during strapdown inertial navigation; and other aerospace systems to describe the rotation of one coordinate system relative to a reference coordinate system; and in chemistry [4] to help the understanding of properties in terms of electron pair (chemical bond) transferability, etc. Further applications are also reported by Higham in [5] .
This paper describes the first asynchronous, task-based formulation of the QDWH-based polar decomposition and its corresponding implementation on manycore architectures. The standard algorithm requires up to six iterations to converge in double precision and to calculate the polar factor, depending on the condition number of the input matrix, involving O(n 3 ) matrix operations at each operation. Its algorithmic complexity may, therefore, be prohibitive. Nevertheless, this challenge can be compensated for the high level of concurrency exposed at each iteration [6] , [7] . This paper proposes to considerably improve previous works [2] , [6] , [7] from two distinct algorithmic and implementation perspectives. The former consists in replacing the hostile LU-based matrix condition number estimation by an adequate QR-based implementation for broader code portability across vendor architectures. This enables to remove the dependence on the partial pivoting, which may be poorly supported, for example, on high-end GPU-based systems, without increasing the overall algorithmic complexity. The latter algorithmic improvement has twofold aspects: (1) it permits to take advantage and exploit the structure of the identity matrix involved at each QR-based QDWH iteration, which significantly reduces the algorithmic complexity, thanks to fine-grained computations associated with a dynamic asynchronous execution; and (2) the artifactual synchronization points are weakened, unveiling look-ahead opportunities for better hardware occupancy. Although this paper studies the performance impact of weakening synchronization points through fine-grained computations in the context of dense linear algebra, the authors in [8] , [9] have also demonstrated it for the Conjugate Gradient solver. The overall QDWH-based polar decomposition can then be represented as a directed acyclic graph (DAG), where nodes represent computational tasks and edges define the inter-task data dependencies. We employ the StarPU dynamic runtime system to unroll the DAG. StarPU tracks the various data dependencies and asynchronously schedules the computational tasks on the underlying hardware resources. The algorithm may then be executed with an out-of-order task scheduling.
StarPU increases user-productivity by establishing a separation of concerns consisting in hiding the hardware complexity from library developers. This enables end-users to target various hardware architectures with a single source code. Extensive benchmarking experiments show significant improvements against existing state-of-the-art high performance implementations (i.e., MKL and Elemental) for the polar decomposition on latest shared-memory systems (i.e., Intel Haswell/Broadwell/Knights Landing, NVIDIA K80/P100 GPUs and IBM Power8), while maintaining high numerical accuracy for well and ill conditioned matrices.
The remainder of the paper is organized as follows. Section 2 presents related work. Section 3 highlights our research contributions. Section 4 briefly recalls the polar decomposition and its main computational phases. Section 5 describes the algorithmic paradigm shift that current state-of-the-art DLA software libraries have witnessed, as implemented in LAPACK [10] , MAGMA [11] and more recently in PLASMA [12] and Chameleon [13] libraries. The StarPU dynamic runtime system is illustrated in Section 6 as a scheduling engine for task-based programming model. The implementation details of the high performance task-based asynchronous QDWH are given in Section 7. Section 8 provides new upper-bound for the QDWH algorithmic complexity. Numerical accuracy, implementations assessments and performance comparisons with existing state-of-the-art DLA software are given in Section 9 and we conclude in Section 10.
RELATED WORK
The polar decomposition algorithm has been well studied in the last three decades in terms of complexity and numerical robustness/accuracy [14] , [15] , [16] , [17] , [18] , [19] , [20] . Initially designed with Newton's method based on an explicit matrix inversion calculation, numerical instability was reported, especially in presence of ill-conditioned matrices. An algorithm based on Halley's iteration was then introduced with asymptotically cubic rate of convergence in obtaining the final polar factor. To solve the numerical accuracy issues due to the matrix inversion computation, an inverse-free QR-based dynamically weighted Halley has finally been proposed by Nakatsukasa et al. [2] . However, the polar decomposition has not been implemented in a high performance computing environment, most likely due to its excessive algorithmic complexity, which does not reflect a practical assessment of the method. More recently, Nakatsukasa and Higham [21] have shown that QDWH can be used as a building block for the dense symmetric eigensolver and singular value decomposition [22] , [23] , which has brought to the fore further research directions. Indeed, previous works from the authors have implemented QDWH-based singular value decomposition on hardware accelerators [6] and distributed-memory systems [7] , where the calculation of the polar factor is the most-time consuming phase. The aforementioned implementations have somewhat demonstrated limited performance scalability on multiple GPUs and large clusters. This is mostly due to the low hardware resource occupancy achieved by the inherent bulk synchronous programming model (BSP), which both implementations rely on for parallel performance. By the same token, it is also noteworthy to mention that the high performance software library Elemental [24] provides a QDWH implementation for distributedmemory systems.
Last but not least, the polar decomposition can alternatively be computed through an SVD as follows:
This strategy has shown some performance scalability issues, due to the slow convergence of the QR algorithm on the condensed bidiagonal form [7] .
CONTRIBUTIONS
This section enumerates our contributions, given the previous related work section, which represent the crux of the paper: (1) improve the standard QDWH algorithm by replacing the LU-based condition estimator with QR, without increasing the overall algorithmic complexity, while enabling software portability across hardware architectures; (2) develop the first task-based QDWH implementation based on fine-grained computations, which enables to exploit the identity data structure during the QDWH iterations, reducing up to 20 percent the overall complexity; (3) rely on a dynamic runtime system (i.e., StarPU) to asynchronously schedule the computational tasks among available processing units in order to improve hardware occupancy; and (4) provide a comprehensive performance assessment and comparisons on a myriad of high-end architectures.
THE QDWH-BASED POLAR DECOMPOSITION
This section focuses on the inverse-free QDWH-based iterative procedure to calculate the polar decomposition [2] , [21] of a matrix A 2 R mÂn ðm ! nÞ, such that A ¼ U p H. To ensure the paper is self-contained, we briefly recall the convergent sequence as follows, with A the initial matrix:
where U p ¼ lim k!1 ðU k Þ, and
then be found with the two steps formula:
The main goal consists in calculating the optimal parameters (a k ; b k ; c k ) so that cubical convergence can be attained during the QDWH iteration. The expression of the parameters ða k ; b k ; c k Þ can be written as follows:
with b ¼ 1=kA À1 k 2 . For ill-conditioned matrices, the number of iterations k can vary up to six. We refer to [2] for further details on the theoretical proof. When U k becomes wellconditioned, it is possible to replace Eq. (2) with a Choleskybased implementation as follows:
This algorithmic switch at runtime allows to further speed up the overall computation, thanks to a lower algorithmic complexity, while still maintaining numerical stability. In practice, this transition is monitored by setting a threshold for c k . Once convergence is reached, the polar factor is U p ¼ U k and the positive semi-definite Hermitian polar factor corresponds to H ¼ U > p A. All in all, the number of floating-point operations depends on the number of iterations required to converge, which is dictated by the condition number of the original matrix problem. Typically, for ill-conditioned matrices, QDWH will perform three QR-based QDWH iterations (Eq. (2)), followed by three Cholesky-based QDWH iterations (Eq. (5)), besides executing other compute-intensive Level 3 BLAS operations, i.e., triangular solves, applications of Householder reflectors, matrix-matrix multiplications, etc.
In fact, although the QDWH-based polar decomposition is a challenging and complex algorithm, it relies on conventional dense linear algebra operations, e.g., QR/Choleskybased linear solvers. These building block operations are well-supported by several open-source and vendoroptimized state-of-the-art numerical libraries.
Thanks to its highly-parallel and compute-bound kernels, the QDWH-based polar decomposition may take advantage of the current manycore era and a foreseen period of convergence, where hardware/software co-design plays now a major role in designing future systems and numerical libraries for exascale.
ALGORITHMIC PARADIGM SHIFT
High performance dense linear algebra software libraries have witnessed an algorithmic paradigm shift in response to hardware evolution, moving from block to tile algorithms.
Block Algorithms
Block algorithms rely on successive panel and update sequences to perform matrix computations. The panel phase is memory-bound and does not benefit from thread parallelism, while the phase of the trailing submatrix update is highly parallel, in which computations are applied by means of multithreaded Level 3 BLAS kernel executions. These sequences are characteristic of the fork-join paradigm, alternating sequential and parallel computational phases, and therefore, suffer from performance losses due to low hardware occupancy engendered by unnecessary in-between synchronization points. In fact, this BSP model corresponds to the backbone of many open-source and vendors' state-of-the-art numerical libraries such as LAPACK [10] , MAGMA [11] and ScaLAPACK [25] for shared-memory, accelerator-based and distributed-memory systems, respectively.
As highlighted in the exascale software roadmap [26] , which summarizes the HPC community consensus on an urgent call for sustainable software development for extreme scale, the BSP model may need to be reconsidered, especially in presence of millions of cores, which already constitute today's supercomputers [1] .
Tile Algorithms
To answer this call for action and provide a solution for the challenge brought by the manycore era, the DLA community has initiated a decade ago a profound redesign of matrix computation algorithms in order to benefit from the high level of concurrency. This translated into breaking down the dense matrix data structure into tiles following a tile data layout as opposed to the standard column-major format, which is the standard for block algorithms. The various matrix operations can then be represented as a directed acyclic graph (DAG), where nodes represent sequential computational tasks and edges define the inter-task data dependencies. The resulting fine-grained computations permit to weaken the artifactual synchronization points by bringing to the fore opportunities for look-ahead, where subsequent tasks may already have their data dependencies satisfied and be ready for execution. In return, this can be exploited by dynamic runtime systems in keeping threads in a busy state throughout the entire execution. The performance gain of block versus tile algorithms has been thoroughly addressed in the literature [27] , [28] , [29] , in the context of PLASMA [30] and FLAME [31] numerical software libraries.
The Chameleon Library
More recently, in a community effort to enhance user productivity by abstracting the hardware complexity, the Chameleon library [13] has been developed to target multiple hardware architectures with a single source code. This is achieved by standardizing existing dynamic runtime system APIs (e.g., OpenMP [32] , OmpSs [33] , [34] , [35] , QUARK [36] , StarPU [37] , [38] , PaRSEC [39] , SuperMatrix [40] ) through a thin layer of abstraction, making the user developer experience oblivious to the underneath runtime system and its corresponding hardware deployment. For instance, this hardware/runtime-oblivious software infrastructure has been already used with StarPU [41] , and more recently with OmpSs [42] , in the context of computational astronomy applications. And this is in the Chameleon software environment that we deploy our QDWH implementation.
THE STARPU DYNAMIC RUNTIME SYSTEM
Dynamic runtime systems are critical scheduling engines in supporting task-based programming models at large scale [43] . In particular, StarPU [37] is the de facto dynamic runtime system for Chameleon. StarPU deals with the execution of generic task graphs, given through the sequential task flow (STF) programming model where tasks are inserted to the runtime in a sequential manner with additional hints on the data directions (i.e., read, write, readwrite). StarPU is then in charge of dynamically scheduling the tasks while enforcing those dependencies. Although Chameleon supports other runtimes (e.g., PaRSEC [39] , QUARK [36] ), we decided to solely rely on the StarPU runtime system to implement this algorithm, since it is probably one of the most mature runtime systems when it comes to supporting various hardware architectures. Comparing the performance of various runtime systems on a given hardware is interesting in the context of QDWH, but beyond the scope of this paper.
One of the main advantages of using the task-based implementation is to become oblivious of the targeted architectures. This improves the user productivity, and it is even more realistic for runtimes such as StarPU, which are able to transparently handle single heterogeneous nodes, and eventually multiple heterogeneous nodes in case the StarPU-MPI [38] extension is used. To enable such portability, StarPU tasks are associated to codelets which groups under the same name multiple implementations of the same task: CPU, CUDA, OpenCL, OpenMP, etc. At runtime, StarPU will automatically decide which implementation of the task is better suited to achieve the highest performance based on cost models. These cost models are automatically generated by StarPU when executing the application and kept for subsequent executions. These models are especially important to the Heterogeneous First Time [44] (HeFT) scheduling strategy used by StarPU, when accelerators are involved in the computations.
Further benefits to using such programming models are the capabilities offered to the programmer to submit simultaneously independent steps of an application. This permits to raise the resources occupancy, and adds a single synchronization point when all steps are performed. The MORSE_xxxx_Tile_Async interface of the Chameleon library offers this capability to interleave multiple dense linear algebra operations when it is possible. Conversely, the synchronous interface, MORSE_xxxx_Tile, enforces a synchronization call at the end of the function to wait for the end of all submitted tasks.
HIGH PERFORMANCE IMPLEMENTATION
In this section, we describe the task-based implementation of the QDWH algorithm and the novel optimizations introduced to increase hardware occupancy and overall performance, in the context of the Chameleon library [13] . 
30: 31: /* Compute U k from U kÀ1 */ 32: if c > 100 then 33:
37: else 38:
dlaset_Async( Z, 0., 1. ) " Z = I 39:
42: 
Algorithm 1 presents the pseudo-code of the task-based QDWH implementation on top of the Chameleon library. It is decomposed in three main code sections. The first one from row 1 to 6 evaluates the two-norm of the input matrix A, as in Eq. (1) from Section 4, that is required to start the iterative process. The two-norm estimator corresponds to the largest singular value of the matrix. It relies on the power iteration, which involves repeated multiplication by the matrix A and A > . The power iteration converges when the difference between two successive estimates fall within the specified relative tolerance. We have introduced genm2 in the Chameleon library through an iterative procedure, in which we minimized the number of synchronizations, thanks to fine-grained computations and look-ahead techniques. The second section of the algorithm computes the initial condition number l 0 from row 7 to 19, as in Eq. (4) from Section 4. The classical way consists in computing an LU factorization of the matrix A, and its one-norm. Then, it is possible to compute an estimator of the condition number with dgecon by means of those two results. The main challenge here resides in the LU factorization with partial pivoting, which is difficult to implement using task-based programming models. Indeed, searches for pivot candidates and row swapping generate many global synchronization points within the panel factorization and its resulting updates. Some solutions have been proposed on shared-memory systems [45] but there are no existing solutions that are oblivious of heterogeneous architectures. We thus propose a QR-based solution which consists in estimating the norm of A À1 by computing the norm of R À1 with A ¼ QR. This solution, which turns out to be less costly, alleviates the pivoting issue all together, uses only regular tile algorithms and allows code portability across various architectures, thanks to the underlying runtime system. The third section of the algorithm, rows 21 to 48, is the main loop of the algorithm, which iterates on U k and converges to the polar factors. This section of the algorithm is straightforward and follows the mathematical description of the problem using either a QR or a Cholesky factorization to calculate the next U, as in Eq. (2) or Eq. (5) from Section 4, respectively. Finally, the last section, rows 49 to 53, computes the Hermitian polar factor H from the polar factor computed out of the main loop.
Code Optimizations
The Chameleon library provides two APIs to perform dense matrix computations. The first one, MORSE_xxxx_Tile, is a synchronous implementation of a linear algebra operation. This means that all the tasks required for the computations are submitted to the runtime, and then the library internally waits for the completion of all tasks before returning the control to the programmer. This is the first version we implemented in the Algorithm 1. To highlight the benefit of using a task-based programming model (through tile algorithms) as opposed to the fork-join paradigm, as implemented in the LAPACK library, we have manually integrated synchronization points within the QR/Cholesky factorization kernel calls, at the end of each panel and update computations, to better emphasize on the performance discrepancy between both approaches. We refer to this reference implementation as Sync.
The second optimization is the possible acceleration of the QR-based Halley iterations. This optimization consists in exploiting the identity matrix structure of the C 2 matrix in the QR factorization (line 34 in Algorithm 1) and the corresponding Q generation (line 35 in Algorithm 1). Indeed, thanks to tile algorithms, it is possible to design a specific QR factorization algorithm in order to factorize a dense matrix on top of an identity matrix. This new QR factorization takes into account the identity matrix structure so that only non-zero tiles are operated on during the factorization. By the same token, during the Q generation step, only the non-zero tiles containing the Householder reflectors will be accessed. This optimization is important as it reduces the number of FLOPs as well as data movement. We refer to this implementation as OptId.
The last optimization, MORSE_xxxx_Tile_Async, ensures that all the tasks of an algorithm are submitted to the runtime, but their completion is not ensured when the function call returns. Thus, it is possible to simultaneously submit tasks of multiple operations. This may unveil look-ahead opportunities at runtime, once the data dependencies are satisfied, and may engender out-of-order task execution. Indeed, the runtime is in charge of keeping the data coherency of tasks, generated from different kernel calls, since these tasks may operate on the same data. Operations that are asynchronously submitted to the runtime are indicated in bold font in Algorithm 1. At some point of the algorithm, synchronization points are however required to guarantee the consistency of the results. This is made through a call to RUNTIME_sequence_wait(), which waits for the completion of all tasks. It is then possible to release synchronization in the three steps of the algorithm to ensure a better occupancy of the resources, especially on small to medium test cases, as presented in Section 9. We refer to this implementation as Async. It is also noteworthy to mention that it is possible to estimate offline the minimal number of iterations performed in the main loop. In that case, the synchronization in line 45 can be safely removed for the first iterations and introduced only for the last iteration as a sanity check on the value conv against the convergence threshold. These three code optimizations (i.e., Sync, OptId and Async) can be combined for further performance. While Sync and Async have a direct impact on task scheduling, OptId actually changes the algorithm and reduces the algorithmic complexity.
ARITHMETIC COMPLEXITY
In this section, we present the algorithmic complexity (FLOPs) of the polar decomposition using two variants based on the Halley iteration (QDWH) and the SVD. For simplicity purposes, we consider only square dense matrices, but QDWH works also for rectangular matrices [21] .
The QDWH-Based Polar Decomposition
The condition number estimation l 0 can be calculated using the LU factorization, which requires 2 3 n 3 operations, followed by two triangular solvers LX ¼ Id and UA À1 ¼ X, adding 2n
3 FLOPs. Alternatively, l 0 can be calculated using the QR factorization, A ¼ QR which needs 4 3 n 3 operations, followed by inverting the upper triangular matrix R with 1 3 n 3 operations. Calculating l 0 using the QR factorization needs less FLOPs overall. Moreover, the resulting QR factors can be reused during the first iteration of QDWH, thanks to fine-grained computations.
As shown in 2, the QDWH FLOPs using QR-based iteration includes the QR decomposition of 2n Â n matrix for a cost of ð3 þ 
, where #it QR and #it Chol correspond to the number of QR-based and Cholesky-based iterations, respectively. As discussed in [21] , the flop count of QDWH depends on l 0 , which involves during the QDWH iteration. The total flop count of QDWH for dense matrices ranges then from ð10 þ FLOPs. Accumulating the Householder reflectors to form
, as explained in [21] . Table 1 summarizes the total flop count of QDWH (including condition number estimation and Halley iteration) (1) when using LU to estimate l 0 (original implementation), (2) when using QR to estimate l 0 and reusing the QR factors in the first iteration of QDWH and (3) when additionally taking advantage of the identity matrix structure in QR-based iterations (2).
The SVD-Based Polar Decomposition
The polar decomposition can be calculated via SVD as fol-
Therefore, the flop count of this approach includes the cost of an SVD decomposition, a matrix-matrix multiplication to compute the orthogonal polar factor U p and a matrix-matrix multiplication to calculate the Hermitian polar factor H.
The standard approach to compute the SVD of a dense matrix is to first reduce it to bidiagonal form
The subsequent left and right singular vectors from the bidiagonal solver are then accumulated during the back transformation phase, i.e., U ¼ U 1 U 2 and V ¼ V 2 V 1 , to calculate the singular vectors of the original matrix A. The final estimated flop count to calculate the SVD is 22n 3 , as implemented in the divide-and-conquer DGESDD [46] . Then, we need to add 2n 3 operations to compute U p ¼ UV , and n 3 to compute H ¼ V > SV (symmetric rank-k update operation). The final estimated cost of the polar decomposition using SVD is, therefore, 25n 3 . Compared to the QDWH-based polar decomposition (3) in Table 1 , this is 30 percent less than in case of illconditioned matrices and almost twice the FLOPs in case of well-conditioned matrices. In theory, it seems there is a clear advantage to use SVD-based for the polar decomposition in presence of ill-conditioned matrices. However, the SVD algorithm inherently suffers from lack of parallelism, due to a very expensive panel factorization phase and may not be as competitive as QDWH-based approaches.
PERFORMANCE RESULTS AND ANALYSIS
This section provides a comprehensive performance analysis of the task-based QDWH implementation in the context of the Chameleon library with the dynamic runtime system StarPU on various architectures.
Environment Settings
We have considered three different single node systems, which are representative of the current manycore-based hardware trends. The first system is composed of dual-socket 16-core Intel Haswell Xeon CPU E5-2698 v3 running at 2.30 GHz equipped with 8 K80 dual-boards with 16 effective GPUs. In the following, we call this system Haswell when no GPU are used, and we add the suffix '+8xK80' whenever both CPUs and GPUs are exploited. The second system hosts the latest Intel commodity chip with dual-socket 14-core Intel Broadwell Xeon E5-2680 v4 running at 2.4 GHz. The third system has the latest Intel manycore Knights Landing (KNL) 7210 chips with 64 cores. For simplicity purposes, each system is named after its chip codename.
All numerical accuracy and parallel performance (time in seconds and GFLOP/s) graphs report experiments performed on the whole system, unless mentioned otherwise. The GFLOP/s graphs are shown in linear scale while the accuracy and time to solution graphs are shown in logarithmic scale.
Our QDWH implementation has been compiled with Intel compiler 16 and linked against the Chameleon library v0.9.0 with hwloc v1.11.4, StarPU v1.2.0 and Intel MKL v11.3.1. Each dense synthetic matrix A ¼ QDQ > is generated by initially setting a diagonal matrix D ¼ diagðSÞ containing the singular values, with a specific condition number and from an orthogonal matrix Q generated by calculating the QR factorization of a random entries.
We have considered well and ill-conditioned randomly generated matrices, with the latter representing the worse case scenario, where QDWH performs a maximum of six iterations. In particular, in the subsequent experiments, our QDWH implementation switches Equations from (2) to (5) from Section 4 if c k is smaller than 100 (see Algorithm 1), which generates QR-based iterations for the first three followed by three Cholesky-based iterations. The QDWH performance graphs in GFlop/s are estimated as the ratio of the algorithmic complexity (see Section 8.1) divided by the execution time. All performance runs have been repeated three times and only the average is reported, since consistent timing results are obtained.
Numerical Accuracy
We recall the polar decomposition of a given general matrix A 2 R nÂn : A ¼ U p H. The norm k:k F denotes the Frobenius norm. To highlight the numerical robustness of the method, we use the following two accuracy metrics:
for the orthogonality of the polar factor U p and
for the accuracy of the overall computed polar decomposition. Fig. 1 presents the orthogonality of U p and the accuracy of the polar decomposition A ¼ U p H for ill-conditioned matrix on the KNL system (very similar numerical results on other systems). We can distinguish two clusters, i.e., QDWHbased and SVD-base polar decomposition, with up to two digits difference in the orthogonality and accuracy magnitudes. Although both mostly employ orthogonal transformations, the SVD variant of the polar decomposition necessitates the DC algorithm, which may show some convergence limitations with ill-conditioned matrices, as shown later in Section 9.6.
Incremental Optimizations
Fig . 2 highlights the performance impact of various incremental optimizations on the task-based QDWH, as described in Section 7.2. Taking advantage of the identity matrix structure (OptId), by only operating on non-zero tiles, engenders up to 20 percent performance improvements compared to the oblivious approach on all studied systems. Thanks to fine-grained computations, look-ahead opportunities are unveiled, which directly translate into asynchronous out-of-order task execution at runtime. In fact, this has been previously reported in the literature [40] , [42] , [47] for many of the dense linear algebra matrix operations, which compose our task-based QDWH implementation (e.g., Cholesky, QR, etc). Therefore, running additionally in asynchronous mode (Async) further reduces time to solution (up to 2.8x). This is true, especially for medium range of matrix sizes, where processing units run out of work and look-ahead techniques jump right in to fill the performance gap. For asymptotic matrix sizes, although work is abundant, the asynchronous mode still provides additional performance. In particular, on KNL and Haswell+8xK80 systems, data movement engendered by NUMA and PCIe channels is expensive and can be overlapped by computations, thanks to the Async optimization. Fig. 3 shows the execution traces when running in synchronous (Sync API) and asynchronous (Async API) modes. We have added additional synchronization points within the Sync API, after each panel/update computation, so that we can better capture the performance gain against coarsegrained computations engendered by block algorithms, as described in Section 5. These traces have been obtained on the KNL system for a matrix size of 10K. Since the matrix is ill-conditioned, the task-based QDWH performs six iterations (three QR-based and three Cholesky-based). The green, yellow and blue blocks correspond to QR, Cholesky/Level 3 BLAS and Level 1/2 BLAS, respectively. We can clearly notice the idle time during the first three QR-based iterations when running with a synchronous mode (Fig. 3(a) ). The performance impact of synchronous execution for the next three Cholesky-based iterations is not as severe as QR-based iterations because the Cholesky panel factorization involves only the diagonal block (Fig. 3(b) ).
Execution Traces
For the subsequent graphs, the performance curves of the task-based QDWH correspond to performance when all optimizations are enabled (i.e., Async and OptId). Scalability   Fig. 4 demonstrates the performance scalability of the taskbased QDWH implementation. The scalability is almost linear for the commodity CPU systems. For instance, 7.5/ 3.7-fold speedups are achieved using 32/28 threads from the reference points of 4/7 threads on the Haswell/Broadwell systems, respectively. On the KNL system, the taskbased QDWH implementation obtains a 1.9-fold speedup on 64 threads, compared to 32 threads.
Performance
On the densely GPU populated Haswell+8xK80 system, with a total of 16 GPUs, moving data between host and device memory turns out to be challenging. The performance bottleneck has been reported in a recent study [48] . By using the StarPU framework, the data movement overhead of moving data through the thin PCIe bus is partially hidden. StarPU is able to cope with some of these communication overheads by mitigating and adjusting to the memory congestion, thanks to its asynchronous mode of execution. The obtained speedup is 1.4-fold on 16 GPUs, compared to 8 GPUs. Further optimizations are possible by integrating into StarPU some of the performance models discussed in [48] . Fig. 5 reports task-based QDWH performance against other various existing QDWH implementations and SVD-based polar decomposition on ill (left) and well (right) conditioned matrices, across the three systems. The missing data points for the polar decomposition variant based on the SVD from MKL correspond to runs, which did not achieve the proper accuracy, as defined in Section 9.2, probably due to the convergence failure encountered by the algorithm in the SVD. The corresponding variant with Elemental does not face this problem because it uses a different implementation of the QR algorithm in the SVD instead. For well-conditioned matrices, time to solution is much more shortened for the QDWH implementation variants, thanks to less iterations for convergence. The SVD variants of the polar decomposition do not seem to take advantage of such matrices since the bidiagonal reduction and the matrix-matrix multiplication have still to be performed in the same manner, regardless of the matrix condition number.
Performance Comparisons of QDWH Variants
All in all, the task-based QDWH achieves gains up to [6, It is noteworthy to mention that our task-based QDWH implementation using Chameleon on the 16 GPUs does not perform well on small matrix sizes. At that matrix scale, the workloads are too small to saturate all the devices' floatingpoint units, and therefore, performance is ultimately limited by the overhead of off-loading data back and forth.
Also, compared to MAGMA_QDWH [6] , the task-based QDWH achieves gains up to [71, 22 percent] on Haswell +4xK80 for [ill, well]-conditioned matrices, respectively.
Performance Comparisons Across Architectures
We have additionally considered two more recent architectures, i.e., a dual-socket 10-cores IBM Power8 (3.69 GHz) and a dual-socket 16-cores Intel Haswell equipped with four NVIDIA Pascal P100 GPUs. Fig. 6 presents the performance of the task-based QDWH across all systems investigated in the paper. The main idea is not to cross-compare the performance delivered by each system but rather to show that the task-based QDWH can support not only various architectures but also can achieve decent sustained peak (up to 90 percent and up to 60 percent of the sustained Chameleon DGEMM peak for CPU only systems and for KNL/GPUs platforms, respectively).
CONCLUSION AND FUTURE WORK
We have presented a comprehensive performance analysis of a novel asynchronous task-based QDWH algorithm for the polar decomposition of a dense matrix. Thanks to finegrained computations, we have reduced by 20 percent the overall complexity by taking advantage of the identity structure of the matrix during the iterations, while exposing look-ahead opportunities to increase hardware occupancy. Furthermore, the Chameleon library and its dynamic runtime system StarPU abstracts the hardware complexity from end-users and is capable of asynchronously scheduling computational tasks on the underlying processing units. Thanks to its wide hardware range support, we demonstrated that StarPU can port a single sequential source code to a myriad of hardware systems. Experimental results of the asynchronous task-based QDWH show significant performance improvement (up to an order of magnitude) against state-of-the-art implementations on ill and well-conditioned matrices across various hardware technologies, which are paving the road to future petascale/ exascale systems. Future work includes using the task-based QDWH implementation as a building block for the dense symmetric eigensolver and SVD on shared and distributed-memory systems. In particular, the extension to distributed-memory systems will necessitate the redesign of the existing QR factorization in the Chameleon library, which is currently based on a flat tree [27] . This will engender excessive communications in distributed-memory systems, although it does not prevent our task-based QDWH implementation from getting high performance for the matrix sizes and systems' configurations studied in this paper. The idea will be to adapt a hierarchical reduction tree [49] into our customized QR factorization to privilege intra-node communications.
Last but not least, we would like to investigate other QDWH algorithmic variants, which may require more FLOPs but entails an even higher level of concurrency [50] .
Dalal Sukkari received the BS degree in mathematics from Hashemite University of Jordan and the MSc degree in applied mathematics from King Abdullah University of Science and Technology, in July 2013 with Dean's Academic Excellence Award. She is working toward the PhD degree in applied mathematics at King Abdullah University of Science and Technology. She has worked on a new high performance implementation of the QR-based dynamically weighted halley singular value decomposition (QDWH-SVD) solver on multicore architecture enhanced with multiple GPUs. She has introduced a high performance QDWH implementation on distributed memory based on the state-of-the-art vendor-optimized numerical library ScaLAPACK.
Hatem Ltaief is a senior research scientist in the Extreme Computing Research Center, King Abdullah University of Science and Technology. His research interests include parallel numerical algorithms, parallel programming models, and performance optimizations for multicore architectures and hardware accelerators. His current research collaborators include INRIA, Observatoire de Paris, Cray, NVIDIA, and Intel.
Mathieu Faverge is an assistant professor with Bordeaux INP and is member of the Inria HiePACS team since 2012. His main research interests include numerical linear algebra algorithms for sparse and dense problems on massively parallel architectures, and especially DAG algorithms relying on dynamic schedulers. He has experience with hierarchical shared memory, heterogeneous and distributed systems, and his contributions to the scientific community include efficient linear algebra algorithms for those systems.
David Keyes received BSE degree from Princeton and the PhD degree from Harvard. He is the director of the Extreme Computing Research Center, King Abdullah University of Science and Technology and an adjoint professor of applied mathematics with Columbia University. He works at the algorithmic interface between parallel computing and the numerical analysis of partial differential equations. He is a fellow of SIAM and AMS and has won the Gordon Bell Prize (ACM) and the Sidney Fernbach Award (IEEE).
" For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.
