Abstract: LU factorization with partial pivoting is a canonical numerical procedure and the main component of the high performance LINPACK benchmark. This paper presents an implementation of the algorithm for a hybrid, shared memory system with standard CPU cores and GPU accelerators. The difficulty of implementing the algorithm for such a system lies in the disproportion between the computational power of the CPUs and that of the GPUs, and in the meager bandwidth of the communication link between their memory systems. An additional challenge comes from the memory-bound and synchronization-rich nature of the panel factorization component of the block LU algorithm, imposed by the use of partial pivoting. The challenges are tackled with the use of a data layout geared toward complex memory hierarchies, autotuning of GPU kernels, fine-grain parallelization of memory-bound CPU operations, and dynamic scheduling of tasks to different devices. Performance in excess of one TeraFLOPS is achieved using four AMD Magny Cours CPUs and four NVIDIA Fermi GPUs.
INTRODUCTION
This paper presents an implementation of the canonical formulation of the LU factorization, which relies on partial pivoting for numerical stability. It is equivalent to the DGETRF function from the LAPACK numerical library [1]. Since the algorithm is coded in double precision, it can serve as the basis for an implementation of the high performance LINPACK benchmark (HPL) [13]. The target platform is a system combining one CPU board with four 12-core CPUs and one GPU board with four 14-core GPUs, for a total of 104 hybrid cores. Here, GPU core means a device that can independently schedule instructions, which in NVIDIA nomenclature is called a streaming multiprocessor. It is not to be confused with a CUDA core. The memory system of the CPUs, referred to as the host memory, is a cache-coherent nonuniform memory access shared memory system. The GPUs have their private memories, referred to as device memories. Communication between the host memory and the device memories is handled by direct memory access (DMA) engines of the GPUs and crosses the PCI express bus.
Numerous challenges are posed both by the target hardware and the target algorithm. Although presenting a similar number of cores, the GPUs have an order of magnitude higher floating-point peak performance. The disproportion is exacerbated by the fact that the GPUs are tasked with regular, data-parallel, and compute-intensive work, while the CPUs are tasked with irregular, synchronization-rich, and memory-bound work. The algorithm itself is challenging, specifically the technique of partial pivoting, which introduces irregular processing patterns and hard synchronization points. These challenges are tackled with a combination of both well established and novel techniques in parallel dense linear algebra, such as:
- tile matrix layout,
- GPU kernel autotuning,
- parallel recursive panel factorization,
- the technique of lookahead,
- dynamic (superscalar) task scheduling,
- communication and computation overlapping.
Notably, the level of performance reported in this work could be accomplished thanks to recently developed capabilities, such as a GPU kernel autotuning methodology and superscalar scheduling techniques.
Motivation
Two trends can be clearly observed in microprocessor technology: steadily increasing number of cores and integration of hybrid cores in a single chip. Current commodity processors go as high as 16 cores (e.g., AMD Interlagos) and all major microprocessor companies develop hybrid chips (NVIDIA Tegra, AMD Fusion, Intel MIC). It is to be expected, then, that in a few years hybrid chips with O(100) cores will be the norm, which is why the platform of choice for this paper is a system with 104 cores, 48 classic superscalar cores and 56 accelerator (GPU) cores. At the same time, accelerators are steadily gaining traction in many areas of scientific computing [2] , [19] , [21] , [32] .
Although the HPL is commonly perceived as an artificial benchmark, there are actually numerous examples of applications relying heavily on the Gaussian elimination for the solution of a large dense system of linear equations. For instance, the electromagnetics community is a major user of dense linear solvers. Of particular interest is the computation of the radar cross section of an aircraft or a sea vessel. The problem is commonly solved using the boundary element method, sometimes also referred to as the method of moments, which produces a large dense system of linear equations, and employs the Gaussian elimination for its solution.
The problem of achieving a self-sustaining fusion reaction provides another good example. "... the All ORders Spectral Algorithm (AORSA) simulation program, developed within the Scientific Discovery through Advanced Computing (SciDAC) Numerical Computation of Wave Plasma-Interactions in Multidimensional Systems project, has demonstrated how electromagnetic waves can be used for driving current flow, heating, and controlling instabilities in the plasma" [6] . In the quoted article, Barrett et al. describe how a complex version of the HPL benchmark was used to double the performance of ScaLAPACK to solve a system of equations with half a million unknowns using 10,000 cores.
Original Contribution
The value of this work is in combining state-of-the-art solutions in dense linear algebra to overcome the challenges of producing a high speed implementation of the LU factorization for a heterogeneous multicore system with more than one hundred cores. Specifically, this work leverages some very recent developments, such as the parallel-recursive panel factorization [12] , GPU kernel autotuning [27] , and dynamic (superscalar) scheduling [18] , [25] . At the same time, the solution follows the principles of the PLASMA software library, by utilizing the tile data layout and the superscalar scheduling subsystem, which makes the code ready for software integration.
Block LU Factorization
The LAPACK block LU factorization is the main point of reference here, and the LAPACK naming convention is followed. The LU factorization of a matrix $M$ has the form $M = PLU$, where $L$ is a unit lower triangular matrix, $U$ is an upper triangular matrix, and $P$ is a permutation matrix. The LAPACK algorithm proceeds in the following steps: Initially, a set of nb columns (the panel) is factored and a pivoting pattern is produced (DGETF2). Then, the elementary transformations, resulting from the panel factorization, are applied to the remaining part of the matrix (the trailing submatrix). This involves swapping of up to nb rows of the trailing submatrix (DLASWP), according to the pivoting pattern, application of a triangular solve with multiple right-hand sides to the top nb rows of the trailing submatrix (DTRSM), and finally, application of matrix multiplication of the form $C = C - A \times B$ (DGEMM), where $A$ is the panel without the top nb rows, $B$ is the top nb rows of the trailing submatrix, and $C$ is the trailing submatrix without the top nb rows. Then, the procedure is applied repeatedly, descending down the diagonal of the matrix.
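The four steps above can be illustrated with a minimal sketch in plain Python. It is not the LAPACK code or the implementation described in this paper; the function names `blocked_lu` and `apply_pivots`, the list-of-rows storage, and the choice to apply the row swaps to the already factored columns on the left as well are assumptions made for illustration. The comments name the LAPACK/BLAS routine whose role each loop plays.

```python
def blocked_lu(a, nb):
    """In-place right-looking block LU with partial pivoting.

    `a` is a square matrix stored as a list of row lists (illustrative layout).
    Returns piv, where row k was exchanged with row piv[k] at step k.
    """
    n = len(a)
    piv = list(range(n))
    for k0 in range(0, n, nb):
        k1 = min(k0 + nb, n)
        # Panel factorization (role of DGETF2): unblocked LU of columns k0..k1-1.
        for k in range(k0, k1):
            p = max(range(k, n), key=lambda i: abs(a[i][k]))
            if a[p][k] == 0.0:
                raise ZeroDivisionError("matrix is singular")
            piv[k] = p
            if p != k:  # swap restricted to the panel columns
                for j in range(k0, k1):
                    a[k][j], a[p][j] = a[p][j], a[k][j]
            for i in range(k + 1, n):
                a[i][k] /= a[k][k]
                for j in range(k + 1, k1):
                    a[i][j] -= a[i][k] * a[k][j]
        # Row swaps (role of DLASWP) on all columns outside the panel.
        outside = list(range(0, k0)) + list(range(k1, n))
        for k in range(k0, k1):
            p = piv[k]
            if p != k:
                for j in outside:
                    a[k][j], a[p][j] = a[p][j], a[k][j]
        # Triangular solve (role of DTRSM) on the top rows of the trailing submatrix.
        for k in range(k0, k1):
            for i in range(k + 1, k1):
                for j in range(k1, n):
                    a[i][j] -= a[i][k] * a[k][j]
        # Matrix multiplication C = C - A * B (role of DGEMM).
        for i in range(k1, n):
            for j in range(k1, n):
                a[i][j] -= sum(a[i][k] * a[k][j] for k in range(k0, k1))
    return piv


def apply_pivots(m, piv):
    """Return a copy of m with the recorded row exchanges applied in order."""
    out = [row[:] for row in m]
    for k, p in enumerate(piv):
        out[k], out[p] = out[p], out[k]
    return out
```

After the call, the strictly lower part of `a` holds the multipliers of $L$ (the unit diagonal is implicit) and the upper part holds $U$; multiplying them back should reproduce the row-permuted input returned by `apply_pivots`.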
RELATED WORK
Seminal work on recursive dense linear algebra algorithms was done by Gustavson [16] , who published recursive formulations of the Cholesky, LDLT, and LU factorizations.
Eventually, recursion was also applied to the QR factorization by Elmroth and Gustavson [14]. Recently, Castaldo and Whaley [7] developed fast implementations of LU and QR panel operations using a technique referred to as parallel cache assignment (PCA). PCA builds on the bulk synchronous parallel model of parallelization [37], and relies on fork-and-join execution with barrier synchronizations, but allows for the preservation of data in caches in a sequence of multiple BLAS 2 operations.
Work on GPU accelerated dense linear algebra routines started before general-purpose programming environments, such as CUDA or OpenCL, were available. This time period is often referred to, somewhat ironically, as the general purpose GPU era. The earliest implementation of a matrix factorization was reported by Galoppo [15] , who implemented the nonblocked LU decomposition without pivoting, with partial pivoting, and with full pivoting.
More papers followed when CUDA became available, largely thanks to the CUBLAS library (CUDA BLAS) provided by NVIDIA. Implementations of dense matrix factorizations were reported by Barrachina et al. [5] , Baboulin et al. [3] , and Castillo et al. [8] . Seminal work was done by Volkov and Demmel [38] , where notably a two-GPU implementation of the LU factorization was reported, and one-dimensional block cyclic data distribution was used. It was followed by the work of Tomov et al. [35] , [36] in the context of the matrix algebra for GPUs and multicore architectures (MAGMA) library.
An important part of these developments is the work solely focusing on optimizing matrix multiplication. Early work on tuning GEMMs in CUDA for NVIDIA GPUs targeted the previous generation of GPUs, of the GT200 architecture, such as the popular GTX 280. Pioneering work was done by Volkov and Demmel [38]. Similar efforts followed in the MAGMA project [28]. The introduction of the NVIDIA Fermi architecture triggered the development of MAGMA GEMM kernels for that architecture [30], [31], which recently evolved into a systematic autotuning approach named Automatic Stencil TuneR for Accelerators (ASTRA) [27]. Other related efforts include the compiler-based work by Rudy et al. [33] and Cui et al. [10], and low-level kernel development by Nakasato [29] and Tan et al. [34].
Dense linear algebra codes, including the Cholesky, LU, and QR factorizations, have also been offloaded to the IBM Cell B. E. accelerator [9] , [22] , [23] , [24] . Two efforts are specifically worth mentioning. Chen et al. [9] developed a single precision implementation of the LINPACK benchmark for the QS20 system, which relied on tile matrix layout and a cache-resident panel factorization. Kistler et al. [20] developed a double precision implementation of the LINPACK benchmark for the QS22 system, which employed a recursive panel factorization.
Bach et al. [4] reported an implementation of the LINPACK benchmark for a system with AMD GPUs and Deisher et al. [11] reported an implementation for the Intel MIC architecture. Both implementations follow a very different approach than the one presented in this paper.
SOLUTION
The solution follows the design principles of the PLASMA numerical library by storing and processing the matrix by tiles and using dynamic, dependency-driven, runtime task scheduling. The same basic idea was previously applied to the tile QR factorization [26] . This paper builds on previous experiences to develop an implementation of a much harder algorithm in a multi-GPU scenario. The sections to follow outline the main hybridization idea, provide the motivation for the use of a tile matrix layout, describe the development of CPU and GPU kernels, explain the scheduling methodology, and discuss the communication requirements.
Hybridization
The main hybridization idea is captured in Fig. 1 and relies on representing the work as a directed acyclic graph (DAG) and dynamic task scheduling, with CPU cores handling the complex fine-grained tasks on the critical path and GPUs handling the coarse-grained data-parallel tasks outside of the critical path.
Some number of columns (lookahead) are assigned to the CPUs and the rest of the matrix is assigned to the GPUs in a one-dimensional block-cyclic fashion (Fig. 2) . In each step of the factorization, the CPUs factor a panel and update their portion of the trailing submatrix, while the GPUs update their portions of the trailing submatrix. After each step, one column of tiles shifts from the GPUs to the CPUs (from device memory to host memory).
The main advantage of this solution is the capability of overlapping the CPU processing and the GPU processing (and also overlapping of communication and computation). The GPUs have to be idle while the first panel is factored. However, the factorization of the second panel can proceed in parallel with the application of the first panel to the trailing submatrix. In practice, the level of overlapping is much bigger, i.e., the panel factorizations are a few steps ahead of updates.
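The assignment of tile columns to devices can be sketched as follows. This is a hypothetical illustration only: the function name `column_owner`, the string labels, and the rule that a GPU column belongs to GPU `col % num_gpus` are assumptions, since the exact mapping is given by Fig. 2 rather than by the text.

```python
def column_owner(col, step, lookahead, num_gpus):
    """Return which device holds tile column `col` at factorization step `step`.

    Assumed rule: columns left of the current panel are finished, the next
    `lookahead` columns live in host memory, and the remaining columns are
    dealt to the GPUs cyclically by their global column index.
    """
    if col < step:
        return "done"
    if col < step + lookahead:
        return "cpu"
    return "gpu%d" % (col % num_gpus)


def gpu_load(num_cols, step, lookahead, num_gpus):
    """Count how many tile columns each GPU updates at a given step."""
    counts = [0] * num_gpus
    for col in range(step + lookahead, num_cols):
        counts[col % num_gpus] += 1
    return counts
```

Under this rule, the column at index `step + lookahead` is a GPU column at step `step` and becomes a CPU column one step later, which mirrors the shift of one column of tiles from device memory to host memory after each step.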
Data Layout
The matrix is laid out in square tiles on the CPU side (host memory), where each tile occupies a contiguous region of memory. Tiles are stored in column-major order, and elements within tiles are also stored in column-major order. This layout, referred to as column-column rectangular block (CCRB) [17], is the native layout of the PLASMA library. Here, only matrices evenly divisible into tiles are considered (Fig. 3). Inclusion in the PLASMA library would require generalization of the code to matrices that are not evenly divisible into tiles. Tiles are transposed on the GPU side (device memory), i.e., the layout is translated to column-row rectangular block (CRRB), which is critical to the performance of the row swap (DLASWP) operation (Section 3.5.1). This tilewise transposition is trivial to code and fast to execute (Section 3.5.3).
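The address arithmetic behind the two layouts can be sketched as below. The function names and the flat Python list standing in for memory are assumptions for illustration; `mt` denotes the number of tile rows, and the matrix is assumed to be evenly divisible into `nb` by `nb` tiles, as in the text.

```python
def ccrb_offset(i, j, nb, mt):
    """Offset of element (i, j): tiles column-major, elements column-major."""
    tile = (j // nb) * mt + (i // nb)
    return tile * nb * nb + (j % nb) * nb + (i % nb)


def crrb_offset(i, j, nb, mt):
    """Offset of element (i, j): tiles column-major, elements row-major."""
    tile = (j // nb) * mt + (i // nb)
    return tile * nb * nb + (i % nb) * nb + (j % nb)


def ccrb_to_crrb(buf, nb):
    """Transpose every nb x nb tile of a flat buffer, returning a new buffer."""
    out = list(buf)
    for base in range(0, len(buf), nb * nb):
        for ii in range(nb):
            for jj in range(ii + 1, nb):
                x = base + jj * nb + ii
                y = base + ii * nb + jj
                out[x], out[y] = out[y], out[x]
    return out
```

Because each tile is contiguous in both layouts, the conversion never moves data between tiles; it only transposes within a tile.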
Parallel Panel on Multicore CPUs
The canonical way of performing panel factorization in the block LU algorithm is to use vector operations and matrix-vector operations (Levels 1 and 2 basic linear algebra subroutines (BLAS)). This is what the LAPACK DGETF2 routine does. Very low performance can be expected for any realistic panel sizes, due to the memory-bound nature of Level 1 and 2 BLAS. For instance, for panels of width 192 and height greater than 5,000, the DGETF2 routine barely exceeds 2 Gflop/s of performance on a typical Intel or AMD processor.
The panel factorization is in essence an LU factorization of a narrow submatrix, commonly referred to as a tall and skinny matrix. Therefore, it can be subdivided into a sequence of yet thinner panel factorizations and updates. For instance, the standard panel width in LAPACK is 64, so it makes sense to call the DGETRF function in LAPACK to factorize a panel of width 192. This function will internally perform three panel factorizations of width 64, by calling DGETF2, and two trailing submatrix updates, corresponding to the first two panels. Unfortunately, due to the narrow shape of the submatrices involved, this approach is only slightly more compute intensive, and the performance only goes up to 3 Gflop/s. This is still inadequate, considering that the GPUs can provide in excess of 1 Tflop/s of performance.
The problem is that it takes much longer to factor the panels than it takes to apply the corresponding updates. In such a case, the panel factorizations completely dominate the execution time, effectively nullifying the benefits of the GPUs. Clearly, much faster panel factorization is needed. A similar argument has been made for codes that do not attempt to overlap panel factorizations and updates of trailing submatrices [7] .
The application of recursion allows for a decrease in memory intensity by introducing some degree of level 3 BLAS operations [16] (Fig. 4) . At the same time, tiles of the panel are statically assigned to cores and each core preserves the same set of tiles throughout all the steps of the panel factorization. At some point in the LU factorization, panels become short enough to fit in the aggregate cache of the designated cores, i.e., panel operations become cache resident, which at some level resembles the technique of PCA [7] currently employed by automatically tuned linear algebra software (ATLAS). The cores are forced to work in lock step, but can benefit from a high level of cache reuse. The ultrafine granularity of operations requires very lightweight synchronization. Synchronization is implemented using busy waiting on volatile variables and works at the speed of hardware cache-coherency. Fig. 5 shows the scalability of the panel implementation for panels of width 192, when using 6 cores, 12 cores (one socket), and 24 cores (two sockets). Performance of 6 cores exceeds 12 Gflop/s, performance of 12 cores exceeds 21 Gflop/s, and performance of 24 cores exceeds 28 Gflop/s. This is a tremendous performance improvement over the baseline DGETF2 and DGETRF functions.
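The recursive splitting of the panel can be sketched in serial form as below. This is only an illustration of the recursion itself, assuming a list-of-rows matrix and a hypothetical function name `recursive_lu`; it omits the static assignment of tiles to cores and the busy-waiting synchronization described above.

```python
def recursive_lu(a, piv, c0, c1):
    """Factor columns c0..c1-1 of the tall matrix `a` in place, recursively.

    Row exchanges are applied to the columns c0..c1-1 only; piv[k] records
    the row exchanged with row k.
    """
    m = len(a)
    if c1 - c0 == 1:
        # Base case: a single column, pivot search and scaling (Level 1 BLAS).
        k = c0
        p = max(range(k, m), key=lambda i: abs(a[i][k]))
        if a[p][k] == 0.0:
            raise ZeroDivisionError("panel is rank deficient")
        piv[k] = p
        a[k][k], a[p][k] = a[p][k], a[k][k]
        for i in range(k + 1, m):
            a[i][k] /= a[k][k]
        return
    mid = (c0 + c1) // 2
    recursive_lu(a, piv, c0, mid)  # factor the left half
    for k in range(c0, mid):  # apply the left pivots to the right half
        p = piv[k]
        if p != k:
            for j in range(mid, c1):
                a[k][j], a[p][j] = a[p][j], a[k][j]
    for k in range(c0, mid):  # triangular solve on the top of the right half
        for i in range(k + 1, mid):
            for j in range(mid, c1):
                a[i][j] -= a[i][k] * a[k][j]
    for i in range(mid, m):  # matrix multiplication update (Level 3 BLAS)
        for j in range(mid, c1):
            a[i][j] -= sum(a[i][k] * a[k][j] for k in range(c0, mid))
    recursive_lu(a, piv, mid, c1)  # factor the right half
    for k in range(mid, c1):  # apply the right pivots to the left half
        p = piv[k]
        if p != k:
            for j in range(c0, mid):
                a[k][j], a[p][j] = a[p][j], a[k][j]
```

The triangular solve and the multiplication between the two recursive calls are where the Level 3 BLAS work appears; the wider the half being updated, the larger the share of such work.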
CPU Update Kernels
The update is relatively straightforward and requires three operations: row swap (DLASWP), triangular solve (DTRSM), and matrix multiplication (DGEMM). In the case of DLASWP, one core is responsible for swaps in one column of tiles. The LAPACK DLASWP function cannot be used, because of the use of tile layout, so DLASWP with augmented address arithmetic is hand coded. In the case of DTRSM and DGEMM, one core is responsible for one tile. Calls to the Intel Math Kernel Library (MKL) are used with layout set to column major.
GPU Kernels
The set of required GPU kernels includes the kernels to apply the update to the trailing submatrix (DLASWP, DTRSM, and DGEMM), and the kernel to translate the panel between the CCRB layout, used on the CPU side, and the CRRB layout, used on the GPU side. The DLASWP kernel, the DTRSM kernel, and the transposition kernel are simple to write and do not have much impact on the runtime. These kernels are described first. They are followed by a longer description of the DGEMM kernel, which dominates the execution time and is the most complex.
DLASWP
The DLASWP routine swaps rows of the trailing submatrix according to the pivoting pattern established in the panel factorization. This operation only performs data motion, and the GPUs are very sensitive to the matrix layout in memory. In row-major layout, threads in a warp can simultaneously access consecutive memory locations. This is not the case in column-major layout, where threads access memory with a stride. In this case, each thread generates a separate memory request, which is devastating to performance. As a result, performance is two orders of magnitude lower, and the swap operation dominates the update. This forces the use of the CRRB format, i.e., row-major storage of elements within tiles. As soon as the CRRB format is used, a straightforward implementation of the DLASWP operation completely suffices. Each thread block is tasked with swaps in one column of tiles and creates NB threads to perform them, one thread per column of elements. This may not yet be the fastest possible way of implementing the swap. However, at this point, the impact of the swap operation on the overall performance becomes negligible.
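The effect of the layout on a row access can be made concrete with a small address calculation. This is a simplified model, not a description of the actual hardware behavior: the function names, the warp size of 32, the 8-byte element size, and the 128-byte memory segment size are assumptions chosen for illustration.

```python
def row_addresses(row, nb, layout, warp=32):
    """Element offsets read by one warp when thread c touches (row, c) of a tile."""
    if layout == "row-major":
        return [row * nb + col for col in range(warp)]
    if layout == "column-major":
        return [col * nb + row for col in range(warp)]
    raise ValueError("unknown layout: %r" % layout)


def segments_touched(addresses, elem_bytes=8, segment_bytes=128):
    """Number of distinct memory segments covered by the given element offsets."""
    return len({(a * elem_bytes) // segment_bytes for a in addresses})
```

With a tile size of 192, the row-major case keeps the warp within a couple of neighboring segments, while in the column-major case every thread lands in a different segment, which corresponds to the separate memory requests mentioned above.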
DTRSM
The DTRSM routine uses the lower triangle of the $NB \times NB$ diagonal block to apply triangular solve to the block of right-hand sides formed by the top NB rows of the trailing submatrix. An efficient implementation of this routine on a GPU is difficult due to the data-parallel nature of GPUs and small size of the solve ($32 \le NB \le 288$).
In such a case, the standard procedure for GPUs is to replace the in-place triangular solve operation with an out-of-place multiplication of the block of right-hand sides by the inverse of the triangle. After the panel factorization, one CPU core applies the triangular solve to an $NB \times NB$ identity matrix. In the update phase, the GPUs call the DGEMM routine to apply the inverted matrix to the block of right-hand sides in an out-of-place fashion, followed by a copy of the result to the location of the original block of right-hand sides.
This operation executes at the speed of the DGEMM operation, with twice as many FLOPs as the standard DTRSM function. This is the fastest way of implementing it, known to the authors. Because it only affects a small portion of the trailing submatrix, its execution time is negligible, compared to DGEMM.
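The replacement of the triangular solve by a multiplication can be sketched as follows, assuming a unit lower triangular block as produced by the LU factorization. The function names and the list-of-rows storage are hypothetical; `trsm_forward` serves as the reference solve and also computes the inverse when given the identity as its right-hand side.

```python
def trsm_forward(l, b):
    """Solve L * X = B by forward substitution, L unit lower triangular."""
    x = [row[:] for row in b]
    nb = len(l)
    for k in range(nb):
        for i in range(k + 1, nb):
            for j in range(len(x[0])):
                x[i][j] -= l[i][k] * x[k][j]
    return x


def invert_unit_lower(l):
    """Triangular solve applied to an identity matrix (the CPU-side step)."""
    nb = len(l)
    identity = [[1.0 if i == j else 0.0 for j in range(nb)] for i in range(nb)]
    return trsm_forward(l, identity)


def matmul(a, b):
    """Plain out-of-place matrix product, standing in for DGEMM."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def trsm_by_inverse(l, b):
    """Out-of-place solve: multiply the right-hand sides by the inverse of L."""
    return matmul(invert_unit_lower(l), b)
```

The multiplication touches the full square inverse rather than only a triangle, which is the source of the doubled FLOP count noted above.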
CCRB-CRRB Conversion
As already mentioned in Section 3.2, tile layout has numerous advantages and is the layout of choice for the PLASMA library. However, PLASMA lays out data in tiles by columns, and the GPUs require data to be laid out by rows. Otherwise, the DLASWP operation cannot perform adequately. Therefore, an operation is needed which internally transposes each tile, i.e., makes a conversion between the CCRB and the CRRB formats.
A very simple implementation is used here. Each thread block launches 1,024 threads arranged in a $32 \times 32$ grid, and each thread swaps two elements of the matrix to their transposed locations. The submatrix (column) being transposed is overlaid with a rectangular grid of blocks. Threads with the first element below the tile's diagonal perform the swap. Threads with the first element above the diagonal quit. At this point, the impact of the swap operation on the overall performance is negligible.
DGEMM
The DGEMM kernels are produced using the ASTRA system [27] , which follows the principles of automated empirical optimization of software, popularized by the ATLAS [39] . The same process is currently used to produce DGEMM kernels for the MAGMA project.
The kernel is expressed through a parametrized stencil, creating a large search space of possible implementations. The search space is aggressively pruned, using mostly constraints related to the usage of hardware resources. On NVIDIA GPUs, one of the main selection criteria is occupancy, i.e., the capability of the kernel to launch a large number of single instruction, multiple threads (SIMT) threads. The pruning process identifies a few tens of kernels for each tile size. The final step of autotuning is benchmarking these kernels to find the best performing ones.
There are two differences between the kernels used here and the MAGMA kernels. MAGMA kernels operate on matrices in canonical FORTRAN 77 column-major layout, compliant with the BLAS standard. The kernels used here operate on matrices in CRRB tile layout [17]. Also, MAGMA kernels are tuned for the case where all three input matrices are square, while the kernels used here are tuned for the block outer product operation in the LU factorization, i.e., $C = C - A \times B$, where the width of $A$ and the height of $B$ are equal to the matrix tile size nb. Fig. 6 shows the performance of the fastest kernels.
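The prune-then-benchmark flow can be illustrated with the hypothetical sketch below. It is not the ASTRA code: the stencil parameters (`blk_m`, `blk_n`, `blk_k`, `dim_m`, `dim_n`), the candidate sizes, the shared memory model, and the per-multiprocessor limits are all assumptions chosen to resemble a Fermi-class device, and the `benchmark` callable stands in for actually timing a compiled kernel.

```python
from itertools import product

# Illustrative per-multiprocessor limits (assumed, Fermi-like values).
MAX_THREADS_PER_SM = 1536
MAX_BLOCKS_PER_SM = 8
SHARED_BYTES_PER_SM = 48 * 1024
MAX_THREADS_PER_BLOCK = 1024
BYTES_PER_DOUBLE = 8


def occupancy(threads_per_block, shared_bytes_per_block):
    """Fraction of the maximum resident threads reached by one configuration."""
    if threads_per_block > MAX_THREADS_PER_BLOCK:
        return 0.0
    if shared_bytes_per_block > SHARED_BYTES_PER_SM:
        return 0.0
    resident_blocks = min(
        MAX_BLOCKS_PER_SM,
        MAX_THREADS_PER_SM // threads_per_block,
        SHARED_BYTES_PER_SM // shared_bytes_per_block,
    )
    return resident_blocks * threads_per_block / MAX_THREADS_PER_SM


def prune(nb, min_occupancy=0.5):
    """Enumerate hypothetical stencil parameters and keep the feasible ones."""
    kept = []
    sizes = (8, 16, 32, 64)
    for blk_m, blk_n, blk_k, dim_m, dim_n in product(sizes, repeat=5):
        if blk_m % dim_m or blk_n % dim_n or nb % blk_k:
            continue  # thread grid must tile the block; blk_k must tile nb
        threads = dim_m * dim_n
        shared = (blk_m * blk_k + blk_k * blk_n) * BYTES_PER_DOUBLE
        if occupancy(threads, shared) >= min_occupancy:
            kept.append((blk_m, blk_n, blk_k, dim_m, dim_n))
    return kept


def pick_best(candidates, benchmark):
    """Final step: score every surviving candidate and keep the best one."""
    return max(candidates, key=benchmark)
```

In a real autotuner the `benchmark` argument would compile and time each candidate on the device; here any function that maps a parameter tuple to a score can be passed in.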
Scheduling
Manually multithreading the hybrid LU factorization would be nontrivial. It would be a challenge to track dependencies without automation, given the three different levels of granularity involved: a single tile, one column, and a large block (submatrix). Here, the QUARK superscalar scheduler [40] is used for automatic dependency tracking and work scheduling. The LU factorization code is expressed with the canonical serial loop nest (Fig. 7), where calls to CPU and GPU kernels are augmented with information about sizes of affected memory regions and directionality of arguments (IN, OUT, INOUT). QUARK schedules the work by resolving data hazards (RaW, WaR, WaW) at runtime. Three important extensions are critical to the implementation of the hybrid LU factorization: task prioritization, variable-length list of dependencies, and support for nested parallelism.
The first feature is task prioritization. It is essential that CPUs aggressively execute the critical path, i.e., traverse the DAG in a depth-first fashion. This guarantees that the panels are executed quickly and sent to the GPUs. The DAG, however, is never built in its entirety and the scheduler has no way of knowing the critical path. Instead, the critical path is indicated by the programmer, by using a priority flag when queuing the tasks in the critical path: panel factorizations and updates of the columns immediately to the right of each panel. Prioritized tasks are placed in the front of the execution queue.
The second feature is variable-length lists of parameters. CPU tasks, such as panel factorizations and row swaps, affect columns of the matrix of variable height. For such tasks, the list of dependencies is created incrementally, by looping over the tiles involved in the operation. It is a similar situation for the GPU tasks, which involve large blocks of the matrix (large arrays of tiles). The only difference is that, here, transitive (redundant) dependencies are manually removed to decrease scheduling overheads, while preserving correctness.
The third crucial extension of QUARK is support for nested parallelism, i.e., superscalar scheduling of tasks, which are internally multithreaded. The hybrid LU factorization requires parallel panel factorization for the CPUs to be able to keep pace with the GPUs. At the same time, the ultrafine granularity of the panel operations prevents the use of QUARK inside the panel. Instead, the panel is manually multithreaded using cache coherency for synchronization and scheduled by QUARK as a single task, entered at the same time by multiple threads.
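The way a superscalar scheduler derives dependencies from argument directionality can be sketched as follows. This is a generic illustration of hazard detection, not the QUARK API: the function `build_dependencies`, the task tuples, and the use of column names as memory regions are assumptions, and priorities and nested parallelism are left out.

```python
IN, OUT, INOUT = "IN", "OUT", "INOUT"


def build_dependencies(tasks):
    """Derive dependency edges from a serial task list.

    `tasks` is a list of (name, [(region, direction), ...]) in program order.
    Returns a set of (earlier_index, later_index) edges.
    """
    last_writer = {}
    readers_since_write = {}
    edges = set()
    for t, (_name, args) in enumerate(tasks):
        for region, direction in args:
            if direction in (IN, INOUT) and region in last_writer:
                edges.add((last_writer[region], t))  # read after write
            if direction in (OUT, INOUT):
                for r in readers_since_write.get(region, []):
                    if r != t:
                        edges.add((r, t))  # write after read
                if region in last_writer:
                    edges.add((last_writer[region], t))  # write after write
        # Record this task's accesses only after all its hazards are found.
        for region, direction in args:
            if direction in (OUT, INOUT):
                last_writer[region] = t
                readers_since_write[region] = []
            else:
                readers_since_write.setdefault(region, []).append(t)
    return edges


# A three-column example: panel 0, two updates, panel 1, one update.
example_tasks = [
    ("panel(0)", [("col0", INOUT)]),
    ("update(0,1)", [("col0", IN), ("col1", INOUT)]),
    ("update(0,2)", [("col0", IN), ("col2", INOUT)]),
    ("panel(1)", [("col1", INOUT)]),
    ("update(1,2)", [("col1", IN), ("col2", INOUT)]),
]
```

In the example, no edge connects `update(0,2)` and `panel(1)`, so the second panel factorization may run concurrently with the remaining update from the first panel, which is the overlap that lookahead exploits.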
Communication
Communication is shown in Fig. 8. Each panel factorization is followed by a broadcast of the panel to all the GPUs. After each update, the GPU in possession of the leading leftmost column sends that column back to the CPUs (host memory). These communications are expressed as QUARK tasks with proper dependencies linking them to the computational tasks. Because of the use of lookahead, the panel factorizations can proceed ahead of the trailing submatrix updates and so can transfers, which allows for perfect overlapping of communication and computation, as further discussed in the following section.
RESULTS
This section includes a precise description of the hardware-software environment, followed by the performance results and a detailed discussion.
Hardware and Software
The system used for this work couples one CPU board with four sockets and one GPU board with four sockets. The GPU board is an NVIDIA Tesla S2050 system with four Fermi chips, 14 multiprocessors each, clocked at 1.147 GHz. The CPU board is an H8QG6 Supermicro system with four AMD Magny Cours chips, 12 cores each, clocked at 2.1 GHz.
The theoretical peak of a single CPU socket amounts to $2.1\ \text{GHz} \times 12\ \text{cores} \times 4\ \text{ops per cycle} \approx 101$ Gflop/s, making it approximately 403 Gflop/s for all four CPU sockets. The theoretical peak of a single GPU amounts to $1.147\ \text{GHz} \times 14\ \text{cores} \times 32\ \text{ops per cycle} \approx 514$ Gflop/s, making it approximately 2,055 Gflop/s for all four GPUs. The combined CPU-GPU peak is approximately 2,459 Gflop/s.
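The peak figures can be reproduced with a few lines of arithmetic. The helper name `peak_gflops` is hypothetical; the clock rates, core counts, and operations per cycle are the ones quoted above.

```python
def peak_gflops(clock_ghz, cores, flops_per_cycle, sockets=1):
    """Theoretical peak in Gflop/s: clock rate x cores x flops per cycle x sockets."""
    return clock_ghz * cores * flops_per_cycle * sockets


cpu_peak = peak_gflops(2.1, 12, 4, sockets=4)     # four Magny Cours sockets
gpu_peak = peak_gflops(1.147, 14, 32, sockets=4)  # four Fermi GPUs
combined_peak = cpu_peak + gpu_peak
```

Note that the combined value is obtained by summing the unrounded peaks, which is why it rounds to 2,459 rather than to the sum of the two rounded figures.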
The system runs Linux kernel version 2.6.35.7 (Red Hat distribution 4.1.2-48). The CPU part of the code is built using GCC 4.4.4. Intel MKL version 2011.2.137 is used for BLAS calls on the CPUs. The GPU part of the code is built using CUDA 4.0.
Performance
Fig. 9 shows the overall performance of the hybrid LU factorization, and point, the size of memory on all GPUs is exceeded. Each GPU can provide 2.6 GB of error correcting code (ECC) protected memory. For comparison, the light gray line shows the performance of a CPU-only run using all 48 CPU cores, which is equivalent in behavior and performance to a call to the DGETRF routine in PLASMA.
Fig. 10 shows a small fragment in the middle of a run of size 23,040 (the smallest size exceeding 1 Tflop/s performance). In the CPU part, only the panel factorizations are shown. The steps shown in the figure correspond to factoring submatrices of size approximately 12,000. Due to the deep lookahead, panel factorizations on the CPUs run a few steps ahead of trailing submatrix updates on the GPUs. This allows for perfect overlapping of CPU work and GPU work. It also allows for perfect overlapping of communication between the CPUs and the GPUs, i.e., between the host memory and the device memories. Each panel factorization is followed by a broadcast of the panel to the GPUs (light gray DMA). Each trailing submatrix update is followed by returning one column to the CPUs (dark gray DMA).
Fig. 11 shows the performance of the panel factorization throughout the largest run (34,560), using different numbers of cores, for panels of width 192. The jagged shape of the lines reflects the fact that the panel cores have to compete for main memory with the other cores, which apply updates at the same time. Generally, more cores provide higher performance, due to more computing power and larger capacity of their combined caches. However, 24 cores (two sockets) provide only a small performance improvement over 12 cores (single socket) due to the higher cost of intersocket communication over communication within the same socket. In actual LU runs, the use of 12 cores turns out to always be optimal, even for large matrices. While 12-core panel factorizations are capable of keeping up with GPU updates, the remaining cores can be committed to CPU updates.
Fig. 12 shows the performance of the GPU DGEMM kernel throughout the entire factorization. The gray line shows the DGEMM kernel performance on a single GPU. The black line shows the performance of the 4-GPU DGEMM task. The jagged shape of the line is due to the load imbalance among the GPUs. The high peaks correspond to the calls where the load is perfectly balanced, i.e., the number of columns updated by the GPUs is divisible by 4. When this is not the case, the number of columns assigned to different GPUs can differ by one. The load imbalance can be completely eliminated by scheduling the GPUs independently. However, the potential performance benefits are only on the order of a few percent.
CONCLUSIONS
The results reveal the challenges of programming a hybrid multicore system with accelerators. There is a disparity in the performance of the CPUs and the GPUs to start with. It turns into a massive disproportion when the CPUs are given the difficult (synchronization-rich and memory-bound) task of panel factorization, and the GPUs are given the easy (data-parallel and compute-bound) task of matrix multiplication. While the performance of panel factorization on the CPUs is roughly at the level of 20 Gflop/s, the performance of matrix multiplication on the GPUs is almost at the level of 1,200 Gflop/s (two orders of magnitude). The same disproportion applies to the computational power of the GPUs versus the communication bandwidth between the CPU memory and the GPU memory (host to device). The key to achieving good performance under such adverse conditions is overlapping of CPU processing and GPU processing, and overlapping of communication.
SOFTWARE
The code is available from the authors upon request. If released, the code will be available under the modified BSD license.
Piotr Luszczek is a research director at the University of Tennessee Knoxville's Innovative Computing Laboratory. His core research activity is centered around performance modeling and evaluation. He has extensive experience with high performance numerical linear algebra and signal processing codes that achieve high efficiency on a varied array of hardware architectures, including massively parallel high end distributed memory machines, shared memory servers, and mobile platforms that all feature specialized and general purpose accelerators running on the major operating systems. His research also revolves around long-term energy consumption and performance trends in high performance and cloud computing. His contributions to the scientific community include conference proceedings, journals, book chapters, and patent applications that showcase his main research agenda and expertise, as well as programming paradigms, parallel language design and productivity aspects of high performance scientific computing. He is a member of the IEEE.
Mathieu Faverge received the PhD degree in computer science from the University of Bordeaux 1, France. He is a postdoctoral research associate at the University of Tennessee Knoxville's Innovative Computing Laboratory. His main research interests are numerical linear algebra algorithms for sparse and dense problems on massively parallel architectures, and especially DAG algorithms relying on dynamic schedulers. He has experience with hierarchical shared memory, heterogeneous and distributed systems, and his contributions to the scientific community include efficient linear algebra algorithms for those systems. He is a member of the IEEE.
Jack Dongarra holds an appointment at the University of Tennessee, Oak Ridge National Laboratory, and the University of Manchester. He specializes in numerical algorithms in linear algebra, parallel computing, use of advanced computer architectures, programming methodology, and tools for parallel computers. He received the IEEE Sid Fernbach Award in 2004 for his contributions in the application of high performance computers using innovative approaches; in 2008, he received the first IEEE Medal of Excellence in Scalable Computing; in 2010, he was the first recipient of the SIAM Special Interest Group on Supercomputing's award for Career Achievement; and in 2011, he received the IEEE IPDPS 2011 Charles Babbage Award. He is a fellow of the AAAS, ACM, and SIAM, and a member of the National Academy of Engineering. He is a life fellow of the IEEE.
