Basic Linear Algebra Subprograms (BLAS) are a set of low level linear algebra kernels widely adopted by applications involved with the deep learning and scientific computing. The massive and economic computing power brought forth by the emerging GPU architectures drives interest in implementation of compute-intensive level 3 BLAS on multi-GPU systems. In this paper, we investigate existing multi-GPU level 3 BLAS and present that 1) issues, such as the improper load balancing, ine cient communication, insucient GPU stream level concurrency and data caching, impede current implementations from fully harnessing heterogeneous computing resources; 2) and the inter-GPU Peerto-Peer(P2P) communication remains unexplored. We then present BLASX: a highly optimized multi-GPU level-3 BLAS. We adopt the concepts of algorithms-by-tiles treating a matrix tile as the basic data unit and operations on tiles as the basic task. Tasks are guided with a dynamic asynchronous runtime, which is cache and locality aware. The communication cost under BLASX becomes trivial as it perfectly overlaps communication and computation across multiple streams during asynchronous task progression. It also takes the current tile cache scheme one step further by proposing an innovative 2-level hierarchical tile cache, taking advantage of inter-GPU P2P communication. As a result, linear speedup is observable with BLASX under multi-GPU configurations; and the extensive benchmarks demonstrate that BLASX consistently outperforms the related leading industrial and academic implementations such as cuBLAS-XT, SuperMatrix, MAGMA.
INTRODUCTION
Matrix scaling, additions and multiplications are basic operations in linear algebra libraries, scientific simulations and deep learning. O✏oading these operations to the BLAS library, which is the case in general practices, can substantially improve the application performance due to the architecture specific optimizations inside BLAS kernels. This leads BLAS to become the standard building blocks for performing low level matrix operations in applications. Hence, the BLAS library directly a↵ects the application performance. In the last few years, the evolving GPU architectures, NVIDIA's Kepler and Maxwell in particular, feature thousands of stream processors, proven to be extremely efficient in computing level 3 BLAS.
While a multi-GPU system o↵ers appealing high performance, using it entails nontrivial e↵ort. A multi-GPU system typically consists of at least one CPU connected with peripheral GPUs on PCI-E. GPU operates on its private onboard RAM while CPU operates on the host RAM. For GPUs sharing the same I/O hub, they can directly communicate via PCI-E switch referred to as GPU P2P access. The following factors need to be considered to fully utilize such architecture: (1) nowadays GPUs of di↵erent architectures represent divergent computing capabilities; and even the realtime performance of a GPU varies with respect to factors such as kernel saturation and GPU occupancy, all of which pose a great challenges to load balancing. (2) Minimizing and overlapping communication is key to achieve high performance. (3) Reducing the CPU-GPU communication to the GPU-GPU communication further improves the communication and energy e ciency. Unfortunately, our study indicates the existing multi-GPU level-3 BLAS implementations fail to optimize toward these factors, thereby delivering sub-optimal performance.
In this paper, we present the BLASX 1 : a high-performance level-3 BLAS library for heterogeneous multi-GPU systems. We address the load balancing with a dynamic scheduling runtime, which handles task level workload variations, single GPU realtime performance variations and speed discrepancies among heterogeneous multi-GPUs. BLASX adopts a novel two level hierarchical tile caches to explore the tile temporal locality, in which we consider the GPU onboard RAM as the L1 tile cache and the combined multi-GPU RAMs as the L2 tile cache. The L1 tile cache minimizes global communication; and the L2 tile cache successfully reduces the CPU-GPU communication to the GPU-GPU communication. In implementing this hierarchical tile caches, we propose a new LRU algorithm to accommodate the asynchronous task progression and a new cache coherence protocol to ensure the data consistency on multi-GPUs. BLASX also optimizes the communication/computation overlapping on GPU streams so that the communication cost is negligible. Finally, BLASX o↵ers backward compatibility to the vast existing CPU BLAS based applications; thereby all the details, such as workload balancing, data caching, communication overlapping and memory management, can be ignored by library users.
We evaluate BLASX on two multi-GPU systems against the related leading academic and industrial projects including cuBLAS-XT, SuperMatrix and MAGMA. BLASX consistently outperforms the academic implementations on Everest with 3 NVIDIA Kepler K40c. In comparison with the highly optimized NVIDIA cuBLAS-XT, BLASX demonstrates on average 25% performance gain and 200% less communication volume. Makalu features 2 Kelper K40 and 2 Maxwell TITAN X. BLASX successfully tackles the heterogeneity and demonstrates linear speedup; whereas other libraries such as cuBLAS-XT, MAGMA and SuperMatrix su↵ers from poor scalability.
We organize the remaining paper as follows. Section II analyzes the background and related works; and section III briefly reviews the L3 BLAS tile algorithms. In Section IV, we elaborate the detailed design and implementations of BLASX including two level hierarchical tile caches and the scheduling runtime. We also present solutions to specific questions such as amortizing high frequency memory allocation and deallocation, communication and computation overlapping. The comprehensive evaluations of BLASX against existing state-of-art implementations are presented in the Section V. Finally, we conclude at Section VI.
BACKGROUND AND RELATED WORK
There are three levels of BLAS, divided with respect to the complexity of operations. Level-1 (L1) BLAS targets vector operations in O(n) such as vector dot products and vector norms. Level-2 (L2) BLAS targets matrix-vector operations in O(n 2 ) such as matrix-vector multiplication. Level-3 (L3) BLAS [1] targets matrix operations in O(n 3 ) time such as matrix-matrix multiplications. The focus of this research is on L3 BLAS, which uses General Matrix Multiplication (GEMM) as the primary building block for the routines within the category. Therefore, the task of improving the performance of L3 BLAS can be reduced to the GEMM speed.
The massive but economic TFLOPS brought forth by the evolving GPU architectures drives interests in the various implementations of multi-GPU L3 BLAS. SuperMatrix [2] is one of the pioneers of matrix operation parallelization on SMP multicores, however it provides limited support on GPUs. The key insight of SuperMatrix is that a matrix can be factorized into a set of tiles. The Tomasolu algorithm [3] subsequently schedules these tiles in the out-oforder fashion. Fig.1a demonstrates that SuperMatrix suffers from costly nonoverlapped CPU-GPU data transfers. MAGMA [6] is another multi-GPU linear algebra library with incomplete LAPACK and BLAS support. It is a heavily hand tuned library relying on a static load balancer, which degrades its portability and performance on heterogeneous multi-GPU systems. Direct Acyclic Graph (DAG) scheduling has seen a revival in the recent years as it can naturally integrate with tile algorithms. PaRSEC [7] is a leading DAG scheduling runtime for dense linear algebraic operations. Building DAGs at runtime and scheduling tasks within DAGs, however, can be a huge cost for the small scale L3 BLAS operations. PaRSEC also assumes constant workload on each fine-grained task and constant speed on each GPU. Whereas it is possible to have workload variations; and the GPU kernel saturation also a↵ects the actual execution speed. In addition, PaRSEC only exploits tile reusing within a GPU; Caching on multiGPU memory spaces by taking advantages of GPU-GPU P2P communication still remains unexplored. None of the aforementioned libraries are backward compatible to legacy CPU BLAS. As a reaction to the market, NVIDIA released a commercial multi-GPU L3 BLAS, cuBLAS-XT [8], declaring it to be backward compatible when using the NVBLAS wrapper. cuBLAS-XT consistently moves tiles on demand into GPU RAM so that it can compute a large scale problem with a few MB of GPU RAM. Although major communication is overlapped, it does not address tile caching; and this aggressive on demand communication pattern extremely overloads the PCI-E as shown by the contiguous yellow blocks in Fig.1b .
In summary, these libraries cannot deliver the optimal performance due to following issues: 1) insu cient communication/computation overlapping subjects SuperMatrix and cuBLAS-XT to the suboptimal performance. 2) excessive communication in cuBLAS-XT overloads the PCI-E dragging down the overall performance. 3) low GPU occupancies in SuperMatrix and cuBLAS-XT indicate partial GPU utilization. 4) e cient GPU-GPU P2P communication remains unexplored in PaRSEC. We observe that the average throughput of CPU-GPU communication is 6.54 GB/S while the GPU-GPU is 7.80 GB/S. Fig.1c demonstrates BLASX successfully resolves issues 1!3; For other issues such as P2P communication, portability, handling hardware heterogeneity and load balancing, please refer to Performance Evaluation for detailed discussion.
A REVIEW OF L3 BLAS TILE ALGO-RITHMS
In this section, we give an overview of L3 BLAS tile algorithms. L3 BLAS is intended for O(n 3 ) matrix operations including General Matrix Multiplication (GEMM), symmetric rank-k update (SYRK), symmetric rank-2k update (SYR2K), triangular matrix multiplication (TRMM), symmetric matrix multiply (SYMM), and triangular solve with multiple right hand side (TRSM). Since the Hermitian matrix multiplication (HEMM), Hermitian rank-k update (HERK), and Hermitian rank-2k update (HER2K) are the complex counterparts of GEMM, SYRK and SYR2K respectively, we omit their discussion in this paper. 
Representing Matrix as Tiles

Traditional GEMM Based L3 BLAS Tile Algorithms
The traditional tile implementation of L3 BLAS on the CPU relies on a highly optimized GEMM and a small amount of L1 and L2 BLAS [9] or other L3 BLAS routines. The following equations illustrate the none-transpose, upper triangular cases of L3 BLAS tile algorithms: GEMM, SYRK, TRSM, TRMM, SYR2K and SYMM, respectively:
A Simple Trick to Matrix Transpose
The transpose of a tiled matrix requires a transpose on each tiles in addition to swapping the tiles A ij and A ji to be consistent with the definition of matrix transpose. The concept is as follows: 
This simple trick significantly facilitates the matrix transpose. Rather than physically transposing the entire matrix, we can retrieve the tile A ji and transpose the tile inside the BLAS kernel to implicitly transpose the matrix. 1 presents the GEMM percentages in L3 BLAS with respect to 3 square matrix sizes. The percentages of GEMM, as indicated in 1a to 1f, increase with the size of matrices to a point that the entire computation eventually be dominated by the GEMM kernel, achieving GEMM-like performance.
The GEMM Dominant L3 BLAS
BLASX: A MULTI-GPU L3 BLAS WITH A LOCALITY AWARE DYNAMIC RUN-TIME
We organize this section as follows. We begin by introducing a set of new L3 BLAS tile algorithms for heterogeneous multi-GPUs. Subsection 4.1 elaborates the L3 BLAS taskization; Subsection 4.2 elaborates a novel two-level hierarchical tile cache on multi-GPU RAMs; Subsection 4.3 elaborates our locality aware scheduling runtime covering load balancing, scheduling infrastructure, and scheduling strategies; Subsection 4.4 elaborates communication/computation overlapping and GPU out-of-core operations. In the end, we present a novel heap design to amortize the overhead introduced by high-frequency GPU memory allocation/deallocation in Subsection 4.5.
Alg.1 describes the skeleton of the proposed L3 BLAS algorithms for heterogeneous multi-GPUs. Lines 1 to 6 indicate the general runtime procedures; and lines 8 to 25 indicate the GPU specifics during the computation. In lines 1 to 6, the runtime initializes a hierarchical tile cache and a global non-blocking task queue. Then a CPU thread is spawned for each GPU to submit instructions. The threads join together after the global task queue depletes. In lines 8 to 25, GPUs concurrently retrieve tasks and interleave them via multi-streams to overlap the communication during the asynchronous progression. The line 13 indicates two ways of task retrieval, either from global task queue or working stealing. A GPU steals tasks from other Reservation Stations (RS), which only happens when the GPU exhausts tasks on RS while the global queue is also empty. Lines 22 and 23 reflect tile reuses with the proposed tile caches. Line 25 suggests an asynchronous L3 BLAS kernel invocation, the type of which is dictated by Eq.1 given the tile indices i, j and k. Lines 18, 19 and 24 indicate the communication/computation overlapping. Line 17 reflects the ALRU reader updates. Specifically the reader of a tile is atomically incremented if a task need it. On the other hand, the StreamsSynch() 17 ReaderU pdate() 18 for k do 19 for task 2 Top 4 Prioritized Tasks in RS do 20 i task, j task
reader is atomically decremented if a task releases the tile. We only update readers promptly after the synchronization point because that's the only place to inform the tile status.
Taskizing L3 BLAS
We define the task as solving a C ij in Eq.1. Tile algorithms yield the insight to systematically break down the output matrix C into a set of tiles C ij , the computation of which, involves reading A ik , B kj , C ij and an independent write of C ij . Hence concurrently solving the C ij is data hazards free. Eq.1 also indicates that the workload of C ij in non-GEMM routines varies according to the upper bound of k. In summary, a task has 3 notable properties by following definition:
• Reading the inputs for a task is data dependency free.
• Concurrent writing a task's output is data race free.
• The workload of each task varies. Given a matrix C of size M ⇥ N and tile size T, the degree of parallelism is as follows:
In our implementation, a task holds the necessary metadata to solve a C ij such as tile indices i, j and k, the dimensions of C ij , and its host address. The runtime virtually slices a matrix and stores the tile metadata in tasks. Consequently, 
Two Level Hierarchical Tile Caches
Fig .2 demonstrates the structure of separate memory spaces on the multiGPU system. Each GPU equips with a private RAM; and CPUs share the host RAM. Nowadays, GPU RAM can be up to 12 GB while a single double precision matrix of size 32768 ⇤ 32768 is 8.6 GB. The relatively small GPU RAM capacity limits the GPU in-core computing from handling the large scale L3 BLAS. One solution using tile algorithms is to dissect a large L3 BLAS into smaller ones, and solve them in succession. At each time, we move in tiles from the host on demand. Frequent GPU o↵-chip memory access, however, overloads the PCI-E and degrades the performance to be suboptimal. Meanwhile it is possible to reuse certain tiles in separate tasks as indicated by Eq.1. Therefore, exploiting the tile temporal locality by caching the most frequently used ones on the GPU RAM is necessary.
We present a novel two level fully associative tile caches in Fig.2 . The L1 tile cache is implemented using GPU onboard RAM; the L2 tile cache is implemented using the combined memory spaces of GPUs, which share the same I/O hub. The L1 tile cache hit enables direct tile reuse; the L2 tile cache hit reduces the CPU-GPU communication to GPU-GPU communication by retrieving the tile from the hardware adjacent GPU. The rational of implementing the L2 cache are twofold: (1) GPU P2P communication better saturates PCI-E delivering at least 6 GB/s performance. [10] (2) The comparably faster GPU RAM reduces the latency of data fetching. Therefore it is more cost e↵ective to retrieve a tile from a GPU than from the host RAM.
To implement the L1 tile cache, we need a LRU to discard the least frequently used tiles. Unfortunately the vanilla LRU algorithm [11] can not accommodate the asynchronous kernel launches in BLASX. Consequently, we propose the Approximate Least Recent Used (ALRU) to handle asynchronicities. Alg.2 presents the three basic operations (Translate, Dequeue, and Enqueue) in the proposed ALRU. Translate simulates both the cache hit and miss by providing the GPU address given the CPU address; Dequeue aims at freeing the least recent used tile in the cache; and Enqueue aims at moving a new tile in the cache. We use a list to implement the ALRU with the tail representing the least recently To implement the L2 tile cache, we need a cache coherence protocol to ensure the consistency of shared tiles in multiple places. We adopt the MESI-X cache coherence protocol, a variant of MESI protocol [12] , to work for the BLASX's Fig.3 demonstrates the state transition diagram of the proposed MESI-X.
Scheduling Infrastructure and Strategies
Our scheduling runtime achieves three specific goals: the proper load balancing on heterogeneous multi-GPUs and multi-CPUs, the e cient communication with locality aware scheduling and the su cient overlapping of computation and communication. Fig.4 presents the scheduling infrastructure in our locality aware scheduling runtime, which consists of 4 major components:
1) GPU Computation Thread : a CPU thread to submit tasks for a specific GPU. To avoid the OS scheduling preemption, we bind the thread to a dedicated CPU core. The communication/computation overlapping requires at least 2 tasks concurrently running on streams; while Wei et al. [7] demonstrate no performance gain when using more than 4 streams. This leads us to adopt 4 concurrent tasks to overlap the computation/communication, which also explains the 4 streams used in Fig.4 .
2) CPU Computation Thread : a CPU thread to submit tasks for the rest of CPU cores. Peng et al. proposes the hybrid tile layout to CPU Cores and GPUs due to the inherent devices' performance di↵erences [13] . We adopt the same concept but di↵erent approach. The CPU cores dequeue one task at each time and solve the task with a multithreaded BLAS kernel, where the tile is further subdivided. the upcoming tasks for a GPU. The runtime conducts work stealing and priority scheduling on it. Each slot of a RS conveys a task priority, a task metadata, and a stream index. 4) Non-blocking Task Queue: It is a non-blocking queue allowing e cient concurrent dequeue and enqueue operations based on the algorithm proposed by Maged and Michael [14] .
On heterogeneous systems, tasks can be executed on any computing device with load balancing being key to achieve optimal performance. In reality, it is possible to have task workload variation computed on the processors of various speeds. The two e↵ects accentuate the uncertainty of processors on tasks consumption speed. One simple load balancing solution is to distribute tasks based on the inherent processors' speeds, which is the case in PaRSEC; the actual execution time, however, changes with the GPU kernel saturation and the actual tasks' workload. Hence this solution may lead heavy tasks to clog on the slower processor(s). Our load balancing scheme leverages the task-level workload and the processors' real time speed to achieve the optimal performance. We treat GPUs about to entering idle states as a sign of demand, causing the thread to dequeue tasks. The key of our dynamic task-scheduling runtime is that processors simultaneously pull out tasks from the global non-blocking task queue by their demands. Specifically, faster processors consume more tasks and initiate more demands while the slower processors consume fewer and demand less. The load is thereby adjusted according to the real time demand of individual processors. The perfect scenario in this case is such that processors, regardless of speed di↵erence, spend identical time on task execution without idling.
We adopt the work sharing and the work stealing to achieve this demand driven load balancing on the proposed runtime infrastructure. Lines 10 to 13 in Alg.1 indicate each GPU populates the a liated RS either through global task queue or work stealing. The global non-blocking task queue simulates the work sharing by enabling concurrent task retrieval from multi-GPUs. Since the line 16 in Alg.1 is a synchronization point, a GPU will not demand tasks from RS unless it finishes current ones. Hence GPUs that demand more attempt to pull out more tasks from the queue. The work stealing intends to further improve the load balancing under the situation of empty global task queue but full RS on GPUs. In this case, a underutilized GPU or CPU takes the initiative to steal a task from a overloaded RS, further balancing tasks at a finer level.
Prioritizing tasks with the better temporal locality lessens unnecessary communication. Lines Figure 6 : A fast heap design to amortize the GPU memory allocation/deallocation overhead.
coming in. The runtime populates tasks' priorities based on the extent of potential cache hits by checking ALRU to figure out existing tiles in the L1 and L2 tile cache. As new tasks coming, the runtime resolves dependences and assign higher priorities to tasks that will have better cache hits. The tile locality has 3 scenarios: it hits the L1 tile cache; it hits the L2 tile cache and it is located at the host RAM. In terms of communication cost, there is no cost for L1 cache hit, and the L2 cache hit incurs less cost than retrieving the tile from the host. Since L1 tile cache incurs less cost than L2 tile cache, we credit a L1 tile cache hit with 2 point while a L2 tile cache hit with 1 point. The tasks with higher credits are thereby assigned higher priorities to maximize cache hits so that communication volume is minimized globally.
Since the focus of this research is to implement an e cient yet highly portable heterogeneous multi-GPU L3 BLAS, this priority scheme works well in experiments. Future study of other priority schemes is necessary; and they can be easily integrated in BLASX.
Overlapping Computation with Communication
The CUDA stream is sequential operations executed in the issued order on the GPU with two notable properties. First, the operations on di↵erent streams can be simultaneously executed within the same physical device. This property enables the communication/computation overlapping via moving the data on one stream while executing kernels on another one. Secondly, streams can divide the GPU processing power between a few tasks by allocating segments to each task in turn. With these two properties, a tight interleaving of tasks on multiple streams can render the actual communication cost trivial.
L3 BLAS tile algorithms indicate that a task essentially involves k 2 [1, z] steps of kernel execution. The runtime overlaps the communication/computation by interleaving tasks as follows: First, the RS directly maps top 4 prioritized tasks onto 4 CUDA streams. Second, the runtime executes each step, k, on all streams such that tasks in the step are computed before advancing to the next step (line 19-25 Alg.1), which is guaranteed by the sequential execution feature of CUDA stream. Solving a step of tasks involves data transfer of the required tiles (if cache miss) followed by kernel executions. As tasks progress on multiple streams in this way, the data transfer on a stream eventually overlaps with the kernel execution on another one. In sum, we can consider the overall execution time to be the sum of kernel execution time, plus initial tile move in, and plus final tile move out. This negligible communication cost enables the input and output matrices to reside on the host memory. Consequently we claim our GPU operations are out-of-core. 
Amortize GPU RAM Allocation and Deallocation Overhead
GPUs require memory allocation for tiles move-in and deallocation for tiles move-out. Increasing the computation scale leads to the performance deterioration due to the significant overhead of memory allocation/deallocation. Fig. 5 presents the performance degeneration with CUDA's native memory management utilities such as cudaMalloc and cud-aFree. As a consequence, we implement a fast heap based GPU memory management utilities, BLASX Malloc, to alleviate this issue. The core concept of it is to amortize the allocation/deallocation overhead by adopting a big chunk of GPU memory as the preallocated heap. Fig. 6 presents the basic scheme of the proposed heap design, which consists of a meta-data list, a occupied list and an empty list. A node in the meta-data list traces the length of memory segment and its occupation status. Each block of the occupied and empty list tracks the allocated segment and the free segment respectively. The dynamics of this heap is as follows: During the allocation, the heap searches for the first node with enough memory in the empty list, which is subsequently split into two nodes. One for the occupied list recording the allocated memory; The other for the empty list recording the residual memory. During the deallocation, the runtime locates the segment in the occupied list with a hashtable. If either the node's left or right neighbors are contiguous with the node in terms of memory, they merge together. Then the segment is marked as free and placed back to empty list afterwards. Fig. 5 demonstrates our heap based method e↵ectively amortizes the memory allocation and deallocation overhead.
PERFORMANCE EVALUATION
In this section, we present comprehensive evaluations of our L3 BLAS routines. We conducted the experiments on two shared memory machines Everest and Makalu, the specifications of which are included TABLE 2.
The Comprehensive L3 BLAS Benchmark
We evaluate performance against a commercial product cuBLAS-XT and seminal academic libraries such as Super-Matrix, PaRSEC and MAGMA (Please note PaRSEC only has multiGPU DGEMM; and MAGMA only has multiGPU DTRSM and DSYR2K). We setup benchmarks on the machine Everest (TABLE 2) using double precision L3 BLAS routines. The bechmarks of MAGMA, SuperMatrix and PaRSEC used exemplary benchmark utilities provided in these libraries. The memory range of input or output matrices is page locked to expedite PCI-E transfer. In practice, user only needs to pagelock a matrix once to take advantage of fast PCI-E. If the matrix is repetitively used, the time spent on page-locking is amortized. Therefore we ex- Bidirectional Host and GPU GPU to GPU Throughput 6.54 GB/s 7.8 GB/s clude the page-locking time from performance metric, which is also the case for exemplary test routines in PaRSEC, MAGMA and cuBLAS-XT. The alpha and beta are two random float constants; other parameters such as UP LO, SIDE, T RANS and DIAG are ensured to be same in each comparison. The input matrix size N starts from 1024 to 39936 at increments of 1024. The performance data are from the average of 3 runs; execution order of BLASX, cuBLAS-XT, MAGMA, SuperMatrix and PaRSEC is randomized to eliminate the potential ordering influence. Fig.7 demonstrates the comprehensive benchmarks on Everest. In single GPU benchmarks, the mean performance of BLASX converges to 92.68% of the in-core cuBLAS DGEMM peak; whereas the average performance of PaRSEC, MAGMA, cuBLAS-XT and SuperMatrix attains 91.10%, 81.28% , 79% and 63.99% of in-core cuBLAS DGEMM peak respectively. Even though PaRSEC achieves comparable performance, its GPU in-core operation limits PaRSEC to handle matrix sizes N > 22528 as the required memory, 22528 ⇤ 22528 ⇤ 8 ⇤ 3 = 12.18 GB, is beginning to exceed the GPU RAM 12 GB capacity. This also explains partial benchmarks on the MAGMA DSYR2K and DTRSM. The su cient GPU communication/computation overlapping is one of predominant factors to the high performance of BLASX. Whereas Supermatrix follows a simple fork and join model blocking kernel launches until the on demand data is transferred. The nonoverlapped communication in SuperMatrix incurs too much latency to delivery comparable performance. Hence we omit its discussion in the rest of paper.
BLASX demonstrates linear speedups and the highest scalability under multiGPU configurations. Fig.7 indicates performances increase with the matrix size and reaches a plateau after N > 15000. At the matrix size 16384, the DSYR2K speedup of BLASX, cuBLAS-XT, MAGMA on two GPUs are 1.99x, 1.83x, 1.91x; and the triple GPU speedup are 2.91, 2.16, 2.88. However, real world applications often entail small scale matrix N < 15000. Measuring parallelizations at various matrix sizes is more convincing than concluding the speedup at a particular matrix size. Since the parallel e ciency is a performance metric to describe the scalability at a specific problem size N , we calculate the average parallel e ciency based on 39 matrix sizes N 2 [1024, 39936] to yield a global insight about scalabilities at various matrix sizes; and we adopt forward padding to partial benchmarks in MAGMA and PaRSEC. In larly BLASX DSYMM is 32.4% higher than the second best implementation by cuBLAS-XT. There are 4 major factors that contribute to the success of BLASX, which are 1) the demand-driven load balancing, 2) the seamless GPU occupancy, 3) the significantly less communication volume and 4) the e cient GPU-GPU P2P communication. We investigate each factors as follows. First, our demand-driven dynamic load balancing is key to the BLASX high performance. In Fig.8 , we dissect each GPU execution profiles into 3 major components-the computation (COMPT), the unoverlapped communication (COMM), the synchronization and free gaps among kernel launches (OTHER). A typical ideal scheduler allows each GPUs, regardless of di↵erences, to spend identical time during the execution. Fig.8 indicates that dynamic schedulers employed by BLASX and PaRSEC are better than static schedulers by MAGMA and cuBLAS-XT. For example, the average elapsed time di↵erences between the fastest GPU and the slowest GPU of cuBLAS-XT and BLASX are 0.2961 and 0.0391 seconds; and the same metric for MAGMA (only count COMM) and BLASX is 0.7837 and 0.0457 seconds. The DGEMM of PaRSEC and BLASX attains comparable performance of 0.0252 and 0.0285 seconds.
Second, the seamless GPU occupancy enables BLASX to fully saturate each GPU. In Fig.8 , BLASX demonstrates the least none-computation cost (OTHER+COMM). The communication is counted toward latency if it is not overlapped. The average communication latency of BLASX is 0.0575s while cuBLAS-XT is 0.4917s. The di↵erence is largely due to the significantly reduced communication volume and the seamless stream-level overlapping and kernel launches. OTHER includes synchronization latency and the minor GPU idle gaps among kernel launches. The synchronization is necessary to ensure the mathematical rigorousness; and idle gaps depend on the tightness of kernel launches on multistreams. Increasing streams, as demonstrated by Wei et al [7] , im- Table 5 : The communication volume(in MB) of GPU a,b,c at the input square matrix size N = 16384 on Everest. † The red represents the volume of GPU to GPU communication.
‡ The black represents the volume of bidirectional Host to Device communication.
• The Peer access is only available between GPU2 and GPU3 on the machine Everest.
proves the GPU saturation by reducing those gaps. BLASX dynamically interleaves tasks over multiple streams, while cuBLAS-XT adopts two. Third, the hierarchical tile caches in BLASX dramatically diminish the communication volume. According to TABLE 5, the average communication volume of cuBLAS-XT, 15143 MB, is 2.95 times of BLASX, 5132 MB. The L1 tile cache of BLASX exploits the tile temporal locality to minimize global communication, which is not the case for cuBLAS-XT. The increasing gaps among the GPU clock frequency (1.43 TFLOPS on K40c), GPU memory bandwidth (288 GB/sec on K40c) and PCI-E bandwidth (31.51 GB/s v4.0 x16) identify GPU o↵-chip memory access extremely expensive. Consequently, the excessive data transfer of cuBLAS-XT incurs a huge latency penalty to its performance. The DGEMM data also indicates that BLASX (6219 MB) saves 12% communication over PaRSEC (6961 MB).
Reducing the CPU-GPU communication to the GPU-GPU communication further improves the communication e ciency of BLASX. One of the defining features of BLASX is the implementation of L2 tile caches, the purpose of which is to retrieve tiles from hardware adjacent GPUs under L1 misses. TABLE 4 justifies our L2 tile cache proposal indicating that the average GPU-GPU data transfer is 19.27% faster than the CPU-GPU transfer. We highlight the GPU-GPU communication volume in TABLE 5. The interGPU communication only happens between GPU2 and GPU3 as only them share the same PCI-E switch on Everest. We expect the e cient GPU-GPU communication eventually become a performance critical component with more GPUs available on the system. Fig.9 presents the SGEMM performance of cuBLAS-XT and BLASX on Makalu, equipped with 2 TITAN-X and 2 K40c. The benchmark is conducted under the same experiment setup. A TITAN-X and K40c deliveries the peak SGEMM performance of 5300 GFLOPS and 3000 GFLOPS respectively. cuBLAS-XT adopts a static scheduling strategy that evenly distributes tasks to each GPU. This strategy fails to consider the performance variation of heterogeneous GPUs. Therefore cuBLAS-XT su↵ers from the poor scala- bility on Makalu. Particularly the 4 GPU performance of cuBLAS-XT is worse than the 2 GPU performance due to the improper load balancing. In contrast BLASX demonstrates linear scaling on Makalu because of the dynamic scheduling runtime. Although the first two GPUs are TITAN-X, cuBLAS-XT does not cache tiles on the GPU RAM; while BLASX minimizes communication with two level hierarchical tile caches. Fig.10 demonstrates the runtime overhead with respect to the total execution time at di↵erent problem sizes. The majority of our scheduling and cache algorithm overlaps with the GPU computation. Each BLASX call still involves the overhead such as tile cache initialization, the thread creation and the taskization. We calculate the runtime overhead as the total execution time minus the GPU execution time. The runtime overhead is trivial when N > 1000. Whereas the runtime overhead drastically increases when N < 500. Therefore BLASX is not suitable for repetitive small matrix operations.
The Only Tuning Parameter-Tile Size
We strive to develop a portable L3 BLAS library that delivers the optimal performance on di↵erent platforms with the least e↵ort from users. In BLASX, the tile size is the only tuning parameter. Generally a large tile size reduces the degree of parallelism while a small tile size poorly saturates GPU and PCI-E. The ideal tile size is a result of trade-o↵ among GPU saturation, PCI-E e ciency and available parallelism. Fig. 11 Figure 11 : The performance variations with respect to different tile sizes on Everest.
the overall DGEMM performance at two matrix sizes. The performance increases with tile size and reaches a plateau eventually. In this case, we choose the tile size 1024 ⇥ 1024 for benchmarks on Everest as a result of tradeo↵ between the system e ciency and the parallelism.
Applications
BLASX intends to provide multiGPU performance boost by directly replacing the CPU based alternatives. Libraries such as MAGMA and PaRSEC requires extensive changes on the legacy code to comply with standards while cuBLAS-XT requires commercial licenses without delivering a proportional increase of performance. In this section, we present a few applications of BLASX as follows: a) Ca↵e [15] is one of the most popular deep learning frameworks nowadays, in which BLAS computes the image convolution [19] , forwards and backwards passes of densely connected layers [16] . We built a CPU version of Ca↵e and changed the BLAS linkage to BLASX. An Artificial Neural Network (ANN) of architecture, 3072!16384!16384!10, was trained with CIFAR-10 dataset [17] to classfiy images in 10 categories. We used Ca↵e's benchmark utilities to measure the average elapsed time for a forward and backward passes on the machine Makalu. The experiment data demonstrates that BLASX accelerates ANN training up to 2.48 w.r.t Ca↵e GPU training and 62.3 w.r.t Ca↵e CPU training. In terms of model parameters, the Ca↵e's single GPU training can accommodate up to 1.5 ⇤ 10 9 parameters on a 12 GB GPU. BLASX, however, enables a model of 3.2 ⇤ 10 10 parameters on a multiGPU server with 256 GB host RAM 2 . Please refer to our poster [18] for more details b) MATLAB, R and Octave are renowned technical computing languages widely used in academia and industry. These scientific languages o✏oad the primitive matrix or vector operations to the BLAS to ensure the performance. Users can integrate BLASX with MATLAB by simply exporting the environment variable BLAS VERSION to the location of BLASX. Although MATLAB has the GPU computing toolbox, it requires users to manually manage the data on the GPU yet without multiGPU support. BLASX elegantly resolves these issues with its underlying dynamic runtime. Table 6 presents a few exemplary instances of MATLAB's SIMULINK libraries while using BLASX.
CONCLUSION
Existing L3 BLAS implementations su↵ers from issues such as backward compatibility, insu cient communication and computation overlapping, ine cient communication and poor scalability. In this paper, we design and implement BLASX, a suite of L3 BLAS, that delivers the best L3 BLAS performance on heterogeneous multiGPU systems. We introduce a novel two level hierarchical tile cache to reduce the global communication; Our locality aware scheduling runtime perfectly balances the load across heterogeneous GPUs and CPUs. We optimized communication and computation overlapping on streams to renders trivial communication cost; thereby BLASX computes in a out-of-core fashion that ensures input and output data always remains on the host RAM. Extensive benchmarks demonstrate that BLASX consistently outperforms the leading industrial and academia related projects in terms of performance, scalability, and communication e ciency. More importantly, BLASX requires the least e↵ort from users to integrate with the vast amount of legacy BLAS based applications.
