BLASX: A High Performance Level-3 BLAS Library for Heterogeneous
  Multi-GPU Computing by Wang, Linnan et al.
BLASX: A High Performance Level-3 BLAS
Library for Heterogeneous Multi-GPU Computing
Linnan Wang∗, Wei Wu†, Jianxiong Xiao‡, and Yi Yang§
∗ Georgia Institute of Technology
† The University of Tennessee, Knoxville
‡ Princeton University
§ NEC Laboratory
Abstract—Basic Linear Algebra Subprograms (BLAS) are a
set of low level linear algebra kernels widely adopted by appli-
cations involved with the deep learning and scientific computing.
The massive and economic computing power brought forth by
the emerging GPU architectures drives interest in implementation
of compute-intensive level 3 BLAS on multi-GPU systems. In
this paper, we investigate existing multi-GPU level 3 BLAS and
present that 1) issues, such as the improper load balancing,
inefficient communication, insufficient GPU stream level con-
currency and data caching, impede current implementations
from fully harnessing heterogeneous computing resources; 2)
and the inter-GPU Peer-to-Peer(P2P) communication remains
unexplored. We then present BLASX: a highly optimized multi-
GPU level-3 BLAS. We adopt the concepts of algorithms-by-tiles
treating a matrix tile as the basic data unit and operations
on tiles as the basic task. Tasks are guided with a dynamic
asynchronous runtime, which is cache and locality aware. The
communication cost under BLASX becomes trivial as it perfectly
overlaps communication and computation across multiple streams
during asynchronous task progression. It also takes the current
tile cache scheme one step further by proposing an innovative
2-level hierarchical tile cache, taking advantage of inter-GPU
P2P communication. As a result, linear speedup is observable
with BLASX under multi-GPU configurations; and the extensive
benchmarks demonstrate that BLASX consistently outperforms
the related leading industrial and academic projects such as
cuBLAS-XT, SuperMatrix, MAGMA and PaRSEC.
Keywords—BLAS, scheduling runtime, tile algorithms, multiG-
PUs, hierarchical tile caches
I. INTRODUCTION
Matrix scaling, additions and multiplications are basic
operations in linear algebra libraries, scientific simulations
and deep learning. Offloading these operations to the BLAS
library, which is the case in general practices, can substantially
improve the application performance due to the architecture
specific optimizations inside BLAS kernels. This leads BLAS
to become the standard building blocks for performing low
level matrix operations in applications. Hence, the BLAS
library directly affects the application performance. In the
last few years, the evolving GPU architectures, NVIDIA’s
Kepler and Maxwell in particular, feature thousands of stream
processors, proven to be extremely efficient in computing level
3 BLAS.
While a multi-GPU system offers appealing high perfor-
mance, using it entails nontrivial effort. A multi-GPU system
typically consists of at least one CPUs connected with periph-
eral GPUs on the PCI-E. GPU operates on its private onboard
RAM while CPU operates on the host RAM. For GPUs sharing
0BLASX is publicly available at https://github.com/linnanwang/BLASX
the same I/O hub, they can directly communicate via PCI-E
switch referred to as GPU P2P access. The following factors
need to be considered to fully utilize such architecture: (1)
nowadays GPUs of different architectures represent divergent
computing capabilities; and even the realtime performance of a
GPU varies with respect to factors such as kernel saturation and
GPU occupancy, all of which pose a great challenges to load
balancing. (2) Minimizing and overlapping communication is
key to achieve high performance. (3) Reducing the CPU-
GPU communication to the GPU-GPU communication further
improves the communication and energy efficiency. Unfor-
tunately, our study indicates the existing multi-GPU level-3
BLAS fail to optimize toward these factors, thereby delivering
sub-optimal performance.
In this paper, we present the BLASX: a high-performance
level-3 BLAS library for heterogeneous multi-GPU systems.
We address the load balancing with a dynamic scheduling
runtime, which handles task level workload variations, single
GPU realtime performance variations and speed discrepancies
among heterogeneous multi-GPUs. BLASX adopts a novel
two level hierarchical tile caches to explore the tile tempo-
ral locality, in which we consider the GPU onboard RAM
as the L1 tile cache and the combined multi-GPU RAMs
as the L2 tile cache. The L1 tile cache minimizes global
communication; and the L2 tile cache successfully reduces the
CPU-GPU communication to the GPU-GPU communication.
In implementing this hierarchical tile caches, we propose a
new LRU algorithm to accommodate the asynchronous task
progression and a new cache coherence protocol to ensure
the data consistency on multi-GPUs. BLASX also optimizes
the communication/computation overlapping on GPU streams
so that the communication cost is negligible. Finally, BLASX
offers backward compatibility to the vast existing CPU BLAS
based applications; thereby all the details, such as workload
balancing, data caching, communication overlapping and mem-
ory management, can be ignored by library users.
We evaluate BLASX on two multi-GPU systems, Everest
and Makalu (TABLE II), against the related leading academic
and industrial projects including cuBLAS-XT, SuperMatrix,
MAGMA and PaRSEC. BLASX consistently outperforms the
academic implementations on Everest with 3 NVIDIA Kepler
K40c. In contrast to the highly optimized NVIDIA cuBLAS-
XT, BLASX demonstrates on average 25% performance gain
and 200% less communication volume. Makalu features 2
Kelper K40 and 2 Maxwell TITAN X. BLASX successfully
tackles the heterogeneity and demonstrates linear speedup;
whereas other libraries such as cuBLAS-XT, MAGMA and
SuperMatrix suffers from poor scalability.
ar
X
iv
:1
51
0.
05
04
1v
1 
 [c
s.D
C]
  1
6 O
ct 
20
15
(a) SuperMatrix
(b) StarPU
(c) cuBLAS-XT
(d) BLASX
Fig. 1: A snapshot of single GPU DGEMM execution profile
from SuperMatrix, StarPU, cuBLAS-XT and BLASX.
We organize the remaining paper as follows. Section II
analyzes the background and related works; and section III
briefly reviews the L3 BLAS tile algorithms. In Section IV, we
elaborate the detailed design and implementations of BLASX
including two level hierarchical tile caches and the scheduling
runtime. We also present solutions to specific questions such
as amortizing high frequency memory allocation and deal-
location, communication and computation overlapping. The
comprehensive evaluations of BLASX against existing state-
of-art implementations are presented in the Section V. Finally,
we conclude at Section VI.
II. BACKGROUND AND RELATED WORK
There are three levels of BLAS, divided with respect to the
complexity of operations. Level-1 (L1) BLAS targets vector
operations in O(n) such as vector dot products and vector
norms. Level-2 (L2) BLAS targets matrix-vector operations
in O(n2) such as matrix-vector multiplication. Level-3 (L3)
BLAS [1] targets matrix operations in O(n3) time such as
matrix-matrix multiplications. The focus of this research is on
L3 BLAS, which uses General Matrix Multiplication (GEMM)
as the primary building block for the routines within the
category. Therefore, the task of improving the performance
of L3 BLAS can be reduced to the GEMM speed.
The massive but economic TFLOPS brought forth by the
evolving GPU architectures drives interests in the various
implementations of multi-GPU L3 BLAS. SuperMatrix [2]
is one of the pioneers of matrix operation parallelization
on SMP multicores, however it provides limited support on
GPUs. The key insight of SuperMatrix is that a matrix can
be factorized into a set of tiles. The Tomasolu algorithm [3]
subsequently schedules these tiles in the out-of-order fash-
ion. Fig.1a demonstrates that SuperMatrix suffers from costly
nonoverlapped CPU-GPU data transfers. StarPU provides a
centralized interface to various accelerator technologies [4]. In
contrast to Supermatrix, StarPU supports versatile scheduling
algorithms such as work stealing [5] and priority scheduling
[6] while requiring manual annotations to optimize under a
specific problem. The insufficient communication/computation
overlapping and the low GPU saturation in Fig.1b demonstrate
the suboptimal DGEMM implementation in StarPU. MAGMA
[7] is another multi-GPU linear algebra library with incomplete
LAPACK and BLAS support. It is a heavily hand tuned library
relying on a static load balancer, which degrades MAGMA’s
performance on heterogeneous multi-GPU systems. Direct
Acyclic Graph (DAG) scheduling has seen a revival in the
recent years as it can naturally integrate with tile algorithms.
PaRSEC [8] is a leading DAG scheduling runtime for dense
linear algebraic operations. Building DAGs at runtime and
scheduling tasks within DAGs, however, can be a huge cost
for the small scale L3 BLAS operations. PaRSEC also assumes
constant workload on each fine-grained task and constant speed
on each GPU. It is possible to have workload variations and
the GPU kernel saturation also affects the actual execution
speed. In addition, PaRSEC only exploits tile reusing within a
single GPU; Caching on multiGPU memory spaces by taking
advantages of GPU-GPU P2P communication still remains
unexplored. None of the aforementioned libraries are backward
compatible to legacy CPU BLAS. As an reaction to the
market, NVIDIA released a commercial multiGPU L3 BLAS,
cuBLAS-XT [9], declaring it to be backward compatible when
using the NVBLAS wrapper. cuBLAS-XT consistently moves
tiles on demand into GPU RAM so that it can compute a
large scale problem with a few MB of GPU RAM. Although
major communication is overlapped, it does not address tile
caching; and this aggressive on demand communication pattern
extremely overloads the PCI-E as shown by the contiguous
yellow blocks in Fig.1c.
In summary, these libraries cannot deliver the optimal
performance due to following issues: 1) insufficient commu-
nication/computation overlapping subjects SuperMatrix and
StarPU to suboptimal performance. 2) excessive communica-
tion in cuBLAS-XT overloads the PCI-E dragging down the
overall performance. 3) low GPU occupancies in SuperMatrix
and StarPU indicate partial GPU utilization. 4) efficient GPU-
GPU P2P communication remains unexplored. We observe
that the average throughput of CPU-GPU communication
is 6.54 GB/S while the GPU-GPU is 7.80 GB/S. 5) static
scheduling in the cuBLAS-XT and MAGMA cannot tackle the
hardware heterogeneity. BLASX successfully resolves these
issues as demonstrated in Fig.1d; please refer to Performance
Evaluation for detailed discussion.
III. A REVIEW OF L3 BLAS TILE ALGORITHMS
In this section, we give an overview of L3 BLAS tile
algorithms. L3 BLAS is intended for O(n3) matrix operations
including General Matrix Multiplication (GEMM), symmetric
rank-k update (SYRK), symmetric rank-2k update (SYR2K),
triangular matrix multiplication (TRMM), symmetric matrix
multiply (SYMM), and triangular solve with multiple right
hand side (TRSM). Since the Hermitian matrix multiplication
(HEMM), Hermitian rank-k update (HERK), and Hermitian
rank-2k update (HER2K) are the complex counterparts of
GEMM, SYRK and SYR2K respectively, we omit their dis-
cussion in this paper.
A. Representing Matrix as Tiles
The tile algorithm logically partitions a matrix into its tiled
representation. Given tile size T in a matrix of size N ×M ,
it creates bN/T c × bM/T c square tiles of size T × T and
(dN/T e × dM/T e) − (bN/T c × bM/T c) non-square tiles.
TABLE I. GEMM percentages at 3 different matrix sizes N.
Routines N=5K N=10K N=20K
SYRK 74.5% 86.3% 92.8%
TRSM 68.5% 80.4% 89%
TRMM 69% 81.5% 92.8%
SYR2K 74.4% 85.4% 92.9%
SYMM 71.7% 84.9% 92.1%
Furthermore, the algorithm treats tiles, uniquely indexed by
row and column, as the basic elements in a matrix in lieu of
scalars. Operations on matrices are subsequently reduced to
operations on tiles. As our focus is the L3 BLAS, we assume
the tile indices of the output matrix are [i, j], the tile indices
of the matrix to the left side of multiply operator are [i, k] and
the tile indices of matrix to the right of multiply operator are
[k, j]. Hence tile indices i and j uniquely identify a tile, Cij,
in the output matrix while the upper bound of k represents the
computational intensity to solve the Cij.
B. Traditional GEMM based L3 BLAS Tile Algorithms
The traditional tile implementation of L3 BLAS on the
CPU relies on a highly optimized GEMM and a small amount
of L1 and L2 BLAS [10] or other L3 BLAS routines. The fol-
lowing equations illustrate the none-transpose, upper triangular
cases of L3 BLAS tile algorithms: GEMM, SYRK, TRSM,
TRMM, SYR2K and SYMM, respectively:
Cij = α
z∑
k=0
AikBkj + βCij (1a)
Cij = α
z∑
k=0
AikA
ᵀ
jk + βCij (1b)
Cij = αA
−1
ii
(
Bij −
z∑
k=i+1
AikCkj
)
(1c)
Cij = α
z∑
k=i+1
AikCkj +AiiCij (1d)
Cij = α
z∑
k=0
AikB
ᵀ
jk + α
z∑
k=0
BikA
ᵀ
jk + βCij (1e)
Cij = α
i∑
k=0
AᵀkiBkj + α
z∑
k=i+1
AikBkj + βCij (1f)
C. A simple trick to Matrix Transpose
The transpose of a tiled matrix requires a transpose on
each tiles in addition to swapping the tiles Aij and Aji to be
consistent with the definition of matrix transpose. The concept
is as follows:[
A00 A01
A10 A11
] [
B00 B01
B10 B11
]ᵀ
=
[
A00 A01
A10 A11
] [
Bᵀ00 B
ᵀ
10
Bᵀ01 B
ᵀ
11
]
This simple trick significantly facilitates the matrix transpose.
Rather than physically transposing the entire matrix, we can
retrieve the tile Aji and transpose the tile inside the BLAS
kernel to implicitly transpose the matrix.
D. The GEMM Dominant L3 BLAS
TABLE I presents the GEMM percentages in L3 BLAS
with respect to 3 square matrix sizes. The percentages of
GEMM, as indicated in 1a to 1f, increase with the size of
matrices to a point that the entire computation eventually
be dominated by the GEMM kernel, achieving GEMM-like
performance.
IV. BLASX: A MULTI-GPU L3 BLAS LIBRARY WITH A
LOCALITY AWARE DYNAMIC SCHEDULING RUNTIME
We organize this section as follows. We begin by introduc-
ing a set of new L3 BLAS tile algorithms for heterogeneous
multi-GPUs. Subsection A elaborates the L3 BLAS task-
ization; Subsection B elaborates a novel two-level hierarchical
tile cache on multi-GPU RAMs; Subsection C elaborates our
locality aware scheduling runtime covering load balancing,
scheduling infrastructure, and scheduling strategies; Subsec-
tion D elaborates communication/computation overlapping and
GPU out-of-core operations. In the end, we present a novel
heap design to amortize the overhead introduced by high-
frequency GPU memory allocation/deallocation.
Algorithm 1: A new L3 BLAS tile algorithm for hetero-
geneous multi-GPUs
Data: A,B and C
Result: C
1 begin
2 NoneBlockingTaskQueue(TQ)← init(A,B,C)
3 CacheCoherenceProcotol(CCP )← init()
4 for gpu ∈ GPUs do
5 SpawnThread(ComputingKernel, [TQ,CCP ])
6 AllThreadsJoin()
7
8 Function ComputingKernel (TQ,CCP )
9 BindToCore(self)
10 while TQ 6= ∅ do
11 for RS ∈ ReservationStation(RS) do
12 if RS = ∅ then
13 task ←
Dequeue(TQ) or WorkStealing()
14 priortiy ← CalculatePriority(task)
15 RS ← {task, priortiy}
16 StreamsSynch()
17 ReaderUpdate()
18 for k do
19 for task ∈ Top 4 Prioritized Tasks in RS do
20 i← task, j ← task
21 stream idx← task
22 Aik ← CCP (&(task → Aik))
23 Bkj ← CCP (&(task → Bkj))
24 SetStream(stream idx)
25 AsyncBLAS(Aik,Bkj,Cij)
Alg.1 describes the skeleton of the proposed L3 BLAS
algorithms for heterogeneous multi-GPUs. Lines 1 to 6 in-
dicate the general runtime procedures; and lines 8 to 25
indicate the GPU specifics during the computation. In lines
1 to 6, the runtime initializes a hierarchical tile cache and
a global non-blocking task queue. Then a CPU thread is
spawned for each GPU to submit instructions. The threads
join together after the global task queue depletes. In lines 8
to 25, GPUs concurrently retrieve tasks and interleave them
via multi-streams to overlap the communication during the
asynchronous progression. The line 13 indicates two ways
Fig. 2: The proposed two levels hierarchical tile caches. The
L1 tile cache is the GPU onboard RAM; the L2 tile cache
is the combined RAMs of GPUs on the same PCI-E switch.
Each block holds a tile on a contiguous independent memory
segment.
of task retrieval, either from global task queue or working
stealing. A GPU steals tasks from other Reservation Stations
(RS), which only happens when the GPU exhausts tasks on RS
while the global queue is also empty. Lines 22 and 23 reflect
tile reuses with the proposed tile caches. Line 25 suggests an
asynchronous L3 BLAS kernel invocation, the type of which is
dictated by Eq.1 given the tile indices i, j and k. Lines 18, 19
and 24 indicate the communication/computation overlapping.
Although we implement this algorithm on NVIDIA GPUs, we
can easily migrate it to work on other accelerator technologies
such as Intel Xeon Phi and AMD FirePro.
A. Taskizing L3 BLAS
We define the task as solving a Cij in Eq.1. Tile algorithms
yield the insight to systematically break down the output
matrix C into a set of tiles Cij, the computation of which,
involves reading Aik, Bkj, Cij and an independent write of
Cij. Hence concurrently solving the Cij is data hazards free.
Eq.1 also indicates that the workload of Cij in non-GEMM
routines varies according to the upper bound of k. In summary,
a task has 3 notable properties by following definition:
• Reading the inputs for a task is data dependency free.
• Concurrent writing a task’s output is data race free.
• The workload of each task varies.
Given a matrix C of size M ×N and tile size T, the degree
of parallelism is as follows:
degree-of -parallelism = dM/T e ∗ dN/T e (2)
In our implementation, a task holds the necessary metadata to
solve a Cij such as tile indices i, j and k, the dimensions of
Cij , and its host address. The runtime virtually slices a matrix
and stores the tile metadata in tasks. Consequently, taskizing
a L3 BLAS does not require significant additional memory.
B. Two Level Hierarchical Tile Caches
Fig.2 demonstrates the structure of separate memory spaces
on the multiGPU system. Each GPU equips with a private
RAM; and CPUs share the host RAM. Nowadays, GPU RAM
can be up to 12 GB while a single double precision matrix of
size 32768∗32768 is 8.6 GB. The relatively small GPU RAM
capacity limits the GPU in-core computing from handling the
large scale L3 BLAS. One solution using tile algorithms is
to dissect a large L3 BLAS into smaller ones, and solve
them in succession. At each time, we move in tiles from
the host on demand. Frequent GPU off-chip memory access,
however, overloads the PCI-E and degrades the performance to
be suboptimal. Meanwhile it is possible to reuse certain tiles
in separate tasks as indicated by Eq.1. Therefore, exploiting
the tile temporal locality by caching the most frequently used
ones on the GPU RAM is necessary.
We present a novel two level fully associative tile caches in
Fig.2. The L1 tile cache is implemented using GPU onboard
RAM; the L2 tile cache is implemented using the combined
memory spaces of GPUs, which share the same I/O hub. The
L1 tile cache hit enables direct tile reuse; the L2 tile cache
hit reduces the CPU-GPU communication to GPU-GPU com-
munication by retrieving the tile from the hardware adjacent
GPU. The rational of implementing the L2 cache are twofold:
(1) GPU P2P communication better saturates PCI-E delivering
at least 6 GB/s performance. [11] (2) The comparably faster
GPU RAM reduces the latency of data fetching. Therefore it is
more cost effective to retrieve a tile from a GPU than from the
host RAM. To the best of our knowledge, BLASX is the first
linear algebra library that considers such tile cache hierarchies
on multi-GPU systems.
Algorithm 2: The ALRU Operations
Data: Tile Host Address (HA) and ALRU
Result: GPU Address (GA)
1 Function Translate (ALRU , HA)
2 LRUBlock(LB)← ALRU.HashMap(HA)
3 if LB = ∅ then
4 GA←Malloc(TileSize)
5 if GA = ∅ then
6 GA← ALRU.Dequeue()
7 ALRU.Enqueue(HA,GA)
8 return GA /* new tile cache */
9 else
10 return LB.GA /* cache hit */
11 Function Dequeue ()
12 LRUBlock(LBEnd)← ALRU.end
13 while LBEnd 6= ALRU.begin do
14 if LBEnd.Reader = 0 then
15 remove the LBEnd from the ALRU
16 return LBEnd.GA
17 else
18 LBEnd← LBEnd.Previous
19 Function Enqueue (HA, GA)
20 LRUBlock(LB)←Malloc(LRUBlock)
21 LB.HA← HA, LB.GA← GA
22 LB.Reader ← 0
ALRU.HashMapInsert(HA,GA)
23 ALRU.InsertFront(LB)
To implement the L1 tile cache, we need a LRU to discard
the least frequently used tiles. Unfortunately the vanilla LRU
algorithm [12] can not accommodate the asynchronous kernel
launches in BLASX. Consequently, we propose the Approxi-
mate Least Recent Used (ALRU) to handle asynchronicities.
Alg.2 presents the three basic operations in the proposed
ALRU. It adopts a reader (line 22) to track the tile usage.
Specifically the reader of a tile is atomically incremented if
a task need it. On the other hand, the reader is atomically
decremented if a task releases the tile. We only update readers
promptly after the synchronization point because that’s the
only place to inform the tile status (line 17 in Alg.1). A
Fig. 3: The state transition diagram of MESI-X protocol
in BLASX. The red M state is an ephemeral state, which
immediately transits to I state by writing the tile back to host
RAM.
reader equals 0 indicating no tasks using it; thereby it can be
released in the ALRU. Since the runtime does not immediately
synchronize readers in the asynchronous progress, it is possible
to have nonzero reader on the least recent used tile. This
deviates from the vanilla LRU policy by discarding the first
approximate (as opposed to the exact) least recent used tile
having the zero reader, which is reflected by lines 14-18 in
Alg.2.
To implement the L2 tile cache, we need a cache coherence
protocol to ensure the consistency of shared tiles in multiple
places. We adopt the MESI-X cache coherence protocol, a
variant of MESI protocol [13], to work for the BLASX’s
particular context. Each ALRU associates with a specific GPU;
The ALRUs all together reflect tile states in accordance with
MESI-X protocol as follows: A tile is at E state if only a ALRU
tracks it; A tile is at S state if multiple ALRUs track it; A tile
is at I state if no ALRUs track it; A tile is at M state if a GPU
writes a Cij to it. Unlike the regular MESI protocol, we define
the M state as an ephemeral state that writes back any tile in
the state to the host RAM and sets the tile state immediately
to I. Fig.3 demonstrates the state transition diagram of the
proposed MESI-X.
C. Scheduling Infrastructure and Strategies
Our scheduling runtime achieves three specific goals:
the proper load balancing on heterogeneous multi-GPUs and
multi-CPUs, the efficient communication with locality aware
scheduling and the sufficient overlapping of computation and
communication. Fig.4 presents the scheduling infrastructure in
our locality aware scheduling runtime, which consists of 4
major components:
1) GPU Computation Thread: a CPU thread to submit
tasks for a specific GPU. To avoid the OS scheduling pre-
emption, we bind the thread to a dedicated CPU core. The
communication/computation overlapping requires at least 2
tasks concurrently running on streams; while Wei et al. [8]
demonstrate no performance gain when using more than 4
streams. This leads us to adopt 4 concurrent tasks to overlap the
computation/communication, which also explains the 4 streams
used in Fig.4.
2) CPU Computation Thread: a CPU thread to submit tasks
for the rest of CPU cores. Peng et al. proposes the hybrid tile
layout to CPU Cores and GPUs due to the inherent devices’
performance differences [14]. We adopt the same concept but
different approach. The CPU cores dequeue one task at each
time and solve the task with a multithreaded BLAS kernel,
where the tile is further factorized.
3) Reservation Station (RS): a buffer designed to hold
the upcoming tasks for a GPU. The runtime conducts work
Fig. 4: The runtime infrastructure for our locality aware
dynamic scheduler. G2H is the GPU to Host data transfer while
H2G is the reverse.
stealing and priority scheduling on it. Each slot of a RS
conveys a task priority, a task metadata, and a stream index.
4) Non-blocking Task Queue: It is a non-blocking queue
allowing efficient concurrent dequeue and enqueue operations
based on the algorithm proposed by Maged and Michael [15].
On heterogeneous systems, tasks can be executed on any
computing device with load balancing being key to achieve
optimal performance. In reality, it is possible to have task
workload variation computed on the processors of various
speeds. The two effects accentuate the uncertainty of proces-
sors on tasks consumption speed. One simple load balancing
solution is to distribute tasks based on the inherent processors’
speeds, which is the case in PaRSEC; the actual execution
time, however, changes with the GPU kernel saturation and
the actual tasks’ workload. Hence this solution may lead
heavy tasks, however rare to clog the slower processor(s).
Our load balancing scheme leverages the task-level workload
and the processors’ real time speed to achieve the optimal
performance. We treat GPUs about to entering idle states as
a sign of demand, causing the thread to dequeue tasks. The
key of our dynamic task-scheduling runtime is that processors
simultaneously pull out tasks from the global non-blocking
task queue by their demands. Specifically, faster processors
consume more tasks and initiate more demands while the
slower processors consume fewer and demand less. The load
is thereby adjusted according to the real time demand of
individual processors. The perfect scenario in this case is such
that processors, regardless of speed difference, spend identical
time on task execution without idling.
We adopt the work sharing and the work stealing to achieve
this demand driven load balancing on the proposed runtime
infrastructure. Lines 10 to 13 in Alg.1 indicate each GPU
populates the affiliated RS either through global task queue or
work stealing. The global non-blocking task queue simulates
the work sharing by enabling concurrent task retrieval from
multi-GPUs. Since the line 16 in Alg.1 is a synchronization
point, a GPU will not demand tasks from RS unless it finishes
current ones. Hence GPUs that demand more attempt to pull
out more tasks from the queue. The work stealing intends
to further improve the load balancing under the situation of
Fig. 5: Performance degeneration when increase the matrix
size using CudaMalloc and CudaFree.
Fig. 6: A fast heap design to amortize the GPU memory
allocation/deallocation overhead.
empty global task queue but full RS on GPUs. In this case, a
underutilized GPU or CPU takes the initiative to steal a task
from a overloaded RS, further balancing tasks at a finer level.
Prioritizing tasks with the better temporal locality lessens
unnecessary communication. Lines 14, 15 and 19 in Alg.1
demonstrate task prioritization in the runtime. At each k, a
GPU fetches top 4 prioritized tasks from the affiliated RS. The
runtime refreshes the priorities in RS after new tasks coming
in. The runtime populates tasks’ priorities based on the extent
of potential cache hits, which follows:
priority =
z∑
k=0
(f(Aik) + f(Bkj)) (3a)
f(X) =

0, Otherwise
1, if L2 cache hit
2, if L1 cache hit
(3b)
The tile locality has 3 scenarios: it hits the L1 tile cache; it
hits the L2 tile cache and it is located at the host RAM. In
terms of communication cost, there is no cost for L1 cache
hit, and the L2 cache hit incurs less cost than retrieving the
tile from the host.
D. Overlapping Computation with Communication
The CUDA stream is sequential operations executed in the
issued order on the GPU with two notable properties. First, the
operations on different streams can be simultaneously executed
within the same physical device. This property enables the
communication/computation overlapping via moving the data
on one stream while executing kernels on another one. Sec-
ondly, streams can divide the GPU processing power between
a few tasks by allocating segments to each task in turn. With
these two properties, a tight interleaving of tasks on multiple
streams can render the actual communication cost trivial.
L3 BLAS tile algorithms indicate that a task essentially
involves k ∈ [1, z] steps of kernel execution. The runtime
overlaps the communication/computation by interleaving tasks
as follows: First, the RS directly maps top 4 prioritized tasks
TABLE II. The system configuration of experimental machines
System Configuration
Everest Makalu
OS CentOS v. 6.3 Ubuntu Server 14.04
GPU 3 Kelper K40 2 Kelper K40 and 2 Maxwell Titan X
CPU 2×Xeon E5 4655 V3 2 Xeon E5 1620 V3
RAM 64 GB DDR3 64 GB DDR3
CUDA v 6.5 v 7.0
Compiler GCC 4.4.7 GCC 4.8.2
CPU BLAS OpenBLAS v 1.13 OpenBLAS v 1.13
onto 4 CUDA streams. Second, the runtime executes each step,
k, on all streams such that tasks in the step are computed
before advancing to the next step (line 19-25 Alg.1), which
is guaranteed by the sequential execution feature of CUDA
stream. Solving a step of tasks involves data transfer of the
required tiles (if cache miss) followed by kernel executions. As
tasks progress on multiple streams in this way, the data transfer
on a stream eventually overlaps with the kernel execution on
another one. In sum, we can consider the overall execution time
to be the sum of kernel execution time, plus initial tile move
in, and plus final tile move out. This negligible communication
cost enables the input and output matrices to reside on the host
memory. Consequently we claim our GPU operations are out-
of-core.
E. Amortize GPU Memory Allocation/Deallocation Overhead
GPUs require memory allocation for tiles move-in and
deallocation for tiles move-out. Increasing the computation
scale leads to the performance deterioration due to the sig-
nificant overhead of memory allocation/deallocation. Fig. 5
presents the performance degeneration with CUDA’s native
memory management utilities such as cudaMalloc and cud-
aFree. As a consequence, we implement a fast heap based
GPU memory management utilities, BLASX Malloc, to al-
leviate this issue. The core concept of it is to amortize the
allocation/deallocation overhead by adopting a big chunk of
GPU memory as the preallocated heap.
Fig. 6 presents the basic scheme of the proposed heap
design, which consists of a meta-data list, a occupied list and
an empty list. A node in the meta-data list traces the length
of memory segment and its occupation status. Each block
of the occupied and empty list tracks the allocated segment
and the free segment respectively. The dynamics of this heap
is as follows: During the allocation, the heap searches for
the first node with enough memory in the empty list, which
is subsequently split into two nodes. One for the occupied
list recording the allocated memory; The other for the empty
list recording the residual memory. During the deallocation,
the runtime locates the segment in the occupied list with
a hashtable. If either the node’s left or right neighbors are
contiguous with the node in terms of memory, they merge
together. Then the segment is marked as free and placed
back to empty list afterwards. Fig. 5 demonstrates our heap
based method effectively amortizes the memory allocation and
deallocation overhead.
V. PERFORMANCE EVALUATION
In this section, we present comprehensive evaluations of
our L3 BLAS routines. We conducted the experiments on two
shared memory machines Everest and Makalu, the specifica-
tions of which are included TABLE II.
Fig. 7: The comprehensive benchmarks of double precision L3 BLAS on Everest. BLASX demonstrates superior performance
than state-of-art commercial product cuBLAS-XT and academic related projects such as MAGMA and SuperMatrix. The out-
of-core GPU operations in BLASX enables larger scale GPU computing than does the in-core GPU operations by PaRSEC and
MAGMA.
TABLE III. The average parallel efficiency of various imple-
mentations with input square matrix N ∈ [1024, 39936] on
Everest.
Routines BLASX PaRSEC MAGMA cuBLAS-XT SuperMatrix
DSYRK 85.54% N/A N/A 64.21% 33.33%
DTRSM 81.58% N/A 77.3% 77.27% 30.72%
DTRMM 88.99% N/A N/A 82.11% 38.96%
DSYMM 90.36% N/A N/A 57.96% 43.42%
DGEMM 93.53% 92.85% N/A 89.85% 46.22%
DSYR2K 85.54% N/A 79.58% 64.21% 34.70%
TABLE IV. The average throughput of Direct Memory Ac-
cess(DMA) engine.
Bidirectional Host and GPU GPU to GPU
Throughput 6.54 GB/s 7.8 GB/s
A. The Comprehensive L3 BLAS Benchmark
We evaluate performance against a commercial product
cuBLAS-XT and seminal academic libraries such as Super-
Matrix, PaRSEC and MAGMA. We setup benchmarks on the
machine Everest (TABLE II) using double precision L3 BLAS
routines. The memory range of input or output matrices is page
locked to expedite PCI-E transfer; and the time spent on page-
locking is excluded from performance metric. The alpha and
beta are two random float constants; other parameters such as
UPLO, SIDE, TRANS and DIAG are ensured to be same
in each comparison. The input matrix size N starts from 1024
to 39936 at increments of 1024. The performance data are from
the average of 3 runs; execution order of BLASX, cuBLAS-
XT, MAGMA, SuperMatrix and PaRSEC is randomized to
eliminate the potential ordering influence.
Fig.7 demonstrates the comprehensive benchmarks on
Everest. In single GPU benchmarks, the mean performance
of BLASX converges to 92.68% of the in-core cuBLAS
DGEMM peak; whereas the average performance of PaR-
SEC, MAGMA, cuBLAS-XT and SuperMatrix attains 91.10%,
81.28% , 79% and 63.99% of in-core cuBLAS DGEMM
peak respectively. Even though PaRSEC achieves comparable
performance, its GPU in-core operation limits PaRSEC to
handle matrix sizes N > 22528 as the required memory,
22528 ∗ 22528 ∗ 8 ∗ 3 = 12.18 GB, is beginning to exceed
the GPU RAM 12 GB capacity. This also explains partial
benchmarks on the MAGMA DSYR2K and DTRSM. The
sufficient GPU communication/computation overlapping is one
of predominant factors to the high performance of BLASX.
Whereas Supermatrix follows a simple fork and join model
blocking kernel launches until the on demand data is trans-
ferred. The non-overlapped communication in SuperMatrix
incurs too much latency to delivery comparable performance.
Hence we omit its discussion in the rest of paper.
BLASX demonstrates linear speedups and the highest
scalability under multiGPU configurations. Fig.7 indicates per-
formances increase with the matrix size and reaches a plateau
after N > 15000. At the matrix size 16384, the DSYR2K
speedup of BLASX, cuBLAS-XT, MAGMA on two GPUs
are 1.99x, 1.83x, 1.91x; and the triple GPU speedup are 2.91,
2.16, 2.88. However, real world applications often entail small
scale matrix N < 15000. Measuring parallelizations at various
matrix sizes is more convincing than concluding the speedup
(a) DGEMM (b) DSYMM
(c) DTRSM (d) DTRMM
(e) DSYR2K (f) DSYRK
Fig. 8: The execution time profile (classified into COMPT,
COMM and OTHER) at the square matrix size N = 16384
on Everest. The horizontal indices 1, 2, 3 represents GPU 1,
GPU 2 and GPU 3.
at a particular matrix size. Since the parallel efficiency is a
performance metric to describe the scalability at a specific
problem size N , we calculate the average parallel efficiency
based on 39 matrix sizes N ∈ [1024, 39936] to yield a global
insight about scalabilities at various matrix sizes; and we
adopt forward padding to partial benchmarks in MAGMA and
PaRSEC. In TABLE III, BLASX outperforms the second best
alternatives at the average rate of 5%. Particularly BLASX
DSYMM is 32.4% higher than the second best implementation
by cuBLAS-XT. There are 4 major factors that contribute to
the success of BLASX, which are 1) the demand-driven load
balancing, 2) the seamless GPU occupancy, 3) the significantly
less communication volume and 4) the efficient GPU-GPU P2P
communication. We investigate each factors as follows.
First, our demand-driven dynamic load balancing is key
to the BLASX high performance. In Fig.8, we dissect each
GPU execution profiles into 3 major components—the compu-
tation (COMPT), the unoverlapped communication (COMM),
the synchronization and free gaps among kernel launches
(OTHER). A typical ideal scheduler allows each GPUs, re-
gardless of differences, to spend identical time during the
execution. Fig.8 indicates that dynamic schedulers employed
by BLASX and PaRSEC are better than static schedulers by
MAGMA and cuBLAS-XT. For example, the average elapsed
time differences between the fastest GPU and the slowest GPU
of cuBLAS-XT and BLASX are 0.2961 and 0.0391 seconds;
and the same metric for MAGMA (only count COMM) and
BLASX is 0.7837 and 0.0457 seconds. The DGEMM of PaR-
SEC and BLASX attains comparable performance of 0.0252
and 0.0285 seconds.
Second, the seamless GPU occupancy enables BLASX to
fully saturate each GPU. In Fig.8, BLASX demonstrates the
least none-computation cost (OTHER+COMM). The commu-
nication is counted toward latency if it is not overlapped. The
average communication latency of BLASX is 0.0575s while
cuBLAS-XT is 0.4917s. The difference is largely due to the
significantly reduced communication volume and the seam-
less stream-level overlapping and kernel launches. OTHER
TABLE V. The communication volume(in MB) of L3 BLAS
routines at the input square matrix size N = 16384.
DGEMM DSYMM
MB BLASX cuBLAS-XT h-PaRSEC BLASX cuBLAS-XT
GPU1 6895 24528 7563 5628 18865
GPU2 1811+4768 24243 7160 1249+5318 18865
GPU3 721+4462 24243 6160 343+3758 18102
DTRSM DTRMM
MB BLASX cuBLAS-XT MAGMA BLASX cuBLAS-XT
GPU1 2751 12985 15130 4966 8885
GPU2 981+4622 12792 2768 1694+4353 2020
GPU3 300+2575 12876 1587 327+2365 8922
DSYR2K DSYRK
MB BLASX cuBLAS-XT MAGMA BLASX cuBLAS-XT
GPU1 7281 15910 5738 4278 12314
GPU2 2231+5905 15910 5720 1363+3410 12834
GPU3 1308+2969 15900 5720 1213+2536 11492
† The red represents the volume of GPU to GPU communication.
‡ The black represents the volume of bidirectional Host to Device communication.
• The Peer access is only available between GPU2 and GPU3 on the machine Everest.
includes synchronization latency and the minor GPU idle gaps
among kernel launches. The synchronization is necessary to
ensure the mathematical rigorousness; and idle gaps depend
on the tightness of kernel launches on multistreams. Increasing
streams, as demonstrated by Wei et al [8], improves the
GPU saturation by reducing those gaps. BLASX dynamically
interleaves tasks over multiple streams, while cuBLAS-XT
adopts two.
Third, the hierarchical tile caches in BLASX dramatically
diminish the communication volume. According to TABLE
V, the average communication volume of cuBLAS-XT, 15143
MB, is 2.95 times of BLASX, 5132 MB. The L1 tile cache of
BLASX exploits the tile temporal locality to minimize global
communication, which is not the case for cuBLAS-XT. The in-
creasing gaps among the GPU clock frequency (1.43 TFLOPS
on K40c), GPU memory bandwidth (288 GB/sec on K40c)
and PCI-E bandwidth (31.51 GB/s v4.0 x16) identify GPU
off-chip memory access extremely expensive. Consequently,
the excessive data transfer of cuBLAS-XT incurs a huge
latency penalty to its performance. The DGEMM data also
indicates that BLASX (6219 MB) saves 12% communication
over PaRSEC (6961 MB).
Finally, reducing the CPU-GPU communication to the
GPU-GPU communication further improves the communica-
tion efficiency of BLASX. One of the defining features of
BLASX is the implementation of L2 tile caches, the purpose of
which is to retrieve tiles from hardware adjacent GPUs under
L1 misses. TABLE IV justifies our L2 tile cache proposal
indicating that the average GPU-GPU data transfer is 19.27%
faster than the CPU-GPU transfer. We highlight the GPU-
GPU communication volume in TABLE V. The interGPU
communication only happens between GPU2 and GPU3 as
only them share the same PCI-E switch on Everest. We expect
the efficient GPU-GPU communication eventually dominating
with more GPUs available on the system.
Fig.9 presents the DGEMM CPU performance of cuBLAS-
XT and BLASX on the machine Makalu. We sample the CPU
contribution by taking the difference of CPU enabled DGEMM
to CPU disabled DGEMM under same scenarios. Our runtime
automatically assign tasks to CPU; hence the performance is
Fig. 9: The CPU performance of cuBLAS-XT and BLASX at
various CPU ratios.
Fig. 10: The performance variations with respect to different
tile sizes on Everest.
represented with a horizontal line. Fig.9 indicates the CPU
contribution of BLASX is 78% faster than the best case in
cuBLAS-XT. The downtrend in the figure also suggests an
improperly chosen CPU ratio overloads the CPU at the expense
of GPUs.
B. The Only Tuning Parameter—Tile Size
We strive to develop a portable L3 BLAS library that
delivers the optimal performance on different platforms with
the least effort from users. In BLASX, the tile size is the only
tuning parameter. Generally a large tile size reduces the degree
of parallelism while a small tile size poorly saturates the GPU
and PCI-E. The ideal tile size is a result of trade-off among
GPU saturation, PCI-E efficiency and available parallelism.
Fig. 10 demonstrates the impact of tile size on the overall
DGEMM performance at two matrix sizes. The performance
increases with tile size and reaches a plateau eventually. In this
case, we choose the tile size 1024× 1024 for benchmarks on
Everest.
C. Applications
BLASX intends to provide multiGPU performance boost
by directly replacing the CPU based alternatives. Libraries
such as MAGMA, h-PaRSEC and StarPU requires extensive
changes on the legacy code to comply with standards while
cuBLAS-XT requires commercial licenses without delivering
a proportional increase of performance. In this section, we
present a few applications of BLASX as follows:
a) Caffe [16] is one of the most popular deep learning
frameworks nowadays, in which BLAS computes the image
convolution [17], forwards and backwards passes of densely
connected layers [18]. We built a CPU version of Caffe and
changed the BLAS linkage to BLASX. An Artificial Neural
Network (ANN) of architecture, 3072→16384→16384→10,
was trained with CIFAR-10 dataset [19] to classfiy images in
10 categories. We used Caffe’s benchmark utilities to measure
the average elapsed time for a forward and backward passes
on the machine Makalu. The experiment data demonstrates
that BLASX accelerates ANN training up to 2.48 W.R.T Caffe
GPU training and 62.3 W.R.T Caffe CPU training. In terms
TABLE VI. The exemplary performance boost of MATLAB
SIMULINK libraries while using BLASX on Everest.
command Description Speedup
A ∗B matrix multiplication in single precision. 12.75x
A ∗B matrix multiplication in double precision. 8.27x
nnmf factorize the A into nonnegative factors W and H. 6.72x
rotatefactors rotates the A to maximize the varimax criterion. 5.83x
lsqlin solves the linear system in the least-squares sense 3.09x
of model parameters, the Caffe’s single GPU training can
accommodate up to 1.5 ∗ 109 parameters on a 12 GB GPU.
BLASX, however, enables a model of 3.2 ∗ 1010 parameters
on a multiGPU server with 256 GB host RAM. Please refer
to our poster [20] for more details1.
b) MATLAB, R and Octave are renowned technical com-
puting languages widely used in academia and industry. These
scientific languages offload the primitive matrix or vector
operations to the BLAS to ensure the performance. Users
can integrate BLASX with MATLAB by simply exporting
the environment variable BLAS VERSION to the location of
BLASX. Although MATLAB has the GPU computing toolbox,
it requires users to manually manage the data on the GPU yet
without multiGPU support. BLASX elegantly resolves these
issues with its underlying dynamic runtime. Table VI presents
a few exemplary instances of MATLAB’s SIMULINK libraries
while using BLASX. There are additional vast applications
of BLASX such as finding the shortest path in a graph,
the topology optimization [22] and finite element analysis in
structure mechanics [23].
VI. CONCLUSION
Existing L3 BLAS libraries such as PaRSEC, MAGMA,
Supermatrix, cuBLAS-XT suffers from issues such as back-
ward compatibility, insufficient communication/computation
overlapping, inefficient communication and poor scalability.
In this paper, we design and implement BLASX, a suite of
L3 BLAS, that delivers the best L3 BLAS performance on
heterogeneous multiGPU systems. We introduce a novel two
level hierarchical tile cache to reduce the global communica-
tion; Our locality aware scheduling runtime perfectly balances
the load across heterogeneous GPUs and CPUs. We optimized
communication/computation overlapping on streams to renders
trivial communication cost; thereby BLASX computes in a
out-of-core fashion that insures input and output data always
remains on the host RAM. Extensive benchmarks demon-
strate that BLASX consistently outperforms the leading indus-
trial and academia related projects in terms of performance,
scalability, and communication efficiency. More importantly,
BLASX requires the least effort from users to integrate with
the vast amount of legacy BLAS based applications.
REFERENCES
[1] Dongarra, Jack J., et al. ”A set of level 3 basic linear algebra
subprograms.” ACM Transactions on Mathematical Software (TOMS)
16.1 (1990): 1-17.
1The CPU activation of Caffe is a single thread function, which is not
efficient. We modified the function to be multithreaded to benchmark the
results.
[2] Chan, Ernie, et al. ”Supermatrix: a multithreaded runtime scheduling
system for algorithms-by-blocks.” Proceedings of the 13th ACM SIG-
PLAN Symposium on Principles and practice of parallel programming.
ACM, 2008.
[3] Tomasulo, Robert M. ”An efficient algorithm for exploiting multiple
arithmetic units.” IBM J. Res. Dev (1995): 13-21.
[4] Augonnet, Cdric, et al. ”StarPU: a unified platform for task scheduling
on heterogeneous multicore architectures.” Concurrency and Compu-
tation: Practice and Experience 23.2 (2011): 187-198.
[5] Blumofe, Robert D., and Charles E. Leiserson. ”Scheduling multi-
threaded computations by work stealing.” Journal of the ACM (JACM)
46.5 (1999): 720-748.
[6] Leung, Joseph Y-T., and Jennifer Whitehead. ”On the complexity of
fixed-priority scheduling of periodic, real-time tasks.” Performance
evaluation 2.4 (1982): 237-250.
[7] Nath, Rajib, et al. ”Optimizing symmetric dense matrix-vector mul-
tiplication on GPUs.” Proceedings of 2011 International Conference
for High Performance Computing, Networking, Storage and Analysis
(SC). ACM, 2011.
[8] Wu, Wei, et al. ”Hierarchical DAG Scheduling for Hybrid Distributed
Systems.” 29th IEEE International Parallel and Distributed Processing
Symposium. 2015.
[9] developer.nvidia.com/cublasxt
[10] Goto, Kazushige, and Robert Van De Geijn. ”High-performance imple-
mentation of the level-3 BLAS.” ACM Transactions on Mathematical
Software (TOMS) 35.1 (2008): 4.
[11] Schroeder, Tim C. ”Peer-to-peer and unified virtual addressing.” GPU
Technology Conference, NVIDIA. 2011.
[12] Kdzierski, Kamil, et al. ”Adapting cache partitioning algorithms to
pseudo-lru replacement policies.” Parallel & Distributed Processing
(IPDPS), 2010 IEEE International Symposium on. IEEE, 2010.
[13] Sweazey, Paul, and Alan Jay Smith. ”A class of compatible cache
consistency protocols and their support by the IEEE futurebus.”
ACM SIGARCH Computer Architecture News. Vol. 14. No. 2. IEEE
Computer Society Press, 1986.
[14] Song, Fengguang, Stanimire Tomov, and Jack Dongarra. ”Enabling and
scaling matrix computations on heterogeneous multi-core and multi-
GPU systems.” Proceedings of the 26th ACM international conference
on Supercomputing. ACM, 2012.
[15] Michael, Maged M., and Michael L. Scott. ”Simple, fast, and practical
non-blocking and blocking concurrent queue algorithms.” Proceedings
of the fifteenth annual ACM symposium on Principles of distributed
computing. ACM, 1996.
[16] Jia, Yangqing, et al. ”Caffe: Convolutional architecture for fast feature
embedding.” Proceedings of the ACM International Conference on
Multimedia. ACM, 2014.
[17] Chellapilla, Kumar, Sidd Puri, and Patrice Simard. ”High performance
convolutional neural networks for document processing.” Tenth Inter-
national Workshop on Frontiers in Handwriting Recognition. Suvisoft,
2006.
[18] Rumelhart, David E., Geoffrey E. Hinton, and Ronald J. Williams.
”Learning representations by back-propagating errors.” Cognitive mod-
eling 5 (1988): 3.
[19] Krizhevsky, Alex, and Geoffrey Hinton. ”Learning multiple layers of
features from tiny images.” (2009).
[20] Wang, Linnan, et al. ”Large Scale Artificial Neural Network Training
Using MultiGPUs” SC poster (2015).
[21] Seidel, Raimund. ”On the all-pairs-shortest-path problem in un-
weighted undirected graphs.” Journal of computer and system sciences
51.3 (1995): 400-403.
[22] Bendse, Martin P., and Ole Sigmund. ”Topology Optimization.”
(2009): 3928-3929.
[23] Szabo, Barna Aladar, and Ivo Babuka. Finite element analysis. John
Wiley & Sons, 1991.
