Efficient Matrix Factorization on Heterogeneous CPU-GPU Systems by Yu, Yuanhang et al.
ar
X
iv
:2
00
6.
15
98
0v
1 
 [c
s.D
C]
  2
4 J
un
 20
20
Efficient Matrix Factorization on Heterogeneous
CPU-GPU Systems
Yuanhang Yu♮, Dong Wen♮, Ying Zhang♮, Xiaoyang Wang§, Wenjie Zhang†, and Xuemin Lin†
♮AAII, University of Technology Sydney, Australia
§Zhejiang Gongshang University, China
†The University of New South Wales, Australia
♮yuanhang.yu@student.uts.edu.au; {dong.wen, ying.zhang}@uts.edu.au;
§
xiaoyangw@zjgsu.edu.cn;
†
{zhangw, lxue}@cse.unsw.edu.au;
Abstract—Matrix Factorization (MF) has been widely applied
in machine learning and data mining. A large number of
algorithms have been studied to factorize matrices. Among them,
stochastic gradient descent (SGD) is a commonly used method.
Heterogeneous systems with multi-core CPUs and GPUs have
become more and more promising recently due to the prevalence
of GPUs in general-purpose data-parallel applications. Due to the
large computational cost of MF, we aim to improve the efficiency
of SGD-based MF computation by utilizing the massive parallel
processing power of heterogeneous multiprocessors. The main
challenge in parallel SGD algorithms on heterogeneous CPU-
GPU systems lies in the granularity of the matrix division and
the strategy to assign tasks. We design a novel strategy to divide
the matrix into a set of blocks by considering two aspects. First,
we observe that the matrix should be divided nonuniformly, and
relatively large blocks should be assigned to GPUs to saturate
the computing power of GPUs. In addition to exploiting the
characteristics of hardware, the workloads assigned to two types
of hardware should be balanced. Aiming at the final division
strategy, we design a cost model tailored for our problem to
accurately estimate the performance of hardware on different
data sizes. A dynamic scheduling policy is also used to further
balance workloads in practice. Extensive experiments show that
our proposed algorithm achieves high efficiency with a high
quality of training quality.
I. INTRODUCTION
As a common technique in machine learning and data
mining, matrix factorization (MF) has been widely applied
in many areas, such as recommendation system [1], [2],
social network analysis [3], [4], web mining [5], word em-
bedding [6], and graph embedding [7].
Given a sparse matrix R with m rows and n columns, the
aim of MF is to decompose R into two dense matrices P (m×
k) and Q(k × n), which satisfies the following condition:
R ≈ P ×Q
Here, the dimension k is far less than m and n so that the
original sparse data can be represented well by new dense
data with low dimensionality.
Stochastic Gradient Descent. The existing literature has pro-
posed three main approaches to solve MF, namely, alternating
least squares (ALS), coordinate descent (CD), and stochastic
gradient descent (SGD). Among them, SGD-based algorithms
have received the most attention, due to their algorithmic
simplicity and effectiveness. SGD algorithms execute several
iterations to make the training result convergent. The number
3.0 3.0
5.0 4.5
5.03.0
1.05.05.0
1.44
2.25
2.32
1.49
0.92
0.23
1.28 0.65
0.02 1.33
0.91
1.83
2.00 0.272.10
1.08
R
Q
P
= ×
p
1
p
2
p
3
p
4
q
1
q
2
q
3 q4
Movies
C
u
s
to
m
e
rs
Fig. 1. A rating matrix R and a corresponding matrix factorization
of iterations can be a parameter specified by users. In each it-
eration, the gradient of every element in the sparse matrix R is
computed and the corresponding vectors in the result matrices
are updated. We focus on the SGD-based MF algorithms in
this paper and use MF to denote the SGD-based MF whenever
there is no ambiguity.
In the literature, several parallel SGD approaches have been
proposed on different system settings such as GPUs (e.g.,
CuMF_SGD [8]), multi-core CPUs (e.g., FPSGD [9]), and
distributed systems (e.g., NOMAD [10]). To parallelize the
computation in SGD, the main idea of these methods is to
uniformly divide the sparse matrix R into a set of blocks.
In each iteration, a set of mutually independent blocks are
assigned to different working units (e.g., threads, nodes, or
GPUs) and updated by these working units. Here, two blocks
are independent of each other if they do not share the same
row and the same column. This strategy avoids the writing
conflict in P and Q since the elements in the same row (resp.
column) of R will update the same row (resp. column) of
P (resp. Q). Several optimizations are also studied for their
specific system settings, and more related works of MF are
introduced in Section III.
Nowadays, GPU becomes very popular for running general-
purpose data-parallel applications. Consequently, heteroge-
neous systems with multi-core CPUs and GPUs become more
and more promising for many general tasks. The idea of
combing CPU and GPU not only speeds up the task execution
but also effectively utilizes the computing resources available
in heterogeneous systems. By contrast, to meet performance
requirements, the CPU-only or GPU-only systems are usually
over-provisioned. This leads that their average utilization re-
mains low. For example, after allocating the task to a GPU
(i.e., starting the kernel), the CPU stays idle. Similarly, for
applications where the GPU memory bandwidth is a major
bottleneck, the massive computation resources of GPUs re-
main underutilized. If the resources that were originally idle
in the system are utilized effectively, the running time of the
program will be reduced. This has motivated a significant
amount of research on heterogeneous computing techniques
[11]–[13]. To our best knowledge, there is no existing work
on the MF task on heterogeneous CPU-GPU systems, and it is
desirable to develop an efficient MF algorithm for this system
setting. In this paper, we mainly study the task division and
the scheduling strategy for SGD on heterogeneous CPU-GPU
systems. Our techniques do not closely depend on any specific
algorithm tailored for a single thread of the CPU or a GPU.
Due to the close relation between threads in a GPU, a
straightforward method to combine GPUs and CPUs is to
consider the GPU as a CPU thread. Then, we can naturally
adopt the state-of-the-art shared memory algorithm FPSGD
[9] in our problem. We call this method HSGD. Despite the
success of HSGD, there are still opportunities for further
improvement. First, based on [9], we derive a matrix division
rule such that the number of blocks in the matrix should be
at least (nc + ng + 1)× (nc + ng), where nc and ng are the
number of CPU cores and GPUs, respectively. We observe that
in HSGD, even with the smallest number of blocks, the size
of a block derived by the uniform division is not large enough
to saturate the processing power of GPUs (Observation 1).
This may limit the overall efficiency of HSGD. Second, the
computing power of a GPU is normally much stronger than
that of a single CPU thread. Given that blocks sharing the same
row or column cannot be processed in parallel, the update
frequency of blocks processed by GPU can be extremely
high due to the powerful computing power of GPU. This
phenomenon may lead to a weak training quality (Example 3).
Our Approach. To overcome the drawbacks in HSGD, we
propose a nonuniform matrix division strategy for the parallel
SGD algorithm. Specifically, we first design a novel cost model
to evaluate the performance of hardware. Existing works on
estimating the working efficiency on heterogeneous CPU-GPU
systems are neither available in our setting nor accurate enough
in the MF problem. Our cost model is tailored for the parallel
SGD algorithm and can accurately estimate the performance
of GPUs when the block is relatively small. Based on the cost
model, we can initially divide the matrix into two parts which
are assigned to GPUs and CPUs, respectively. Each part is
divided into a set of blocks to avoid conflicts in MF com-
putation. We also adopt a dynamic scheduling policy which
allows the dominant computation resource to use work-stealing
mechanism [14] and process blocks originally assigned to the
other resource. This optimization further balances workloads
by filling the gap between the estimation by our cost model
and the practical performance. When the dynamic scheduling
policy is activated, the sparse matrix R is divided into more
blocks for avoiding conflicts.
Contributions. We summarize main contributions as follows.
• A parallel SGD algorithm on heterogeneous CPU-GPU
systems. To our best knowledge, this is the first work to
efficiently tackle Matrix Factorization on heterogeneous
CPU-GPU systems.
• A nonuniform matrix division strategy to improve ef-
ficiency. Compared with a straightforward method to
uniformly divide the matrix, our division strategy fully
utilizes the strong processing power of GPUs.
• A novel cost model to estimate the performance of GPUs.
We balance workloads using a cost model specifically
designed for our problem. The cost model of the GPU
part considers both data transfer and kernel execution.
• Extensive experiments on four benchmark datasets. The
experiments demonstrate that our proposed framework
can effectively utilize the resources in the heterogeneous
CPU-GPU systems and show the superior performance
compared to the competitors.
Organization. The rest of the paper is organized as follows.
Section II introduces background information including MF
and GPU architecture. Section III reviews related works.
Section IV discusses the motivation and proposes an overall
framework. Section V proposes the details of our cost model.
Section VI introduces the dynamic scheduling strategy and
the final matrix division strategy. Section VII reports the
experimental results, and Section VIII concludes the paper.
II. PRELIMINARY
A. Matrix Factorization
We consider a user-item rating matrix R ∈ Rm×n where m
and n are the number of rows and the number of columns of
the matrix, respectively. For each 1 ≤ u ≤ m and 1 ≤ v ≤ n,
ru,v is the rating from the user u to the item v. R is normally
sparse since not every element in R is explicitly reported.
Matrix factorization aims to represent the matrix R as a dot
product between two dense matrices P ∈ Rm×k and Q ∈
R
k×n, where k is the number of latent factors. A mathematical
representation of the matrix factorization is shown as follows.
ru,v ≈ puqv (1)
In Equation 1, pu is the u-th row vector of P , and qv is
the v-th column vector of Q. Matrix factorization achieves
Equation 1 by minimizing the following loss function.
L =
∑
(u,v)∈R
(ru,v − puqv)2 + λP ‖pu‖2F + λQ ‖qv‖2F (2)
In Equation 2, (ru,v − puqv)2 measures the gap between
ru,v and estimated value puqv. λP ‖pu‖2F and λQ ‖qv‖2F are
used to avoid over-fitting. ‖.‖2F computes the Frobenius norm1
of a vector. λP and λQ are regularization coefficients.
EXAMPLE 1. We give an example of the matrix factorization
in Figure 1. The rating matrix R contains nine ratings for
four movies given by four customers. The number of latent
factors k is 2. We have r1,2 = 5.0 which means that u1 gives
a rating 5.0 to v2. The results of R’s matrix factorization are
shown on the right of R. P is a user preference matrix, and
Q is a movie feature matrix. The vector p1(0.23, 2.32) is the
1https://en.wikipedia.org/wiki/Matrix_norm#Frobenius_norm
Algorithm 1: SGD
Input: Rm×n, k, λP , λQ, γ, t
Output: Pm×k, Qk×n
// Data preprocessing phase
1 init_model(Pm×k, Qk×n);
// Training phase
2 foreach iteration← 1 to t do
3 foreach ru,v ∈ Rm×n do
4 eu,v ← ru,v − puqv;
5 pu ← pu + γ(eu,vqTv − λPpu);
6 qv ← qv + γ(eu,vpTu − λQqv);
// Data post-processing phase
7 save_model(Pm×k, Qk×n);
preference of the user u1, and q2(1.33, 2.00)
T is the feature
of the movie v2. The estimated value of p1q2 is 4.9459, which
is close to R1,2 5.0.
B. Stochastic Gradient Descent for Matrix Factorization
It is time-consuming to calculate the loss of the whole
matrix R when using the loss function of Equation 2, espe-
cially when R contains billions of items. Several works have
been done to minimize Equation 2 and improve the efficiency
of matrix factorization [15]–[17]. In this paper, we follow a
prevalent algorithm called stochastic gradient descent (SGD)
[15] among them, and ideas of several other algorithms are
introduced in Section III.
SGD executes several iterations. The number of iterations
can be specified by users. Instead of straightforwardly applying
the gradient descent to minimize Equation 2 in each iteration,
SGD computes the gradient of every element ru,v in R
and updates the corresponding vectors in the result matrices.
Consequently, the original loss function in Equation 2 is
reduced to the following equation.
L = (ru,v − puqv)2 + λPpupTu + λQqTv qv (3)
The gradient of Equation 3 is represented as follows.
∂L
∂pu
= 2(puqv − ru,v)qTv + 2λPpu (4)
∂L
∂qv
= 2(puqv − ru,v)pTu + 2λQqv (5)
Based on the gradient (Equation 4 and Equation 5), we
train the model iteratively, and the value of the loss function
decreases until it is convergent. Equation 6 shows this process
where γ is the learning rate.
pu ← pu − γ
∂L
∂pu
qv ← qv − γ
∂L
∂qv
(6)
We present the pseudocode of SGD in Algorithm 1. There
are several input parameters. Rm×n is a sparse rating matrix
stored in the form of triadic tuple. k is the number of
latent factors. γ is the learning rate. λ is the regularization
parameter. t is the number of iterations. Algorithm 1 outputs
two feature matrices Pm×k and Qk×n. The algorithm consists
of three phases: data preprocessing, SGD training, and data
post-processing. The data preprocessing phase initializes two
resulting matrices P andQ with values generated randomly. In
the training phase, in each iteration (line 3-6), every element
is picked to decrease the value of loss function by using
Equation 6. The SGD training terminates when the given
number of iterations is reached (line 2) or the model converges.
The feature matrices are stored after training.
Problem Definition. In this paper, we aim to develop an
efficient SGD-based MF algorithm in heterogeneous CPU-
GPU systems.
REMARK 1. Our research mainly focuses on the scheduling
strategy for the task division and assignment between CPUs
and GPUs. Our proposed techniques will not closely depend
on any specific GPU or GPU SGD-based MF algorithms.
C. Characteristics of GPUs
We introduce several basic architecture characteristics of
GPUs. More details can be found in [18].
Execution Model. GPU manages thousands of threads in
a three-layer hierarchy, which includes grids, blocks, and
warps. A block contains dozens of warps, and multiple blocks
are organized into a grid. A warp is the smallest execution
unit from the view of hardware and contains 32 consecutive
threads. Instructions of a GPU program (also named kernel)
are executed in a way called SIMT (Single Instruction Multiple
Thread). SIMT means that all threads in a warp must execute
the same instruction on their own data at the same time, while
the state of threads in a warp can be different (active or
inactive) so that they can go through different execution paths.
Therefore, GPU threads are not as flexible as CPU threads.
Memory Hierarchy. GPU has a complicated memory hierar-
chy. We introduce several commonly used concepts. Global
memory is the largest memory, whose access speed is the
slowest. Data in global memory can be accessed by all threads.
Shared memory is on-chip and a relatively scarce resource. It
can be accessed by threads from the same thread block since
GPU promises that threads in the same block are performed
on the same processor. Register is used to store local variables
in kernel, whose access speed is the fastest. A register is only
directly accessed by the thread it belongs to. Threads can
access local variables of other threads in the same warp.
III. RELATED WORKS
In this section, we introduce the details of FPSGD [9] and
CuMF_SGD [8], which are related to our problem. Several
other related works are briefly summarized.
A. Multi Core MF Algorithms
There are several algorithms extending the ideas of SGD
in shared memory (multi CPU cores) systems. Hogwild [19]
adopts a lock-free update strategy to compute MF in parallel.
DSGD [20] performs a matrix-blocking partition to avoid
conflicts. Several variants of DSGD [9], [21], [22] also follow
this property. FPSGD [9] is the state-of-the-art in this type.
We introduce more details of FPSGD as follows.
p
1
p
2
p
3
p
4
q
1
q
2
q
3 q4
B1,1
B2,3
B3,4
B4,2
B1,2
Fig. 2. An example of independent and conflicting blocks in 4×4 grid matrix
FPSGD. FPSGD uniformly divides the rating matrix R into a
set of sub-matrices (called blocks). Two blocks are indepen-
dent of each other if they share neither any common column
nor any common row of the rating matrix. Otherwise, we say
they conflict. FPSGD applies the SGD algorithm to a set of
independent blocks simultaneously and repeats this step until
the number of processed blocks reaches a predefined value. In
each iteration, all selected blocks must be independent since
two blocks in the same row (resp. column) update the same
area of P (resp. Q) and cannot be processed in parallel.
EXAMPLE 2. We explain FPSGD in Figure 2. The rating
matrix R is divided into 4 × 4 blocks. The computation
on B1,1 updates vectors p1 and q1, and the computation
on B2,3 updates vectors p2 and q3. They update vectors in
different regions, which means that the computation on B1,1
and B2,3 can be performed at the same time. We can see that
B1,1, B2,3, B3,4 and B4,2 are mutually independent. However,
B1,1 and B1,2 are not independent of each other since they
both update vectors in p1, and this leads to a conflict.
B. GPU Algorithms
GPUSGD [23] proposes matrix blocking-based MF solution
on GPUs. BIDMach [24] supports SGD-based MF and uses
GPUs as accelerators. CuMF_SGD [8] is the state-of-the-art.
CUMF_SGD. CuMF_SGD designs a high-performance GPU
kernel by fully exploiting GPU hardware characteristics, which
includes warp shuffle, memory coalescing, register usage,
and half-precision. CuMF_SGD fully utilizes the CPU-GPU
memory transfer bandwidth by simultaneously performing
the memory transfer and the computation. Specifically, when
scheduling blocks, the algorithm randomly selects multiple
consecutive blocks at a time instead of only one independent
block for the GPU. The strategy reduces the overhead of
CPU-GPU memory transfer. CuMF_SGD uses CUDA streams
to achieve this strategy. A stream contains a list of GPU
commands that are executed in serial. Commands in different
streams are executed in parallel if hardware resources permit.
CuMF_SGD kernel uses three streams to manage the CPU-
GPU memory transfer, the GPU kernel, and the GPU-CPU
memory transfer, respectively.
C. Other Related Works
Other SGD-based MF Algorithms. MF has been extensively
investigated in the literature and a large number of algorithms
were proposed for this problem. In addition to algorithms
on multi-core and GPU architecture mentioned before, Par-
allel SGD solutions have been discussed in distributed sys-
tems [22] [10] and parameter server [25].
Other MF Algorithms. In addition to SGD-based algo-
rithms, other methods like Coordinate Descent (CD) [17]
and Alternate Least Square (ALS) [16] are also proposed to
compute MF. They use different update rules in each iteration.
Specifically, ALS [16] updates one of the result matrices once
by fixing the other. Then, it updates the other result matrix
in the same way. An iteration is completed when two result
matrices are both updated. CD [17] updates one element in a
result matrix once by fixing all other elements in two result
matrices. An iteration is completed when all elements in result
matrices are updated by the same update rule.
IV. OUR APPROACH
In this section, we first give a straightforward method and
show its drawbacks. Then, we briefly introduce scheduling
algorithms for heterogeneous systems and propose our method
to balance workloads. Finally, an overview of our improved
algorithm is provided.
A. A Straightforward Method
A straightforward idea to utilize both CPU and GPU re-
sources is to treat a GPU kernel as a worker thread. Based
on this idea, we can adapt FPSGD [9] by regarding a GPU
as an additional worker thread. Similarly, to avoid the conflict
and obtain good training quality, a worker thread receives a
new block satisfying the two criteria in Section III-A once
it finishes processing a block. We apply the FPSGD and
CuMF_SGD algorithms to process a matrix block on a CPU
thread and a GPU kernel, respectively. This straightforward
algorithm shown above can be called HSGD.
Let nc and ng be the number of CPU threads and GPUs.
Following the matrix-division rule in [9], the number of blocks
should be at least (nc + ng + 1)× (nc + ng + 1). We refine
this formula and give a more precise matrix-division rule.
RULE 1. Given an input rating matrix R, nc CPU threads,
and ng GPUs, R should be divided into at least (nc + ng +
1)× (nc + ng) blocks.
We explain the rationale of this rule. When the number of
blocks is less than (nc + ng + 1) × (nc + ng), every thread
is always assigned the same block. As a result, only several
specific blocks are updated during the algorithm. Obviously,
this will lead to a terrible training result. Even worse, the
algorithm cannot fully exploit all worker threads. For example,
in Figure 2, assume that there are 4 threads. The block B1,1 is
assigned to thread 1, and the rest gray blocks are assigned to
other threads. When thread 1 finishes processing B1,1, the rest
gray blocks may still be occupied. To avoid conflicts, thread 1
has to continually process B1,1. By contrast, when the block
number increases to (nc + ng + 1)× (nc + ng), thread 1 can
always locate a spare row or column which is not occupied
by other blocks.
B. Motivation
Even though the HSGD successfully combines CPU and
GPU resources, we have several observations which can help
(i) improve the working efficiency of GPUs and (ii) balance
workloads for different hardware.
40
80
120
500 1000 1500 2000 2500
u
pd
at
e 
sp
ee
d 
(m
illi
on
 po
int
s/s
)
block size (thousand points)
(a) GPU
2.5
5
7.5
100 200 300 400
u
pd
at
e 
sp
ee
d 
(m
illi
on
 po
int
s/s
)
block size (thousand points)
(b) CPU
Fig. 3. Processing speed of GPUs and CPUs on blocks with different sizes
Exploiting Hardware Characteristics. GPUs and CPUs have
different features. We have two observations as follows.
OBSERVATION 1. In the context of MF, small blocks cannot
saturate the GPU computing power.
To indicate the relationship between the block size and the
efficiency of the GPU, we launch a GPU kernel with the
default configuration to process blocks with different sizes.
The details of configuration and hardware can be found in
Section VII. The GPU throughput is reported in Figure 3(a),
where the labels on the x-axis represent the number of
elements in a block, and the labels on the y-axis represent
the average number of elements processed in every second.
The throughput significantly increases when the block size
is relatively small. Afterward, the upward trend becomes gen-
tle as the block size continues to grow. This phenomenon may
be due to two main reasons. First, we need to transfer data via
the PCI-e bus to the global memory of GPU when launching a
GPU kernel. A small block cannot fully utilize the bandwidth
of the bus. Second, more data can better utilize all threads and
the cache mechanism of the GPU. More throughput details
about data transfer and GPU kernel execution can support our
opinion and will be shown in Section V-B.
OBSERVATION 2. In the context of MF, the computing power
of CPU cores is not sensitive to the block size.
The average number of elements processed by a thread of
CPU in every second is shown in Figure 3(b). Unlike the GPU,
the throughput of a CPU thread always remains stable when
the block size varies. This is because the worker threads of
CPU are relatively more independent than those of GPU and
the computing capability of a CPU thread is not so powerful,
compared with the whole GPU device.
Nonuniform Matrix Division. Based on the above two
observations, an immediate idea to improve the algorithmic
efficiency is to set the block size as large as possible. However,
as mentioned in Section IV-A, the input matrix should be
divided into at least (nc + ng + 1) × (nc + ng) blocks and
the block size under this division strategy is still relatively
small. For example, we use 16 CPU threads and a GPU
in the default configuration of our experiments. Considering
the real-world dataset Yahoo!Music with 1, 000, 990 rows and
624, 961 columns, we divide it into at least 18 × 17 blocks.
Consequently, the number of elements in each block is less
than 1 million, which is still not large enough to saturate the
GPU computing power in the light of Figure 3(a).
To improve the working efficiency of GPUs, we divide the
rating matrix into blocks with different sizes. The large ones
…
1 2 3 4 number of updates
0 ∞
B2,2
B1,1
B3,3 B3,4
B1,1
B2,2
B1,1
B2,2
B1,1
B2,2
B3,3 B3,4
Fig. 4. A running example of the straightforward algorithm given 2 CPU
cores and 1 GPU
are assigned to GPUs, and the small ones are assigned to
CPUs. Towards this end, there are several issues that we need
to address. For example, we need to answer how to set suitable
block sizes for both CPU and GPU, and how to divide the
rating matrix in practice. We will answer these questions in
the following section, and the final matrix division strategy
will be given in Section VI.
Workload Balance. As shown above, we should make the size
of blocks assigned to GPUs as large as possible in terms of im-
proving the working efficiency of GPUs. However, an extreme
nonuniform division strategy may cause a serious workload
imbalance problem, which can remarkably reduce the overall
efficiency when combining two hardware resources. To prevent
this issue, a scheduling strategy should be considered. Many
efforts have been done to balance workloads for the appli-
cations in heterogeneous systems. They can be categorized
into three kinds: (1) dynamic methods, (2) static methods by
classifier, and (3) static methods by cost models. Here, we
briefly review some of representative methods among them
and elaborate reasons why they cannot be straightforwardly
used in our problem. Then, we propose our own method. For
more details of scheduling strategies in heterogeneous systems,
interested readers can refer to surveys [26], [27].
(1) Dynamic Methods. The dynamic methods assign a new
task to a computational device according to the performance
of devices on previous tasks [28]–[35]. For example, [28]
maintains a double-ended queue. CPU processes the tasks
from the front of the queue. Simultaneously, GPU processes
the tasks from the reverse direction. The algorithm naturally
enables the dominant hardware resource to handle more tasks.
The baseline algorithm in Section IV-A essentially adopts a
dynamic strategy. If a CPU core or a GPU finishes updating a
block, it is assigned a new block without incurring any conflict.
The dynamic method performs well in other problems if there
is not any rule to assign tasks.
However, this type of method does not work well in our
problem due to the independence property of the task assign-
ment. As discussed in Section IV-A, we first equally divide
the matrix into a set of blocks, which is necessary to avoid
conflicts. GPUs cannot achieve an ideal performance in this
setting according to Observation 1. In addition to the problem
of GPU working efficiency, the dynamic update method in
HSGD may suffer from a poor training result. We give an
example to explain this problem.
EXAMPLE 3. We consider computing the matrix factorization
of a rating matrix R on a machine with two CPU threads
c1, c2 and one GPU g1. Assume that we divide R into 3 × 4
matrix blocks, which is shown in Figure 4. In the beginning,
two CPU threads c1, c2 and the GPU g1 get blocks B1,1,
An input rating matrix
ModelingCPU Cost Model GPU Cost Model
Partition workload Assign blocks by scheduler
CPU threads
GPU kernels
Offline preprocessing
Outputs!
Fig. 5. Overview of HSGD*
B2,2, and B3,3, respectively. Normally, the computing power
of a GPU is much stronger than that of a CPU thread. As
a result, when g1 has completed its task B3,3, c1 and c2 are
still working on their blocks B1,1 and B2,2. According to the
scheduling policy of HSGD, g1 will apply a new block which
is independent of B1,1 and B2,2 and has the least number
of updates. Block B3,4 satisfies these conditions. g1 picks
B3,4 in the second step of Figure 4. In the following steps,
g1 will continually update the two blocks in the lower right
corner since B1,1 and B2,2 are always occupied by c1 and c2.
This phenomenon makes the numbers of updates for different
blocks severely unbalanced, which is demonstrated in the last
matrix. The number of updates for a block is relatively large if
the corresponding color is dark. Compared with the original
SGD algorithm which updates matrix elements randomly, the
process in this example leads to a weak training result.
(2) Static Methods by Classifier. Static methods provide
scheduling decisions for worker threads of different devices
before applications start. This type of method establishes
a classifier based on training datasets in the offline phase
[36]–[39]. Given a new task, the classifier identifies the
class of the task and applies a corresponding strategy of task
assignment derived from the offline phase.
There are several drawbacks if applying the classifier-based
methods in our problem. First, it is complicated and time-
consuming to generate training data including static code fea-
tures and optimal partition strategies. Moreover, these methods
usually rely on specific frameworks such as OpenCL [40] and
Insieme [41] to obtain static code features. This makes them
not general. Second, they are usually designed for multi-task
platforms. Consequently, The partition scheme generated by
them is coarse-grained.
(3) Static Methods By Cost Model. This type of method esti-
mates CPU’s and GPU’s execution time by establishing a cost
model [11], [42]. A cost model is a function revealing the
relationship between the input data size and the corresponding
execution time of a specific worker thread. A representative
approach among them is Qilin [11]. It divides a training dataset
N into two parts N1 and N2, which are assigned to CPUs
and GPUs, respectively. N1 is further divided into m subparts
N1,1...N1,m. Each subpart N1,i is processed by a CPU thread,
and the corresponding execution time is recorded. A similar
operation is applied to N2 by GPUs. Qilin uses the curve
fitting to construct two linear equations as the projections of
the execution times for CPUs and GPUs respectively.
In the context of our problem, a simple linear function in
[11] is hard to accurately estimate the execution time of GPUs.
Recall Figure 3(a), it is not a horizontal line. This proves that
the execution time does not increase linearly as the number of
Algorithm 2: HSGD*
Input: Rm×n, k, λP , λQ, γ, t, nc, ng
Output: Pm×k, Qk×n
// Offline Phase
1 generate cost models of both CPUs and GPUs;
// Online Query Processing
2 partition Rm×n according to the cost models;
3 preprocess data;
4 init_scheduler(Rm×n, nc, ng);
5 scheduler assigns blocks to CPU threads and GPUs;
6 return Pm×k, Qk×n;
elements in a block grows.
Our Method to Balance Workloads. We propose a hybrid
method for our problem. We first customize a cost model to
divide the matrix specialized for the MF problem. We improve
the accuracy of the GPU cost model. The details of cost
models are given in Section V. Second, we have a dynamic
scheduling mechanism, which allows the dominant resource
to use work-stealing mechanism. This alleviates the deviation
between the cost model and the practical performance.
C. The Framework
In this section, we give an overview of our improved
algorithm in Figure 5, which is called HSGD*. Our method
contains an offline preprocessing phase and an online process-
ing phase. The offline phase (the gray area in Figure 5) derives
a cost model which estimates the hardware performance. This
step can be performed only once on a machine, and the
corresponding parameters are stored to support the query of
any input rating matrix in the online phase.
In the online phase, a sparse rating matrix R is given.
HSGD* first divides the matrix into two parts, denoted as
R1 and R2, based on cost models of CPUs and GPUs from
the offline phase. Then, R1 and R2 are further divided into
several blocks. Finally, the scheduler assigns blocks to worker
threads. The calculation process continues until the number
of iterations reaches the predefined value. During most of the
period when HSGD* runs, CPU threads are only allowed to
process blocks in R1, and GPUs are only allowed to process
blocks in R2. Similar to HSGD, the block assignment avoids
conflicts in the same row or column. We also have a dynamic
scheduling strategy to balance workloads in practice, which
is not reflected in Figure 5. The details will be shown in
Section VI. A formal pseudocode of the framework HSGD*
is reported in Algorithm 2, which is self-explanatory.
V. OUR COST MODEL
Given an input matrix R, let α and 1−α be the proportion
of the workload assigned to GPUs and CPUs, respectively,
where 0 ≤ α ≤ 1. We use Tg(α · R) and Tc((1 − α) · R)
to denote the time spent on updating elements in α · R and
(1−α)·R by a GPU and a CPU thread, respectively. When the
context is clear, Tg(α) and Tc(1− α) are used for short. The
total running time T of our algorithm is represented below.
T = max(
Tg(α)
ng
,
Tc(1− α)
nc
) (7)
Algorithm 3: Cost Estimation
Input: Rm×n
Output: fc and fg
// S is an array of segments
// Pc and Pg are arrays of structures
// Data preprocessing phase
1 S = sample_dataset(Rm×n);
// Training cost models
2 Pc = test_cpu_kernel(S);
3 fc = cpu_model_fitting(Pc);
4 f transferg = test_transfer_speed();
5 Pg = test_gpu_kernel(S);
6 fexecuteg = gpu_model_fitting(Pg);
7 fg = combine(f
transfer
g , f
execute
g );
Note that both Tg(α) and Tc(1− α) are monotonic. Based
on the computed cost functions, the total running time is
minimized when the load between resources keeps balancing.
We set α using the following equation.
α = argmin|Tg(α)
ng
− Tc(1− α)
nc
| (8)
Based on discussion above, our aim is to derive cost
functions fg(α) ≃ Tg(α) and fc(α) ≃ Tc(α) for a GPU
and a CPU thread, respectively, where fg(α) (resp. fc(α))
denotes the estimation of Tg(α) (resp. Tc(α)). As discussed
earlier, a straightforward method to establish a cost model is
to follow the works [11], [42] which think that execution time
is linearly related to the size of the input matrix. However,
based on Observation 1, we find that the processing speed of
GPUs increases when the block size increases in our problem.
This makes linear regression methods for the GPU cost model
inaccurate. Moreover, the computation in execution kernels of
GPUs and the data transfer are not completely serial due to
CUDA stream mechanism. The total running time of GPU is
not a simple sum of the kernel execution and the data transfer.
This observation makes us reconsider how execution kernel
and data transfer exactly influence the total running time of
GPUs, which is not discussed in previous cost models.
In the rest of this section, we introduce the strategy to
prepare the training data in Section V-A and propose our cost
model of GPUs in Section V-B. For the cost model of CPUs,
we use a linear function to estimate the performance similar
to [11]. A formal pseudocode to compute cost models is given
in Algorithm 3. We explain each step as follows.
A. Data Preparation and Training for CPUs
To derive the training data, we shuffle the input dataset
to avoid uneven data distribution. After the data is shuf-
fled, we equally divide input dataset into N disjoint parts
S1, S2, S3...SN , stored in array S at line 1 of Algorithm 3.
Then, CPU execution kernel configured with a single thread is
launched to compute on datasets generated in the last step at
line 2. Instead of computing on S1, S2, S3...SN respectively
in [11], CPU execution kernel computes on S1, S1+S2, S1 +
S2 + S3...S1 + S2 + S3 + ... + SN respectively, and the
corresponding execution time is recorded. After this, we get
 2.5
 5
 7.5
 10
 12.5
64KB 1MB 16MB 256MB
tr
an
sf
er
 sp
ee
d 
(G
B/
s)
data size
(a) CPU to GPU
 2.5
 5
 7.5
 10
 12.5
64KB 1MB 16MB 256MB
tr
an
sf
er
 sp
ee
d 
(G
B/
s)
data size
(b) GPU to CPU
Fig. 6. Transfer speed varies with block size
an array Pc including data size and corresponding execution
time. As a training data set, this array is used to curve fitting
for CPUs. Our adaptation generates a wider range of training
data which can better reflect the relationship between data size
and execution time. To eliminate noise, the execution time in
the training data is derived from the average of multiple tests.
B. Estimating Working Efficiency of GPUs
Given a task, the total processing time by a GPU is spent
on two parts: (1) data transfer between the CPU and the GPU
via the PCI-e bus, and (2) GPU execution kernels.
Data Transfer. Data transfer has two directions — from CPU
to GPU (sometimes called Host to Device) and from GPU
to CPU. We denote the times spent on them by f c⇒gg and
fg⇒cg , respectively. We only discuss the process to model the
data transfer from CPU to GPU, and the model for the other
direction is similar.
Figure 6 reveals the transfer speed for data with different
sizes. The transfer speed grows very fast in the beginning.
After the data size is larger than a threshold, the transfer speed
remains stable. Based on this phenomenon, we use a function
to fit the curve when the dataset is not very large, then use
the linear regression to model the rest. A formal model is
expressed as follows. |R| denotes the size of data transferred
from CPU to GPU, and τ denotes the threshold.
f c⇒gg =


|R|
a ·
√
log|R|+ b if |R| ≤ τ ;
a · |R|+ b otherwise.
The rationale for |R| ≤ τ is that the data transfer time
can be represented by the quotient of the data size and the
transfer speed. According to Figure 6, we use the function
a ·
√
log|R| + b to model the curve of the first stage. We
select this function since the trend performs like an inverse
function of the parabola. Note that the label distribution on the
x-axis is not linear. This is because the logarithmic scale can
make the trend clearly presented even though the data size is
small. Obviously, this phenomenon shows that we cannot fully
utilize the bandwidth of the PCI bus if the data is not large
enough, which supports Observation 1. Then, we determine
the threshold τ by the extent to transfer speed variation.
Empirically, when the variation of the transfer speed is less
than 2% in a time unit, we consider that the transfer speed
has been stable. Finally, we fit the curve as a linear function
for the second stage when |R| > τ . The curve fitting can be
done by using the least squares method.
GPU Execution Kernel. We design the cost model of the
GPU execution kernel. Similar to Figure 6, the throughput
40
80
120
500 1000 1500 2000 2500
u
pd
at
e 
sp
ee
d 
(m
illi
on
 po
int
s/s
)
block size (thousand points)
Fig. 7. Kernel execution time by varying data size
of updating remains stable after the block size reaches a
threshold, which means that the computing power of the
GPU is saturated. To fit the curves, we use a logarithmic
function when the dataset is not very large, and then use the
linear regression to model the rest. The growth trend of the
logarithmic function can be slower than the power function
(e.g.,
√
x), which is more consistent with the trend in Figure 7.
This is why we choose it to model the curve of the first stage.
A formal model is expressed as follows, where fkernelg denotes
the time spent onR by the GPU execution kernel. The function
a · log|R| + b represents the processing speed of the GPU
execution kernel.
fkernelg =


|R|
a · log|R|+ b if |R| ≤ τ ;
a · |R|+ b Otherwise.
Overall GPU Cost Model. We cannot simply sum the time
of the kernel execution and the data transfer as our estimation
for the overall execution time of the GPU, since these two
parts are not absolutely serial. Specifically, to improve the
overall working efficiency of GPUs, we adopt a widely used
optimization based on the CUDA stream mechanism. A CUDA
stream contains a list of GPU commands executed in serial,
and commands in different streams are executed in parallel
if hardware resources permit. At the same time, commands
in different streams can be synchronized. This mechanism
allows us to perform data transfer and kernel execution in
parallel without breaking correctness. We explain this idea in
the following example.
Process B
Stream 1: CPU to GPU
Stream 2: Execution Kernel
Stream 3: GPU to CPU
Process B’
B P Q B’ P’ Q’
P Q P’
…
…
…B’’ P’’ Q’’
Q’
Fig. 8. Data transfer optimization
EXAMPLE 4. As shown in Figure 8, we use three streams to
manage the data transfer from CPU to GPU, the kernel exe-
cution, and the data transfer from GPU to CPU, respectively.
Assume that a block B is assigned to the GPU. Stream 1
transfers B and corresponding P,Q to the global memory of
a GPU. Then, the GPU kernel scans the block B and updates
P,Q. Simultaneously, stream 1 continuously transfers the next
block B′ assigned to the GPU and corresponding P ′, Q′.
When stream 2 finishes B, stream 3 transfers the updated P,Q
back to CPU.
From this example, the overall time for GPU fg can be
roughly decided by the maximum time spent among these
three streams because it covers the time of the other two
…
CPU Blocks
GPU Blocks
α
1
−
α
nc + ng rows
ng rows
nc + ng rows in dynamic scheduling
nc + 2 x ng + 1 columns
Fig. 9. The final division strategy
parts. Note that although the first and the last schemes cannot
be overlapped by the maximal stream in the figure, the cost
can be ignored when the number of transferred and computed
blocks is very large. Note that fg⇒cg is always smaller than
f c⇒gg since we do not need to transfer blocks back to CPU.
Therefore, we define the overall cost model of a GPU as
follows.
fg = max(f
c⇒g
g , f
kernel
g ) (9)
The overall cost model of a GPU depends on the maximum
between data transfer time from CPU and GPU and execution
time of the GPU kernel.
VI. WORKLOAD BALANCE IN PRACTICE
A. Dynamic Scheduling
Even though we have proposed a tailored GPU cost model
for MF, the estimation may be still hard to exactly reflect
the computing power of devices given a different dataset.
The workloads of CPU and GPU may be unbalanced if we
assign blocks simply according to the cost model. To remedy
this issue, we adopt a dynamic scheduling strategy when one
device finishes its tasks. Specifically, assume that the GPU
has finished its tasks. Instead of waiting for the tasks being
processed by CPUs, we allow GPUs to pick some blocks
originally assigned to CPUs. We call it static phase when the
GPU and the CPU only process the originally assigned tasks
and call it dynamic phase when one of them finishes its own
tasks and is involved in processing tasks of the other. In static
phase, every GPU is assigned to blocks in specific rows so
that it can update one segment of one result matrix all the
time and avoid the transfer of this segment.
B. Putting Things Together
We explain our final strategy for the matrix division in this
section. An example is shown in Figure 9.
Number of Columns. Based on the cost model proposed in
Section V, we first partition the matrix into two sub-matrices
Rc and Rg for CPUs and GPUs, respectively. They are marked
by white and gray in Figure 9. In the light of Rule 1, we
further divide the matrix into nc + 2× ng + 1 columns. This
setting guarantees two things. The first thing is that GPUs
can always know not only the current block but also the next
block. This enables the overlap between computation and data
transfer in Figure 8. The second thing is that there always
exists a spare column when a GPU kernel or a CPU thread
finishes processing its block.
Number of Rows for CPUs. As shown in Rule 1, we can
divide the input matrix into nc + ng rows, where there are
nc (resp. ng) rows in Rc (resp. Rg). However, this division
strategy causes a problem when the dynamic scheduling is
activated. Specifically, assume that GPUs have finished their
own tasks and are involved in processing blocks of CPUs.
Currently, we have totally nc + ng (CPU and GPU) threads
working on (nc +2× ng + 1)× nc blocks. This would break
Rule 1 and cannot fully exploit all worker threads. To support
the assistance from GPUs, we set the number of rows of CPUs
as nc + ng based on Rule 1. The setting would not affect the
CPU efficiency since the computing power of CPUs is not
sensitive to the block size as shown in Observation 2.
Number of Rows for GPUs. Similarly, If CPUs first finish
their tasks and apply for blocks in Rg , the number of rows
in Rg should be at least nc + ng . However, compared with
the row number ng for GPUs, the row number nc + ng leads
to a smaller block size, which cannot saturate the computing
power of GPUs according to Observation 1. Different from the
case of CPUs, the division strategy for GPUs needs to satisfy
that the block size is large enough in the beginning, while the
number of rows is large enough to avoid conflicts if CPUs
join. To achieve this target, we divide Rg into ng rows. For
each row Rig where 1 ≤ i ≤ ng , we further divide Rig into
⌈ng+nc
ng
⌉ sub-rows. As a result, relatively large blocks with
sizes
Rg
(nc+2×ng+1)×ng
are assigned to GPUs in static phase,
and blocks with sizes
Rg
(nc+2×ng+1)×ng×⌈
ng+nc
ng
⌉
are assigned
to GPUs and CPUs in dynamic phase.
EXAMPLE 5. We give an example to explain the division
strategy for Rg . Assume that we have 2 GPUs and 4 CPU
threads, i.e., nc = 4, ng = 2. We divide Rg into 2 rows, and
each row is further divide into 3 sub-rows. In static phase, we
assign a block with size
Rg
9×2 to a GPU, and in dynamic phase,
we assign a block with size
Rg
9×6 to a GPU or a CPU thread.
On the other hand, Rc is divided into 9 columns and 6 rows.
This division for Rc would not change in the algorithm.
VII. EXPERIMENTS
We conduct extensive experiments to show the efficiency
and the effectiveness of our proposed algorithms. Algorithms
appearing in our experiments are summarized as follows.
• CPU-Only: Only CPU works. We uniformly divide the
matrix and use the strategy in [9] to assign blocks. More
details can be found in Section III-A. We use AVX and
OpenMP for acceleration.
• GPU-Only: Only GPU works. We vary the number of
rows and columns for the matrix division and adopt the
best one. The "-O3" optimization flag is supported.
• HSGD: CPU and GPU work in parallel. The algorithm
is introduced in Section IV-A. AVX, OpenMP, and "-O3"
optimization flag are supported.
• HSGD*: CPU and GPU work in parallel. Nonuniform
matrix division and dynamic strategy are used. Our cost
model decides the size of blocks assigned to two hard-
ware resources. AVX, OpenMP, and "-O3" optimization
flag are supported.
Stochastic gradient methods and the parameter k used in [9]
and [8] are different. To correctly combine two methods on
TABLE I
NETWORK STATISTICS AND PARAMETER SETTINGS
Datasets MovieLens Netflix R1 Yahoo!Music
m 71567 2649429 1948883 1000990
n 65133 17770 1101750 624961
#Training 9301274 99072112 104215016 252800275
#Test 698780 1408395 11364422 4003960
k 128 128 128 128
λP 0.05 0.05 1 1
λQ 0.05 0.05 1 1
γ 0.005 0.005 0.005 0.01
Heterogeneous CPU-GPU Systems, we embed the core part
of LIBMF2 and CuMF_SGD3 into our code and make minor
modifications to make the stochastic gradient methods they
use consistent. For stochastic gradient method, We choose to
use the more concise one in [9]. For the value of parameter k,
we choose the larger one in [8] because a large k value can
lead to a better training result.
Datasets and Parameter Setting. We evaluate algorithms in
four real-world datasets — MovieLens4, Netflix5, R16, and Ya-
hoo!Music7. Statistics of the datasets are presented in Table I.
For reproducibility, we consider the original training/test sets
in our experiments. More details about each dataset can be
found on the corresponding website. We set the parameters
following [43], which are also listed in Table I.
Experimental Environment. We use a machine with Intel
Xeon E5-2687W v3 3.10GHz processors and a Quadro P4000
GPU with 8GB global memory. The number of available cores
is 20. The system interface of the GPU is PCI Express 3.0×16.
The total bandwidth is 32GB/s. By default, we use 16 CPU
threads and 128 GPU parallel workers. Here, we follow the
definition of GPU parallel workers in [8], which means the
number of elements computed simultaneously in the GPU
kernel. All datasets can fit in memory in our experiments.
Organization. Section VII-A shows the adaptiveness of our
algorithms by varying the computing resources. Section VII-B
shows the effectiveness of our algorithm compared with the
state-or-the-art competitor. Section VII-C and Section VII-D
evaluate our optimization techniques including matrix division
strategy and workload balance.
A. Overall Efficiency
We evaluate the overall efficiency of our final algorithm
HSGD* with CPU-Only and GPU-Only as comparisons. We
use Root Mean Square Error (RMSE) 8 as a metric for the
loss, which is widely used in many recommender systems.
For each dataset, we terminate all algorithms and record
the corresponding running time when the RMSE reaches a
predefined value. Given that we use a different stochastic
gradient method from [8] and a different k value from [9],
the predefined loss values they used are not available. For fair
comparison, we select these values that can be reached by all
2https://github.com/cjlin1/libmf
3https://github.com/cuMF/cumf_sgd
4http://grouplens.org/datasets/movielens/
5https://www.kaggle.com/netflix-inc/netflix-prize-data
6https://webscope.sandbox.yahoo.com/catalog.php?datatype=r
7https://webscope.sandbox.yahoo.com/catalog.php?datatype=c
8https://en.wikipedia.org/wiki/Root-mean-square_deviation
CPU-Only GPU-Only HSGD*
 4
 8
 12
 16
 32 64  128  256  512
R
un
ni
ng
 T
im
e 
(s)
GPU parallel workers
(a) MovieLens
 8
 16
 24
 32
 32 64  128  256  512
R
un
ni
ng
 T
im
e 
(s)
GPU parallel workers
(b) Netflix
 10
 20
 30
 40
 50
 60
 70
 80
 32 64  128  256  512
R
un
ni
ng
 T
im
e 
(s)
GPU parallel workers
(c) R1
 80
 120
 160
 200
 32 64  128  256  512
R
un
ni
ng
 T
im
e 
(s)
GPU parallel workers
(d) Yahoo!Music
Fig. 10. Varying GPU Threads
methods including HSGD which suffers from a weak training
quality. The predefined loss values are 0.66, 0.82, 20, and 19
for MovieLens, Netflix, R1, and Yahoo!Music, respectively.
The comparisons between HSGD and HSGD* will be shown
in Section VII-C.
1) Varying GPU parallel workers: In this experiment, we
evaluate the adaptiveness of our algorithm by varying the
GPU parallel workers from 32 to 512. The running times of
algorithms for different GPU parallel workers are reported in
Figure 10 for four datasets. Note that the CPU thread number
is fixed to the default value 16.
As a reference, the running time of CPU-Only is stable on
all settings. Initially, the GPU-Only is slower than CPU-Only.
When we use more GPU threads, the running time of GPU-
Only decreases and overtakes that of CPU-Only. The running
time of HSGD* is the smallest among all algorithms. For ex-
ample, in R1, given 32 GPU worker threads, HSGD* takes 30
seconds while CPU-Only and GPU-Only take 33 seconds and
170 seconds respectively. When the thread number increases
to 512, HSGD* takes 14 seconds while GPU-Only takes 23
seconds. The decrease of the time of HSGD* also shows that
our algorithm is adaptive to different GPU settings and can
fully utilize the increasing computing power of GPUs.
2) Varying CPU Thread Number: In this experiment, we
evaluate the adaptiveness of our algorithm by varying the CPU
thread number from 4 to 16. The GPU parallel workers are
fixed to the default value 128. The running times of algorithms
for different CPU thread numbers are reported in Figure 11.
In contrast to Figure 10, the running time of GPU-Only is
consistent, and the running time of CPU-Only decreases when
we use more CPU threads. HSGD* is the fast algorithm on all
settings and all datasets. For example, in R1 when the CPU
thread number is 4, HSGD* takes 29 seconds, while CPU-
Only takes 109 seconds, and GPU-Only takes 48 seconds.
When the CPU thread number increases to 16, HSGD* takes
CPU-Only GPU-Only HSGD*
 4
 8
 16
 4  6  8  10  12  14  16
R
un
ni
ng
 T
im
e 
(s)
CPU Thread Number
(a) MovieLens
 20
 30
 40
 50
 4  6  8  10  12  14  16
R
un
ni
ng
 T
im
e 
(s)
CPU Thread Number
(b) Netflix
 10
 20
 30
 40
 50
 4  6  8  10  12  14  16
R
un
ni
ng
 T
im
e 
(s)
CPU Thread Number
(c) R1
 100
 140
 180
 220
 4  6  8  10  12  14  16
R
un
ni
ng
 T
im
e 
(s)
CPU Thread Number
(d) Yahoo!Music
Fig. 11. Varying CPU Threads
only 20 seconds while CPU-Only takes 33 seconds. The
decrease of the time of HSGD* shows that our algorithm is
adaptive to different CPU settings and can fully utilize the
increasing computing power of CPUs.
Figure 11 and Figure 10 show the high efficiency of
HSGD*. When the gap between the computing power of CPU
and GPU is limited (default setting), HSGD* achieves a 1.4-
2.3x speedup over CPU-Only and a 1.4-2.3x speedup over
GPU-Only on all datasets. The experiments also show that
the overhead cost of HSGD* is minor. When the gap between
the computing power of CPU and GPU is large, e.g., CPU uses
16 threads and GPU uses 512 parallel workers in Figure 10,
HSGD* still achieves a slight speedup over GPU-Only.
B. Training Quality
In this experiment, we report the derived loss values
(RMSE) of our algorithm HSGD* during the training process
to show the effectiveness of our method. The experiment will
demonstrate the loss value of our algorithm finally converges
to a reasonable value. CPU-Only and GPU-Only are also
compared as references.
The results are shown in Figure 12. The downward trends
of HSGD* are obvious, and the loss of HSGD* converges
in the shortest time. In addition, HSGD* achieves a similar
converged loss value compared with other algorithms, which
shows the effectiveness of our algorithm. For example, in
Yahoo!Music, the loss of HSGD* drops to 23 in 10 seconds.
At the same time, the loss values of CPU-Only and GPU-
Only are 25.2 and 25, respectively. When time increases to 25
seconds, the loss of HSGD*, CPU-Only, and GPU-Only drops
to 20, 22.5, and 22.3, respectively. Finally, all loss values of
HSGD*, CPU-Only, and GPU-Only stay about 19.
C. Matrix Division Strategy
We evaluate the effectiveness of our matrix division strategy
in this experiment. For each dataset, we record the loss
CPU-Only GPU-Only HSGD*
 0.6
 0.7
 0.8
 0.9
 1
 0  1.5  3  4.5  6
Te
st
 R
M
SE
Running Time (s)
(a) MovieLens
 0.75
 0.8
 0.85
 0.9
 0.95
 1
 0  10  20  30  40
Te
st
 R
M
SE
Running Time (s)
(b) Netflix
 14
 18
 22
 26
 30
 0  25  50  75  100
Te
st
 R
M
SE
Running Time (s)
(c) R1
 18
 20
 22
 24
 26
 28
 30
 0  25  50  75  100
Te
st
 R
M
SE
Running Time (s)
(d) Yahoo!Music
Fig. 12. Test RMSE over training time on four datasets
(RMSE) value on different running times of HSGD* with
HSGD as a comparison. The result is shown in Figure 13.
We can see that the training quality of HSGD is poor
especially when processing relatively large datasets. This
phenomenon is consistent with the discussion in Example 3.
The nonuniform matrix division in HSGD* fixes this issue.
Given the same running time, HSGD* derives a smaller loss
value than HSGD, and the advantage of HSGD* is obvious
especially in large datasets. For example, given 50 seconds in
R1, the RMSE value for HSGD* reaches to 17, while that for
HSGD is only 21. The result proves that nonuniform matrix
division can utilize GPU resources better.
TABLE II
COMPARISON OF COST MODELS
Datasets MovieLens Netflix R1 Yahoo!Music
Workload proportion
HSGD*-Q
C 49.56% 55.98% 56.07% 56.46%
G 50.44% 44.02% 43.93% 43.54%
HSGD*-M
C 55.91% 49.02% 49.75% 53.61%
G 44.09% 50.98% 50.25% 46.39%
Running time
HSGD*-Q 0.92 s 15.87 s 13.07 s 40.88 s
HSGD*-M 0.89 s 13.02 s 12.08 s 35.41 s
D. Workload Balance
We evaluate the effectiveness of techniques used to balance
workloads in this section.
1) Cost Models: To show the effectiveness of our cost
models, we report the proportion of workloads derived by our
cost models with [11] as a comparison in Table II. To clearly
reflect the algorithmic efficiency based on two cost models, we
make these two methods run the same number of iterations,
which is 20 in this experiment.
In Table II, HSGD*-Q represents the algorithm HSGD*
which uses Qilin [11] to evaluate the working efficiency of
hardware. HSGD*-M represents the algorithm HSGD* which
uses our model in Section V to evaluate the working efficiency
HSGD HSGD*
 0.6
 0.7
 0.8
 0.9
 1
 0  1.5  3  4.5  6
Te
st
 R
M
SE
Running Time (s)
(a) MovieLens
 0.75
 0.8
 0.85
 0.9
 0.95
 1
 0  10  20  30  40
Te
st
 R
M
SE
Running Time (s)
(b) Netflix
 14
 18
 22
 26
 30
 0  25  50  75
Te
st
 R
M
SE
Running Time (s)
(c) R1
 18
 20
 22
 24
 26
 28
 30
 0  25  50  75  100
Te
st
 R
M
SE
Running Time (s)
(d) Yahoo!Music
Fig. 13. Test RMSE over training time
of hardware. Note that for fairness, both HSGD*-Q and
HSGD*-M do not include the dynamic scheduling strategy
in Section VI to further balance workloads. "C" and "G" in
the table represent the assigned proportion of workloads to
CPUs and GPUs, respectively.
The practical running times of HSGD*-Q and HSGD*-M
are also reported. The running time of HSGD*-M is smaller
than that of HSGD*-Q on all datasets. This result proves
that our cost model can derive a more accurate estimation
for the working efficiency of hardware. We can find that
HSGD*-M prefers to assign more work to GPU compared with
HSGD*-Q on all datasets except MovieLens. For MovieLens,
HSGD*-M observes that the performance of GPU is not strong
when processing a small dataset (Observation 1). Therefore,
it assigns less work to GPU. The effectiveness of HSGD*-
M becomes considerable when the dataset is large. Note that
given a smaller target loss, both algorithms will require more
iterations, and the advantage of our method will become
obvious. For example, to achieve the predefined loss value of
Yahoo!Music in Section VII-A, HSGD* needs 46 iterations,
which is more than twice that of this experiment.
TABLE III
EFFECTIVENESS OF DYNAMIC SCHEDULING
Dataset HSGD*-M HSGD*
MovieLens 0.89 s 0.84 s
Netflix 13.02 s 11.42 s
R1 12.08 s 10.58 s
Yahoo!Music 35.41 s 30.96 s
2) Dynamic Scheduling: We evaluate the effectiveness of
the dynamic scheduling strategy (Section VI). Similar to the
experiment for cost models, we use HSGD*-M to denote our
final algorithm without the dynamic scheduling technique to
further balance workloads. The running times of HSGD*-M
and HSGD* on all datasets are shown in Table III.
The result shows HSGD* is faster than HSGD*-M on all
datasets. Note that in MovieLens with relatively small size,
the computing power of GPU cannot be saturated, which
degrades the effectiveness of the dynamic scheduling. As a
result, the improvement of dynamic scheduling on MovieLens
is minor. By combining the new cost models and the dynamic
scheduling strategy, our final algorithm achieves a significant
improvement in balancing workloads of MF.
VIII. CONCLUSION
We propose an efficient parallel MF algorithm on hetero-
geneous CPU-GPU systems. To fully utilizes the processing
power of GPUs, our approach divides the input matrix using
a nonuniform method and assigns large blocks to GPUs. We
design a new cost model to estimate the performance of
computing resources. We balance workloads by combining the
cost models and dynamic scheduling in practice.
REFERENCES
[1] D. Lian, C. Zhao, X. Xie, G. Sun, E. Chen, and Y. Rui, “Geomf:
joint geographical modeling and matrix factorization for point-of-interest
recommendation,” in KDD, 2014, pp. 831–840.
[2] X. He, H. Zhang, M.-Y. Kan, and T.-S. Chua, “Fast matrix factorization
for online recommendation with implicit feedback,” in SIGIR, 2016, pp.
549–558.
[3] S. Li, J. Kawale, and Y. Fu, “Predicting user behavior in display
advertising via dynamic collective matrix factorization,” in SIGIR, 2015,
pp. 875–878.
[4] M. T. Al Amin, C. Aggarwal, S. Yao, T. Abdelzaher, and L. Kaplan,
“Unveiling polarization in social networks: A matrix factorization ap-
proach,” in INFOCOM, 2017, pp. 1–9.
[5] J. Qiu, Y. Dong, H. Ma, J. Li, K. Wang, and J. Tang, “Network
embedding as matrix factorization: Unifying deepwalk, line, pte, and
node2vec,” in WSDM, 2018, pp. 459–467.
[6] J. Pennington, R. Socher, and C. Manning, “Glove: Global vectors for
word representation,” in EMNLP, 2014, pp. 1532–1543.
[7] P. Cui, X. Wang, J. Pei, and W. Zhu, “A survey on network embedding,”
TKDE, vol. 31, no. 5, pp. 833–852, 2019.
[8] X. Xie, W. Tan, L. L. Fong, and Y. Liang, “Cumf_sgd: Parallelized
stochastic gradient descent for matrix factorization on gpus,” in HPDC,
2017, pp. 79–92.
[9] Y. Zhuang, W.-S. Chin, Y.-C. Juan, and C.-J. Lin, “A fast parallel sgd
for matrix factorization in shared memory systems,” in ACM RecSys,
2013, pp. 249–256.
[10] H. Yun, H.-F. Yu, C.-J. Hsieh, S. Vishwanathan, and I. Dhillon, “Nomad:
Non-locking, stochastic multi-machine algorithm for asynchronous and
decentralized matrix completion,” PVLDB, vol. 7, no. 11, pp. 975–986,
2014.
[11] C.-K. Luk, S. Hong, and H. Kim, “Qilin: exploiting parallelism on
heterogeneous multiprocessors with adaptive mapping,” in MICRO,
2009, pp. 45–55.
[12] P. Pandit and R. Govindarajan, “Fluidic kernels: Cooperative execution
of opencl programs on multiple heterogeneous devices,” in CGO. ACM,
2014, p. 273.
[13] E. Stehle and H.-A. Jacobsen, “A memory bandwidth-efficient hybrid
radix sort on gpus,” in SIGMOD, 2017, pp. 417–432.
[14] R. D. Blumofe and C. E. Leiserson, “Scheduling multithreaded compu-
tations by work stealing,” JACM, vol. 46, no. 5, pp. 720–748, 1999.
[15] S. Funk, “Netflix update: Try this at home,”
https://sifter.org/~simon/journal/20061211.html, 2006.
[16] Y. Koren, R. Bell, and C. Volinsky, “Matrix factorization techniques for
recommender systems,” Computer, no. 8, pp. 30–37, 2009.
[17] H.-F. Yu, C.-J. Hsieh, S. Si, and I. Dhillon, “Scalable coordinate descent
approaches to parallel matrix factorization for recommender systems,”
in ICDM, 2012, pp. 765–774.
[18] Z. Jia, M. Maggioni, B. Staiger, and D. P. Scarpazza, “Dissecting the
nvidia volta gpu architecture via microbenchmarking,” arXiv preprint
arXiv:1804.06826, 2018.
[19] B. Recht, C. Re, S. Wright, and F. Niu, “Hogwild: A lock-free approach
to parallelizing stochastic gradient descent,” in NIPS, 2011, pp. 693–701.
[20] R. Gemulla, E. Nijkamp, P. J. Haas, and Y. Sismanis, “Large-scale matrix
factorization with distributed stochastic gradient descent,” in KDD, 2011,
pp. 69–77.
[21] J. Oh, W.-S. Han, H. Yu, and X. Jiang, “Fast and robust parallel sgd
matrix factorization,” in KDD, 2015, pp. 865–874.
[22] B. Li, S. Tata, and Y. Sismanis, “Sparkler: Supporting large-scale matrix
factorization,” in EDBT, 2013, pp. 625–636.
[23] J. Jin, S. Lai, S. Hu, J. Lin, and X. Lin, “Gpusgd: A gpu-accelerated
stochastic gradient descent algorithm for matrix factorization,” Concur-
rency and Computation: Practice and Experience, vol. 28, no. 14, pp.
3844–3865, 2016.
[24] J. Canny and H. Zhao, “Bidmach: Large-scale learning with zero
memory allocation,” in BigLearning, NIPS Workshop, 2013.
[25] E. Zhong, Y. Shi, N. Liu, and S. Rajan, “Scaling factorization machines
with parameter server,” in CIKM, 2016, pp. 1583–1592.
[26] M. Pradeep, J. M. Navin, and B. Kannappan, “A review of schedul-
ing mechanisms for heterogeneous multi-core machines,” International
Journal of Pure and Applied Mathematics, vol. 120, no. 6, pp. 13–24,
2018.
[27] S. Mittal and J. S. Vetter, “A survey of cpu-gpu heterogeneous computing
techniques,” CSUR, vol. 47, no. 4, p. 69, 2015.
[28] R. A. Rossi and R. Zhou, “Leveraging multiple gpus and cpus for
graphlet counting in large networks,” in CIKM, 2016, pp. 1783–1792.
[29] V. J. Jiménez, L. Vilanova, I. Gelado, M. Gil, G. Fursin, and N. Navarro,
“Predictive runtime code scheduling for heterogeneous architectures,” in
HiPEAC, 2009, pp. 19–33.
[30] V. Leis, P. Boncz, A. Kemper, and T. Neumann, “Morsel-driven par-
allelism: a numa-aware query evaluation framework for the many-core
age,” in SIGMOD, 2014, pp. 743–754.
[31] M. E. Belviranli, L. N. Bhuyan, and R. Gupta, “A dynamic self-
scheduling scheme for heterogeneous multiprocessor architectures,”
TACO, vol. 9, no. 4, p. 57, 2013.
[32] M. Boyer, K. Skadron, S. Che, and N. Jayasena, “Load balancing in a
changing world: dealing with heterogeneity and performance variability,”
in CF, 2013, p. 21.
[33] H. J. Choi, D. O. Son, S. G. Kang, J. M. Kim, H.-H. Lee, and C. H.
Kim, “An efficient scheduling scheme using estimated execution time
for heterogeneous computing systems,” The Journal of Supercomputing,
vol. 65, no. 2, pp. 886–902, 2013.
[34] Z. Wang, L. Zheng, Q. Chen, and M. Guo, “Cap: co-scheduling based on
asymptotic profiling in cpu+ gpu hybrid systems,” in Proceedings of the
2013 International Workshop on Programming Models and Applications
for Multicores and Manycores, 2013, pp. 107–114.
[35] R. Kaleem, R. Barik, T. Shpeisman, C. Hu, B. T. Lewis, and K. Pingali,
“Adaptive heterogeneous scheduling for integrated gpus,” in PACT.
IEEE, 2014, pp. 151–162.
[36] D. Grewe and M. F. OâA˘Z´Boyle, “A static task partitioning approach
for heterogeneous systems using opencl,” in CC, 2011, pp. 286–305.
[37] K. Kofler, I. Grasso, B. Cosenza, and T. Fahringer, “An automatic input-
sensitive approach for heterogeneous task partitioning,” in ICS, 2013, pp.
149–160.
[38] Y. Wen, Z. Wang, and M. F. O’boyle, “Smart multi-task scheduling for
opencl programs on cpu/gpu heterogeneous platforms,” in HiPC, 2014,
pp. 1–10.
[39] A. Ghose, S. Dey, P. Mitra, and M. Chaudhuri, “Divergence aware
automated partitioning of opencl workloads,” in ISEC, 2016, pp. 131–
135.
[40] J. E. Stone, D. Gohara, and G. Shi, “Opencl: A parallel programming
standard for heterogeneous computing systems,” Computing in science
& engineering, vol. 12, no. 3, p. 66, 2010.
[41] “Insieme compiler runtime framework,” http://insieme-compiler.org.
[42] J. Lee, M. Samadi, and S. Mahlke, “Orchestrating multiple data-parallel
kernels on multiple devices,” in PACT, 2015, pp. 355–366.
[43] W.-S. Chin, Y. Zhuang, Y.-C. Juan, and C.-J. Lin, “A learning-rate
schedule for stochastic gradient methods to matrix factorization,” in
PAKDD, 2015, pp. 442–455.
