On Linear Learning with Manycore Processors by Wszola, Eliza et al.
On Linear Learning with Manycore Processors
Eliza Wszola
ETH Zurich
Zurich, Switzerland
eliza.wszola@inf.ethz.ch
Celestine Mendler-Du¨nner
IBM Research
Zurich, Switzerland
cdu@zurich.ibm.com
Martin Jaggi
EPFL
Lausanne, Switzerland
martin.jaggi@epfl.ch
Markus Pu¨schel
ETH Zurich
Zurich, Switzerland
pueschel@inf.ethz.ch
Abstract—A new generation of manycore processors is on the
rise that offers dozens and more cores on a chip and, in a sense,
fuses host processor and accelerator. In this paper we target the
efficient training of generalized linear models on these machines.
We propose a novel approach for achieving parallelism which
we call Heterogeneous Tasks on Homogeneous Cores (HTHC).
It divides the problem into multiple fundamentally different
tasks, which themselves are parallelized. For evaluation, we
design a detailed, architecture-cognizant implementation of our
scheme on a recent 72-core Knights Landing processor that is
adaptive to the cache, memory, and core structure. Our library
efficiently supports dense and sparse datasets as well as 4-
bit quantized data for further possible gains in performance.
We show benchmarks for Lasso and SVM with different data
sets against straightforward parallel implementations and prior
software. In particular, for Lasso on dense data, we improve the
state-of-the-art by an order of magnitude.
Index Terms—Manycore, performance, machine learning, co-
ordinate descent, GLM, SVM, Lasso
I. INTRODUCTION
The evolution of mainstream computing systems has moved
from the multicore to the manycore area. This means that
a few dozen to even hundreds of cores are provided on a
single chip, packaged with up to hundreds of gigabytes of
memory at high bandwidth. Examples include Intel Xeon Phi
(up to 72 cores), ARM ThunderX2 (64 cores), Qualcomm
Centriq 2400 (48 cores), and of course GPUs (100s of cores).
One declared target of the recent generation of manycores
is machine learning. While much work has been devoted to
efficient learning and inference of neural nets on GPUs, e.
g. [1], [2], other domains of machine learning and manycores
have received less attention.
One exciting trend in manycore is the move from ac-
celerators (like GPUs) to standalone manycore processors.
These remove the burden of writing two types of code and
enable easier integration with applications and legacy code.
However, the efficient mapping of the required mathematics
to manycores is a difficult task as compilers have inherent
limitations to perform it given straightforward C (or worse,
Java, Python, etc.) code, a problem that has been known
already for the earlier, simpler multicore and single core
systems [3]. Challenges include vector instruction sets, deep
cache hierarchies, non-uniform memory architectures, and
efficient parallelization.
The challenge we address in this paper is how to map ma-
chine learning workloads to manycore processors. We focus on
recent standalone manycores and the important task of training
generalized linear models used for regression, classification,
and feature selection. Our core contribution is to show that in
contrast to prior approaches, which assign the same kind of
subtask to each core, we can often achieve significantly better
overall performance and adaptivity to the system resources, by
distinguishing between two fundamentally different tasks. A
subset of the cores will be assigned a task A that only reads the
model parameters, while the other subset of cores will perform
a task B that updates them. So in the manycore setting, while
the cores are homogeneous, we show that assigning them het-
erogeneous tasks results in improved performance and use of
compute, memory, and cache resources. The adaptivity of our
approach is particularly crucial; the number and assignment
of threads can be adapted to the computing platform and the
problem at hand.
We make the following contributions:
1) We describe a novel scheme, consisting of two hetero-
geneous tasks, to train generalized linear models on ho-
mogeneous manycore systems. We call it Heterogeneous
Tasks on Homogeneous Cores (HTHC).
2) We provide a complete, performance-optimized imple-
mentation of HTHC on a 72-core Intel Xeon Phi pro-
cessor. Our library supports both sparse and dense data
as well as data quantized to 4 bits for further possible
gains in performance. Our code will be made publicly
available.
3) We present a model for choosing the best distribution
of threads for each task with respect to the machine’s
memory system. We demonstrate that with explicit con-
trol over parallelism our approach provides an order
of magnitude speedup over a straightforward OpenMP
implementation.
4) We show benchmarks for Lasso and SVM with different
data sets against straightforward parallel implementations
and prior software. In particular, for Lasso on dense data,
we improve the state-of-the-art by an order of magnitude.
II. PROBLEM STATEMENT & BACKGROUND
This section details the considered problem class, provides
necessary background on coordinate selection and introduces
our target platform: the Intel Knights Landing (KNL) many-
core processor.
ar
X
iv
:1
90
5.
00
62
6v
3 
 [c
s.P
F]
  2
6 J
ul 
20
19
A. Problem specification
We focus on the training of generalized linear models
(GLMs) which can be expressed as the following optimization
task:
min
α∈Rn
F(α) := f(Dα) +
∑
i∈[n]
gi(αi), (1)
where [n] = {1, . . . , n}, f and gi are convex functions, and
α ∈ Rn is the model to be learned from the training data
matrix D ∈ Rd×n with columns d1, . . . ,dn. In addition, f is
assumed to be smooth. This general setup covers many widely
applied machine learning models including logistic regression,
support vector machines (SVM), and sparse models such as
Lasso and elastic-net.
The models of this form have two important character-
istics. First, since g =
∑
i gi(αi) is separable, it is pos-
sible to perform updates on different column of the data
matrix independently. In particular, we can use stochastic
coordinate descent (SCD) to process the data set coordinate-
wise. This approach can be extended to batches. Second,
importance measures for individual coordinates are available.
Such measures depend either on the dataset, the current model
parameters or both. They can be used for the selection of
important coordinates during coordinate descent, speeding up
overall convergence, either by deriving sampling probabilities
(importance sampling) or by simply picking the parameters
with the highest importance score (greedy approach).
B. Duality-gap based coordinate selection
A particular measure of coordinate-wise importance that we
will adopt in our method, is the coordinate-wise duality gap
certificate proposed by Dnner et al. [4]. The authors have
shown that choosing model parameters to update based on
their contribution to the duality gap provides faster conver-
gence than random selection and classical importance sam-
pling [5].
Let g∗i denote the convex conjugates of gi. Then, the duality
gap (see [6]) of our objective (1) can be determined as
gap(α;w) =
∑
i∈[n]
gapi(αi;w), with
gapi(αi;w) := αi〈w,di〉+ gi(αi) + gi*(−〈w,di〉), (2)
where the model vector α,w are related through the primal-
dual mapping w := ∇f(Dα). Importantly, knowing the
parameters α;w, it is possible to calculate the duality gap
values (2) for every i ∈ [n] independently and thus in
parallel. In our implementation, we introduce the shared vector
v = Dα from which w can be computed using a simple linear
transformation for many problems of interest.
C. The Knights Landing architecture
Intel Knights Landing (KNL) is a manycore processor
architecture used in the second generation Intel Xeon Phi
devices, the first host processors, i.e., not external accelerators,
offered in this line. It provides both high performance (with
machine learning as one declared target) and x86 backwards
compatibility. A KNL processor consists of 64–72 cores with
low base frequency (1.3–1.5 GHz). KNL offers AVX-512, a
vector instruction set for 512-bit data words, which allows
parallel computation on 16 single or 8 double precision floats.
It also supports vector FMA (fused multiply-add) instructions
(e.g., d = ab + c) for further fine-grained parallelism. Each
core can issue two such instructions per cycle, which yields a
theoretical single precision peak performance of 64 floating
point operations (flops) per cycle. Additionally, AVX-512
introduces gather-scatter intrinsics facilitating computations on
sparse data formats. The KNL cores are joined in pairs called
tiles located on a 2D mesh. Each core has its own 32 KB L1
cache and each tile has a 1 MB L2 cache. The latter supports
two reads and one write every two cycles. This bandwidth
is shared between two cores. Each core can host up to four
hardware threads. KNL comes with two types of memory: up
to 384 GB of DRAM (6 channels with an aggregate bandwidth
of 80 GB/s as measured with the STREAM benchmark [7])
and 16 GB of high-bandwidth MCDRAM (8 channels and up
to 440 GB/s respectively). The MCDRAM is configurable to
work in one of three different modes: 1) cache mode in which
it is used as L3 cache, 2) flat mode in which it serves as a
scratchpad, i.e., a software-controlled memory (in this mode,
there is no L3 cache), 3) hybrid mode in which part is working
in cache mode and part in flat mode. In this paper, we use
a KNL with 72 cores, 1.5 GHz base frequency, and 192 GB
of DRAM in flat mode. The flat mode allows us to clearly
separate the memory needed by the subtasks characterized in
the next section.
III. METHOD DESCRIPTION
Our scheme uses block asynchronous coordinate descent
with duality-gap based selection as described in Section II-B.
The workflow, as illustrated in Figure 1, can be described as
two tasks A and B running in parallel. Task A is responsible
for computing duality gap values gapi based on the current
model α and the shared vector v. These values are then
stored in a vector z ∈ Rn which we call gap memory. In
parallel to task A, task B performs updates on a subset of
m coordinates, which are selected based on their importance
measure. For computing the updates on B we opt to use
parallel asynchronous SCD. Since B operates only on batches
of data, it is typically faster than A. Therefore, it is very likely
that A is not able to update all gaps during a single execution
of task B and some entries of the gap memory become stale
as the algorithm proceeds. In practice, the algorithm works
in epochs. In each epoch, B updates the batch of selected
coordinates, where each coordinate is processed exactly once.
At the same time, A randomly samples coordinates and
computes gapi with the most recent (i.e., obtained in the
previous epoch) parameters α,v and updates the respective
coordinate zi of the gap memory. As soon as the work of B
is completed, it returns the updated α and v to A. A pauses
its execution to select a new subset of coordinates to send
to B, based on the current state of the gap memory z. The
Fig. 1: Visualization of the HTHC approach.
robustness to staleness in the duality gap based coordinate
selection scheme has been empirically shown in [4].
A. Implementation challenges
We identify the most computationally expensive parts of
the proposed scheme. Task A computes the coordinate-wise
duality gaps (2), which requires an inner product between w,
computed from v, and the respective data column di:
zi = h(〈w,di〉, αi). (3)
h is a scalar function defined by the learning model with
negligible evaluation cost.
Task B performs coordinate descent on the selected subset
of the data. Thus, in each iteration of the algorithm, a
coordinate descent update on one entry of α is performed,
i.e., α+i = αi + δ. Again this update takes the form
δ = hˆ(〈w,di〉, αi), (4)
where hˆ is a scalar function. The optimal coordinate update
has a closed-form solution [8], [9] for many applications of
interest, and otherwise allows a simple gradient-step restricted
to the coordinate i. With every update on α we also update v
accordingly: v+ = v + δdi, to keep these vectors consistent.
The asynchronous implementation of SCD introduces two
additional challenges: First, staleness of the model information
v used to compute updates might slow down convergence or
even lead to divergence for a large number of parallel updates.
Second, writing to shared data requires synchronization, and
generates write-contention which needs to be handled by
appropriate locking.
IV. IMPLEMENTATION ON MANYCORE PROCESSORS
The main contribution of this paper is to show that a scheme
for learning GLMs based on multiple heterogeneous tasks
is an efficient solution for implementation on a standalone,
state-of-the-art manycore system such as KNL. As we will
see, our approach is typically over an order of magnitude
faster than simple C++ code with basic OpenMP directives.
Due to the need for careful synchronization, locking and
separation of resources, a straightforward implementation is
not efficient in the manycore setting: a detailed calibration
to the hardware resources and an associated implementation
with detailed thread control is the key. In the following we
will detail the challenges and key features for achieving an
efficient implementation.
A. Parallelization of the workload
Our implementation uses four levels of parallelism: 1) A
and B are executed in parallel. 2) A performs updates of zi
in parallel and B performs parallel coordinate updates. 3) B
uses multiple threads for each vector operation. 4) The main
computations are vectorized using AVX-512.
1) Allocation of resources to the tasks: To map HTHC
onto the KNL we divide compute and memory resources
among the two tasks A and B. We divide the compute
resources by assigning separate sets of cores (in fact tiles for
better data sharing) to each task. The respective number of
cores is a parameter that makes our implementation adaptive
to the problem and target platform. We use the threading
library pthreads for fine-grained control over thread affinity,
synchronization, and locking over critical regions of the code.
For more details we refer to Section IV-F. To split the memory
resources between the two tasks we use the KNL in flat
mode where the MCDRAM memory serves as a scratchpad.
This setting is particularly suited for our scheme because
we can allocate the data for A to DRAM and the data for
B to MCDRAM. As a consequence, saturating the memory
bandwidth by one task will not stall the other.
2) Parallelization of the individual tasks: For A, we use
only one thread for every update of a single zi due to the high
risk of deadlocks when computations on B are finished and A
receives a signal to stop. The number TA of threads used for
A is a parameter used for adaptation.
In contrast to A, B performs TB updates in parallel and
also parallelizes the inner product computation of each update
across VB threads. Thus, the total number of threads used by B
is TB · VB. Both are parameters in our implementation. When
VB threads are used per update, v and the corresponding di
are split into equal chunks.
A simple model can be used to determine a good choice
for VB as explained next. The performance of both the inner
product and the v update is limited by the memory bandwidth.
For this reason, it is desirable that v, which is reused, stays
in cache. To achieve this, the cache has to hold v and two
columns di, dj . Since v and di have the same length, this
means the chunk size should be about a third of the cache size,
i.e., about 87,000 single precision numbers for the L2 caches
in KNL. Optimizing with the same reasoning for the 32KB L1
cache would yield a length of v below 4096 elements. Such
short vectors would not benefit from parallelism due to issues
discussed later. Thus, we do not consider this setup applicable
to the L1 caches. The best choice for TB is influenced by
several factors as will be discussed in Section IV-F.
3) Vectorization with AVX-512: We implemented both the
scalar product (executed on both A and B) and the in-
crementation of v (performed on B) using AVX-512 FMA
intrinsics with multiple accumulators for better instruction-
level parallelism. The peak single core performance of KNL
is 64 flops/cycle, but in the scalar product, each FMA requires
two loads from L2 cache, reducing the peak to 16. In practice,
our entire coordinate update achieves about 7.2 flops/cycle,
about three times faster than without AVX.
B. Synchronization
Task A does not write to shared variables and thus requires
no synchronization between threads. In contrast, the updates
on B are performed with multiple threads per vector as
explained above. For the updates in (4), three barriers are
required to separate the resetting of the shared result from
the scalar product and the computation of hˆ based on the new
shared result.
For the implementation we use pthreads which provides
synchronization mechanisms with mutexes and thread barriers.
Since barriers are relatively expensive, we replace them with
a mechanism based on integer counters protected by mutexes
similar to [10].
In addition to synchronization per thread, we need to
coordinate running and stopping the tasks at the beginning and
the end of each epoch t (see Fig. 1). To avoid the overhead of
creating and destroying threads, we use a thread pool with a
constant number of threads for A and B. To synchronize, we
use another counter-based barrier scheme similar to the one
described above.
C. Atomic operations
We enforce atomic updates to the shared vector v to
preserve the primal-dual relationship between w and α and
thus maintain the convergence guarantees of asynchronous
SCD derived by Hsieh et al. [11]. The pthreads library does
not provide atomic operations, but the mutexes can be used
to lock chosen variables. To avoid overhead, we use medium-
grained locks for chunks of 1024 vector elements.
D. Sparse representation
To efficiently support also sparse datasets, we use a special
data structure for D akin to the CSC (compressed sparse-
column) format, while v and α remain in dense format. D
is represented as an array of structures containing pointers,
one for each column. Each column contains only the nonzero
elements, encoded as (index, value) pairs. B stores its own data
columns {di}i∈P in a similar way, with the columns addition-
ally split into chunks of a fixed length, implemented as linked
lists. This way, efficient movement of columns between A and
B into preallocated space is possible, accommodating possible
large differences in column length. The minimal chunk size
of 32 enables the use of multiple AVX-512 accumulators,
but the optimal size depends on the density of D. We use
locking as described in the previous section. Since the locks
are fixed to equal intervals of the dense vector v, the number
of operations performed under a given lock depends on the
density of the corresponding interval of di and 1024 might
no longer be an efficient choice of the lock size and vary on
each dataset. Initially, B allocates a number of empty chunks
determined by the m densest columns di in D, and places
the chunks on a stack. When A copies data to B, the pointers
to chunks are obtained from the stack and rearranged into
the linked lists, long enough to accommodate the new set of
columns processed by B. Next, the data of {di}i∈P is copied
to the corresponding lists. At the same time, the pointers to
the chunks linked by the lists corresponding to the columns
which are swapped out of B are returned to the stack. With this
representation, we observe fastest runtime when one thread is
used per vector: in most cases, the sparse vectors are shorter
than 130,000 elements.
E. Quantized representation
Stochastic quantization to reduced bit width reduces data
size while still maintaining performance for many iterative
optimization and machine learning algorithms, e.g. [12]. To
investigate and provide potential benefits, we extend HTHC
with support for 4-bit quantization using an adaptation of the
Clover library [13], which provides efficient quantized low-
level routines including the scalar product. We find that 4-bit
precision is enough to represent the data matrix D, without
significantly sacrificing model accuracy. For v and α, low
precision results in excessive error accumulation; thus we leave
those at 32-bit floating point. The overall benefit is reduced
data movement at the overhead of packing and unpacking 4-bit
data for computation. We show runtime results in Section V.
F. Balancing compute resources
A major challenge posed by the implementation of HTHC
is how to balance the computing resources across the differ-
ent levels of parallelism as discussed in Section IV-A. The
configuration of HTHC is parameterized by TA, TB, and VB,
and can be adjusted to the hardware and problem at hand.
We identified two important factors that impact the optimal
setting:
a) Balanced execution speed: If B works significantly
faster than A, the latter executes only few zi updates. As
a consequence most coordinate importance values become
stale, and convergence suffers. This effect has empirically been
investigated in [4], which showed that satisfactory convergence
requires about 15% or more of the zi being updated in each
epoch. We will discuss this further in Section V. On the
other hand, if B is too slow, the runtime suffers. Hence, the
efficiency of the implementation is a crucial factor that impacts
the best configuration.
b) Cache coherence: The parallelization of the gap mem-
ory updates on A across a large number of threads can lead to
DRAM bandwidth saturation. Additionally, more threads mean
higher traffic on the mesh, which can impact the execution
speed of B. For fast convergence, the threads must be assigned
so that A performs a sufficient fraction of zi updates in each
epoch. Our results will confirm r˜ = 15% as a safe choice.
c) Performance model: Let us consider dense data. Re-
call that we operate on the data matrix D ∈ Rd×n, where each
of the n coordinates corresponds to a column represented by
vector di of length d, and that B processes m coordinates per
epoch. Let tI,d(. . . ) denote the time of a single coordinate
05
10
15
20
25
30
35
40
0M 1M 2M 3M 4M 5M
perf.
vector size
1
24
12
8
4
2
16
20
32
Fig. 2: Performance (in flops/cycle) of synthetic A operations.
Different labels represent different values of TA.
update on task I ∈ {A,B} with vector length d. This function
is not trivial to derive, due to relatively poor scalability of
the operations used and the dependence on memory and
synchronization speed. Thus, we precompute the values for
different thread setups and d during installation and store them
in a table. Using this table, we then use the following model
to obtain the thread counts:
min
m,TA,TB,VB
m · tB,d(TB, VB) s.t. m · tB,d(TB, VB)
tA,d(TA)
≥ r˜ · n
(5)
V. EXPERIMENTAL RESULTS
We perform two sets of experiments. In the first set, we pro-
file HTHC on dense synthetic data with the aim to understand
and illustrate how the different implementation parameters
impact its performance. In the second set, we benchmark
HTHC on KNL on real-world data from small to very large.
We compare against a number of more straightforward variants
of implementing the same algorithm including standard C++
code with OpenMP directives, and against prior software
where available.
All experiments are run on a KNL in flat mode as described
in Section II-C. We compile our code with the Intel Compiler
Collection and flags -std=c++11 -pthread -lmemkind -lnuma
-O2 -xCOMMON-AVX512 -qopenmp. In all experiments, we
use at most one thread per core and single precision.
A. Algorithm profiling
To simulate different values of tI,d for different vector size
d (Section IV-F), we imitate the expensive operations of the
tasks A and B on dense synthetic data. The code measures
the overall time and performance for different vector sizes
and thread numbers. The operations involve the data matrix
D of size n× d and the shared vector v, with threading and
synchronization implemented as described in Section IV. In
the following we will illustrate results for varying data size
(n = 600 and d ranging from 10,000 to 5,000,000).
To analyze the impact of the parameter TA on the per-
formance of task A, we allocate both data structures to
DRAM and measure performance for TA ranging from 1
to 72. The results are presented in Fig. 2. We observe that
above 20 parallel updates, the performance does not increase
TABLE I: Data sets used in the experiments
Dataset Samples Features Representation Approx. Size
Epsilon [14] 400,000 2,000 Dense 3.2 GB
DvsC [15] 40,002 200,704 Dense 32.1 GB
News20 [16] 19,996 1,355,191 Sparse 0.07 GB
Criteo [17] 45,840,617 1,000,000 Sparse 14.4 GB
significantly and above 24 it even begins to decrease and
fluctuate due to the saturation of DRAM bandwidth. For this
reason, we use at most 24 threads for A.
To analyze the impact of the parameter VB and TB on the
performance of task B, we allocate D and v to MCDRAM.
Fig. 3 illustrates the impact of VB and shows results for TB =
{1, 4, 8, 16}. The single noisy data points in plots drawn for
larger numbers of threads are caused by background processes
which stall the execution of the program on particular cores.
We note that below d = 130, 000 it is best to use one thread
per vector, independent of the number of parallel updates. For
larger vectors, the best strategy is to use as many threads
per vector as possible. We observe that for the vector lengths
considered, higher performance is obtained with more parallel
updates rather than more threads per vector.
Fig. 4 shows the speedup of isolated B runs with different
values of TB over a run with TB = 1. For each value of TB,
we plot results for the runs with the best corresponding VB.
We observe that the algorithm used by B does not scale well.
This is due to many synchronization points during updates.
Profiling with Intel VTune shows that while the bandwidth
of L2 caches is a bottleneck on each tile, the saturation of
MCDRAM bandwidth remains low. For this reason, we benefit
from the flat mode, since it keeps MCDRAM as a separate
allocation space. The raw update speed of B, contrary to the
convergence of the complete scheme, is not affected by too
many parallel updates of v. In practice, the optimal value for
TB is rarely the maximum, as we will see in the following
experiments.
B. Performance evaluation
The second series of experiments compares the performance
of HTHC to several reference schemes of the two selected
linear models across three data sets of different size. We
consider Lasso and SVM on the two dense and two sparse
data sets in Table I. Dogs vs. Cats (abbrieviated as DvsC in
our tables) features were extracted as in [15] and the number
of samples was doubled. The same pre-processing was used
in [4]. The regularization parameter λ was obtained to provide
a support size of 12% for Lasso on Epsilon and Dogs vs. Cats,
and using cross validation in all other cases.
1) Comparison to our baselines: In the following we will
denote HTHC as A + B emphasizing that it runs two tasks,
A and B. As detailed in Section IV-A1, HTHC allocates the
data for A to DRAM and the data for B to MCDRAM.
For each experiment, except those on the Criteo dataset, we
used exhaustive search to find the best parameter settings, i.e.,
percentage of data updated by B per epoch %B, and the thread
010
20
30
40
50
60
70
0M 1M 2M 3M 4M 5M
perf.
vector size
1
36
16
8
4
2
24
32
(a) TB = 1
0
10
20
30
40
50
60
70
0M 1M 2M 3M 4M 5M
perf.
vector size
1
12
8
4
2
16
18
(b) TB = 4
0
10
20
30
40
50
60
70
0M 1M 2M 3M 4M 5M
perf.
vector size
1
9
8
4
2
(c) TB = 8
0
10
20
30
40
50
60
70
0M 1M 2M 3M 4M 5M
perf.
vector size
1
4
3
2
(d) TB = 16
Fig. 3: Performance (in flops/cycle) of operations of task B for different numbers TB of parallel updates. Different curves
represent different values of VB.
TABLE II: Best parameters found for Lasso.
Data set λ settings for A+ B settings for ST
%B TA TB VB Ttotal TB VB Ttotal
Epsilon 3e-4 8% 12 8 6 60 8 9 72
DvsC 2.5e-3 2% 16 14 1 30 20 1 20
News20 1e-4 2% 24 12 1 36 56 1 56
Criteo 1e-6 0.1% 8 64 1 72 72 1 72
TABLE III: Best parameters found for SVM.
Data set λ settings for A+ B settings for ST
%B TA TB VB Ttotal TB VB Ttotal
Epsilon 1e-4 4% 16 2 1 18 2 1 2
DvsC 1e-4 7% 8 6 10 68 36 2 72
News20 1e-5 49% 12 56 1 72 72 1 72
Criteo 1e-6 1% 4 68 1 72 72 1 72
settings TA, TB, VB described in Section IV. The obtained
parameters presented in Tables II, III roughly correspond to
the analysis in Section IV-F. We compare HTHC against four
reference implementations:
1) ST (single task): We consider a parallel, but homogeneous
single task implementation, which allocates the data ma-
trix D to DRAM and the remaining data to MCDRAM.
It performs randomized asynchronous SCD. We used the
same low-level optimizations in ST than in task B of
HTHC but without duality-gap-based coordinate selec-
tion. Instead, in each epoch we update v,α (allocated
to MCDRAM) for all coordinates of D. Again, we run
a search for the best parameters. These are shown in
Tables II and III.
2) ST (A+B): Like ST but run with the best setting of TB
and VB for A+ B.
3) OMP: A straightforward implementation of A+B: stan-
dard looped C code using the OpenMP directives simd
reduction and parallel for for parallelization
with the thread counts TA, TB and VB. To synchronize
the updates of v, we use directive atomic.
4) OMP WILD is as OMP, but without the atomic direc-
tive.
0
2
4
6
8
10
12
0M 1M 2M 3M 4M 5M
speed
up
vector size
2
24
16
8
4
36
Fig. 4: Speedup of runs with different number TB of parallel
updates over runs with a single update on B.
We perform OMP runs only for the dense representations. For
the large Criteo dataset, we consider only A+ B and ST due
to the long time to run all experiments.
Fig. 5 shows the results for Lasso and SVM for each data
set. Each plot shows the precision of the algorithm versus the
running time. For Lasso, we measure suboptimality, for Lasso
and SVM we show the duality gap.1 Each algorithm is run
until the duality gap reaches a parametrized threshold value
or until timeout.
First, we discuss the A + B, ST, and ST (A + B). For all
Lasso runs, we observe a speedup varying from about 5× for
Epsilon to about 9× for News20 compared to the best ST run,
depending on the desired precision. As expected, ST (A+B)
is never better than ST since the latter uses the best parameters
found during the search. The results for suboptimality are
consistent with those for the duality gaps.
For the SVM runs, we achieve 3.5× speedup for Dogs vs.
Cats and competitive performance for Epsilon and News20.
For Criteo we observe that the ST implementations beats
A+B. This is mostly due to skipping the update v = v+di×δ
when δ = 0. This leads to selection of relevant data based on
the result of 〈v,di〉, and avoids expensive locking at the same
time: thus, in some cases, ST drops enough operations to beat
the execution time and the overhead of A+ B.
Next we discuss the OpenMP runs. For OMP, as expected,
the atomic operations severely impact performance and thus
1To compute the duality gap for Lasso we use the Lipschitzing trick
from [18].
Lasso, Epsilon Lasso, Epsilon SVM, Epsilon
1E-05
1E-04
1E-03
1E-02
1E-01
1E+00
0 1 2 3 4
sub
opt.
time [s]
OMP WILD
1E-05
1E-04
1E-03
1E-02
1E-01
1E+00
1E+01
1E+02
1E+03
0 3 6 9 12
gap
time [s]
OMP WILD
1E-05
1E-04
1E-03
1E-02
1E-01
1E+00
1E+01
0 5 10 15
gap
time [s]
OMP
OMP WILD
Lasso, Dogs vs. Cats Lasso, Dogs vs. Cats SVM, Dogs vs. Cats
1E-05
1E-04
1E-03
1E-02
1E-01
1E+00
0 50 100 150 200
sub
opt.
time [s]
OMP WILD
1E-03
1E-02
1E-01
1E+00
1E+01
1E+02
1E+03
1E+04
1E+05
1E+06
1E+07
0 100 200 300 400
gap
time [s]
1E-05
1E-04
1E-03
1E-02
1E-01
1E+00
0 50 100 150 200 250
gap
time [s]
OMP WILD
Lasso, News20 Lasso, News20 SVM, News20
1E-05
1E-04
1E-03
1E-02
1E-01
1E+00
0 2 4 6 8
sub
opt.
time [s]
1E-04
1E-03
1E-02
1E-01
1E+00
1E+01
1E+02
1E+03
0 5 10 15 20 25 30 35
gap
time [s]
1E-05
1E-04
1E-03
1E-02
1E-01
1E+00
0 1 2 3
gap
time [s]
Lasso, Criteo Lasso, Criteo SVM, Criteo
1E-05
1E-04
1E-03
1E-02
1E-01
1E+00
0 10000 20000 30000 40000
sub
opt.
time [s]
1E+00
1E+01
1E+02
1E+03
1E+04
1E+05
0 10000 20000 30000 40000
gap
time [s]
1E-05
1E-04
1E-03
1E-02
1E-01
1E+00
0 400 800 1200 1600
gap
time [s]
Fig. 5: Convergence for Epsilon, Dogs vs. Cats, News20 and Criteo.
OMP WILD is much faster than OMP. While OMP WILD is
also faster than the standard HTHC implementations, it does
not guarantee the primal-dual relationship between w and α
and thus does not converge to the exact minimizer; hence the
plateau in the figures presenting suboptimality. The duality
gap computation gapi(αi; wˆ) is based on vˆ 6= Dα, and thus
do not correspond to the true values: therefore, the gap of
OMP WILD eventually becomes smaller than suboptimality.
The OMP run fails to converge on the Dogs vs. Cats dataset
with the used parameters.
C. Comparison against other parallel CD implementations
The work [4] implements a similar scheme for parallelizing
SCD on a heterogeneous platform: an 8-core Intel Xeon E5
x86 CPU with NVIDIA Quadro M4000 GPU accelerator (we
note that this is a relatively old GPU generation: the newer
accelerators would give better results). It provides results for
Dogs vs. Cats with B updates set to 25% (the largest size that
fits into GPU RAM): a suboptimality of 10−5 is reached in
TABLE IV: Comparison of A+B and ST against PASSCoDe
(no support for Lasso) for SVM.
Data set Accuracy A+ B ST PASSCoDe- PASSCoDe-
atomic wild
Epsilon 85% 0.35 s 1.11 s 0.70 s 0.64 s
DvsC 95% 0.51 s 0.69 s 2.69 s 1.66 s
News20 99% 0.14 s 0.06 s 0.02 s 0.01 s
TABLE V: Comparison of A + B and ST against Vowpal
Wabbit for Lasso.
Data set Squared error A+ B ST Vowpal Wabbit
Epsilon 0.47 0.56 s 0.62 s 12.19 s
DvsC 0.15 5.91 s 23.37 s 47.29 s
News20 0.32 0.94 s 0.76 s 0.02 s
40 seconds for Lasso and a duality gap of 10−5 is reached in
about 100 seconds for SVM. With the same percentage of B
updates, HTHC needs 29 and 84 seconds, respectively. With
our best setting (Fig. 5(d)–5(f)) this is reduced to 20 and 41
seconds, respectively. In summary, on this data, our solution
on the standalone KNL is competitive with a state-of-the-art
solution using a GPU accelerator with many more cores. We
also show that its performance can be greatly improved with
proper number of updates on B.
Additionally, we compare the SVM runs of HTHC (A+B)
and our parallel baseline (ST) against PASSCoDe [11], a
state-of-the-art parallel CD algorithm, which, however does
not support Lasso. We compare SVM against the variant
with atomic lock on v (PASSCoDe-atomic) and a lock-free
implementation (PASSCoDe-wild) which is faster, but does not
maintain relationship between model parameters as discussed
in Section IV-C. The results are presented in Table IV. On
Epsilon, the time to reach 85% accuracy with 2 threads (the
same as TB for ST) is 8.6 s for PASSCoDe-atomic and 3.21 s
for PASSCoDe-wild, but these times decrease to 0.70 s with 24
threads and 0.64 s with 12 threads respectively. For Dogs vs.
Cats, greatly increasing or decreasing the TB compared to ST
did not improve the result. For Dogs vs. Cats, we are 2.4–5×
faster, depending on the versions we compare. For Epsilon, we
are roughly 2× faster, but exploiting the the HTHC design is
required to prevent slowdown. On the other hand, PASSCoDe
works ca. 7× faster for the News20 dataset. We attribute this to
our locking scheme for v update which is beneficial for dense
data, but can be wasteful for sparse representations. Disabling
the locks brings the ST execution time down to 0.02 s.
We also compare the Lasso runs against Vowpal Wabbit
(VW) [19], which is considered a state-of-the-art software.
Since this library does not implement coordinate descent, we
opt for stochastic gradient descent. We run the computation
on previously cached data. We find that too many nodes
cause divergence for dense datasets and opt for 10 nodes as
a safe value. For News20, we use a single node. We compare
the average squared error of HTHC against the progressive
validation error of VW. The results are presented in Table V.
TABLE VI: Comparison of 32-bit to mixed 32/4-bit.
Dataset Model Target gap 32-bit 32/4-bit
Epsilon Lasso 10−5 1.6 s 1.0 s
Epsilon SVM 10−5 5.5 s 5.8 s
DvsC Lasso 10−3 55.5 s 32.4 s
DvsC SVM 10−5 38.2 s 51.6 s
Again we observe that while we perform well for dense data,
the training on sparse data is slow. Also, the runs on our code
and Vowpal Wabbit’s SGD result in two different scores for
News20.
D. Experiments on sensitivity
During the search for A+B, four parameters were consid-
ered: size of B, TA, TB and VB. Our goal was not only to
find the best parameters, but also identify parameters giving
a close-to-best solution. Fig. 6 presents parameters which
provided no more than overall 110% convergence time of the
best solution found. The overall convergence time depends
on the number of epochs which varies from run to run for
the same parameters: therefore, we consider all the presented
parameters capable of obtaining the minimum runtime. The
plots present four dimensions: the axes correspond to TB and
VB while different markers correspond to different %B. The
labels next to the markers correspond to TA. The color of each
label corresponds to its marker. Multiple values per label are
possible. To save time during the search, we use a step of
8 and 4 for TA on Dogs vs. Cats and Epsilon respectively.
Additionally, we use a step of 2 for TB on both datasets. We
also note that while Lasso on Epsilon converges fast for TB
greater than 8, the rate of diverging runs is too high to consider
it for practical application.
To examine how the number of zi updates per epoch on
A affects the convergence, we run tests in which we always
let A perform a fixed number of updates. We use the best
parameters found for different datasets and models, but we set
TA = 10. We present example results in Fig. 7. We observe
that relatively few updates are needed for the best execution
time : we observe it for 10% on the both datasets. While these
runs need more epochs to converge, the epochs are executed
fast enough to provide optimal overall convergence speed.
E. Evaluation of quantized representation
We run experiments on the dense datasets using the quan-
tized 4-bit representation of the data matrix D with the
modified Clover library. Table VI shows the comparison of
the fastest A + B runs using the mixed 32/4-bit arithmetic
to the fastest 32-bit A + B runs. We can observe that while
we reduce the data size, the computation times do not deviate
significantly from those obtained with 32-bit representation.
VI. RELATED WORK
Variants of stochastic coordinate descent [20] have become
the state-of-the-art methods for training GLMs on parallel
and distributed machine learning systems. Parallel coordinate
24
20, 24
20
12. 24,
12, 16, 20
16
20, 7
20, 8
12, 10
20, 4
12, 20
3
4
5
6
7
8
9
10
11
4 6 8 10 12
VB
TB
7%
8%
9%
10%
(a) Lasso on Epsilon
8, 16 8, 16
8 8 8
8 8
0
1
2
10 12 14 16 18 20
VB
TB
2.0%
2.5%
3.0%
3.5%
(b) Lasso on Dogs vs. Cats
8, 16, 24
8
8
8
8
6
8
10
12
5 6 7
VB
TB
7%
8%
9%
10%
(c) SVM on Dogs vs. Cats
Fig. 6: Parameter combinations (TB, VB) providing fast convergence (within 110% time of the best found).
0
1
2
3
4
5
6
0
50
100
150
200
250
300
350
0 10 20 30 40 50 60 70 80 90 100
speedup 
(times)
epochs
A updates (%)
0
1
2
3
4
5
6
7
200
250
300
350
400
0 10 20 30 40 50 60 70 80 90 100
speedup 
(times)
epochs
A updates (%)
Fig. 7: Sensitivity to the number of A updates per epoch for (left) Lasso on Epsilon and (right) SVM on Dogs vs. Cats.
descent (CD) has a long history, see e.g. [21]. Recent research
has contributed to asynchronous variants such as [22] who
proposed AsySCD, the first asynchronous SCD algorithm, and
[11] who proposed the more practical PaSSCoDe algorithm
which was the first to keep the shared vector v in memory.
There are only few works that have studied CD on non-
uniform memory systems (e.g. memory and disk). The ap-
proach most related to ours is [23] where the authors pro-
posed a strategy to keep informative samples in memory.
However, [23] is specific to the SVM problem an unable to
generalize to the broader class of GLMs. In [24] a more
general setting was considered, but the proposed random
(block) coordinate selection scheme is unable to benefit from
non-uniformity in the training data. In a single machine setting,
various schemes for selecting the relevant coordinates for CD
have been studied, including adaptive probabilities, e.g. [25]
or fixed importance sampling [5]. The selection of relevant
coordinates can be based on the steepest gradient, e.g. [26],
Lipschitz constant of the gradient [27], nearest neighbor [28]
or duality gap based measures [4]. In this work, we build on
the latter, but any adaptive selection scheme could be adopted.
Manycore machines, including KNL, are widely used
for deep learning, as standalone devices or within clusters,
e.g. [29], [30]. SVM training on multicore and manycore
architectures was proposed by You et al. [31]. The authors
provide evaluation for Knights Corner (KNC) and Ivy Bridge,
proving them to be competitive with GPUs. The LIBSVM
library [32] is implemented for both GPU [33] and KNC [34].
All SVM implementations use the sequential minimization
algorithm [35]. The library and its implementations are more
fitted for kernel SVM than the linear version. For training
on large-scale linear models, a multi-core extension of LI-
BLINEAR [36] was proposed by Chiang et al. [37]. This
library is tailored mainly for sparse data formats used e.g. in
text classification. While [11], [37] do not perform coordinate
selection, they use techniques like shrinking benefitting from
increasing sparsity of the output models. Rendle et al. [38]
introduced coordinate descent for sparse data on distributed
systems, achieving almost linear scalability: their approach can
be applied to multi- and manycore. The existing open-source
libraries support mainly sparse data and rarely implement CD
models other than SVM or logistic regression.
VII. CONCLUSIONS
We introduced HTHC for training general linear models
on standalone manycores including a complete, architecture-
cognizant implementation. We support dense, sparse and quan-
tized 4-bit data representations. We demonstrated that HTHC
provides a significant reduction of training time as opposed
to a straightforward parallel implementation of coordinate
descent. In our experiments, the speedup varies from 5× to
more than 10× depending on the data set and the stopping
criterion. We also showed that our implementation for dense
datasets is competitive against the state-of-the-art libraries and
a CPU-GPU code. An advantage of HTHC over the CPU-GPU
heterogeneous learning schemes is the ability of balancing dis-
tribution of machine resources such as memory and CPU cores
between different tasks, an approach inherently impossible on
heterogeneous platforms. To the best of our knowledge, this
is the first scheme with major heterogeneous tasks running in
parallel proposed in the field of manycore machine learning.
The inherent adaptivity of HTHC should enable porting it to
other existing and future standalone manycore platforms.
REFERENCES
[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin,
S. Ghemawat, G. Irving, M. Isard et al., “TensorFlow: A system for
large-scale machine learning.” in OSDI, vol. 16, 2016, pp. 265–283.
[2] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro,
and E. Shelhamer, “cuDNN: Efficient primitives for deep learning,”
CoRR, vol. abs/1410.0759, 2014.
[3] J. M. Moura, M. Pu¨schel, D. Padua, and J. Dongarra, “Special issue on
program generation, optimization, and platform adaptation,” Proceedings
of the IEEE, vol. 93, no. 2, pp. 211–215, 2005.
[4] C. Du¨nner, T. Parnell, and M. Jaggi, “Efficient use of limited-memory
accelerators for linear learning on heterogeneous systems,” in Advances
in Neural Information Processing Systems, 2017, pp. 4258–4267.
[5] P. Zhao and T. Zhang, “Stochastic optimization with importance sam-
pling for regularized loss minimization,” in Proceedings of the 32nd
International Conference on Machine Learning, ser. Proceedings of
Machine Learning Research, vol. 37, 2015, pp. 1–9.
[6] S. P. Boyd and L. Vandenberghe, Convex optimization. Cambridge
University Press, 2004.
[7] J. McCalpin, “Memory bandwidth and machine balance in high perfor-
mance computers,” pp. 19–25, 12 1995.
[8] S. Shalev-Shwartz and T. Zhang, “Stochastic dual coordinate ascent
methods for regularized loss minimization,” Journal of Machine Learn-
ing Research, vol. 14, no. Feb, pp. 567–599, 2013.
[9] S. J. Wright, “Coordinate descent algorithms,” Mathematical Program-
ming, vol. 151, no. 1, pp. 3–34, Mar. 2015.
[10] F. Franchetti. (2005) Fast barrier for x86 platforms. [Online]. Available:
www.spiral.net/software/barrier.html
[11] C.-J. Hsieh, H.-F. Yu, and I. Dhillon, “PASSCoDe: Parallel ASyn-
chronous Stochastic dual Co-ordinate Descent,” in International Con-
ference on Machine Learning, 2015, pp. 2370–2379.
[12] H. Zhang, J. Li, K. Kara, D. Alistarh, J. Liu, and C. Zhang, “Zipml:
Training linear models with end-to-end low precision, and a little bit of
deep learning,” in Proceedings of the 34th International Conference on
Machine Learning-Volume 70. JMLR. org, 2017, pp. 4035–4043.
[13] A. Stojanov, T. M. Smith, D. Alistarh, and M. Pu¨schel, “Fast quantized
arithmetic on x86: Trading compute for data movement,” in 2018 IEEE
International Workshop on Signal Processing Systems (SiPS). IEEE,
2018, pp. 349–354.
[14] Knowledge 4 All Foundation Ltd. (2008) Large scale
learning challenge. [Online]. Available: www.k4all.org/project/
large-scale-learning-challenge/
[15] C. Heinze, B. McWilliams, and N. Meinshausen, “Dual-loco: Dis-
tributing statistical estimation using random projections,” in Artificial
Intelligence and Statistics, 2016, pp. 875–883.
[16] R.-E. Fan. (2018) LIBSVM data: Classification (binary class). [Online].
Available: www.csie.ntu.edu.tw/∼cjlin/libsvmtools/datasets/binary.html
[17] Criteo Labs. (2014) Kaggle display advertising chal-
lenge. [Online]. Available: www.labs.criteo.com/2014/02/
kaggle-display-advertising-challenge-dataset/
[18] C. Du¨nner, S. Forte, M. Taka´cˇ, and M. Jaggi, “Primal-dual rates
and certificates,” in Proceedings of the 33rd International Conference
on International Conference on Machine Learning - Volume 48, ser.
ICML’16, 2016, pp. 783–792.
[19] J. Langford, L. Li, and A. Strehl, “Vowpal Wabbit,” 2011. [Online].
Available: www.github.com/VowpalWabbit/vowpal wabbit
[20] S. J. Wright, “Coordinate descent algorithms,” Mathematical Program-
ming, vol. 151, no. 1, pp. 3–34, 2015.
[21] P. Richta´rik and M. Taka´cˇ, “Parallel coordinate descent methods for big
data optimization,” Mathematical Programming, vol. 156, no. 1-2, pp.
433–484, 2016.
[22] J. Liu, S. J. Wright, C. Re´, V. Bittorf, and S. Sridhar, “An asynchronous
parallel stochastic coordinate descent algorithm,” J. Mach. Learn. Res.,
vol. 16, no. 1, pp. 285–322, Jan. 2015.
[23] K.-W. Chang and D. Roth, “Selective block minimization for faster
convergence of limited memory large-scale linear models,” in Proceed-
ings of the 17th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, 2011, pp. 699–707.
[24] S. Matsushima, S. Vishwanathan, and A. J. Smola, “Linear support
vector machines via dual cached loops,” in Proceedings of the 18th
ACM SIGKDD International Conference on Knowledge Discovery and
Data Mining, 2012, pp. 177–185.
[25] D. Perekrestenko, V. Cevher, and M. Jaggi, “Faster coordinate descent
via adaptive importance sampling,” Proceedings of the 20th Interna-
tional Conference on Artificial Intelligence and Statistics, vol. 54, 2017.
[26] Y. You, X. R. Lian, J. Liu, H.-F. Yu, I. S. Dhillon, J. Demmel,
and C.-J. Hsieh, “Asynchronous parallel greedy coordinate descent,” in
Proceedings of the 30th International Conference on Neural Information
Processing Systems, 2016, pp. 4689–4697.
[27] A. Zhang and Q. Gu, “Accelerated stochastic block coordinate descent
with optimal sampling,” in Proceedings of the 22Nd ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining,
2016, pp. 2035–2044.
[28] I. E.-H. Yen, C.-F. Chang, T.-W. Lin, S.-W. Lin, and S.-D. Lin, “Indexed
block coordinate descent for large-scale linear classification with limited
memory,” in Proceedings of the 19th ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, 2013, pp. 248–
256.
[29] N. Gawande, J. B. Landwehr, J. A. Daily, N. R. Tallent, A. Vishnu,
and D. J. Kerbyson, “Scaling deep learning workloads: NVIDIA DGX-
1/Pascal and Intel Knights Landing,” 2017 IEEE International Parallel
and Distributed Processing Symposium Workshops (IPDPSW), pp. 399–
408, 2017.
[30] Y. You, Z. Zhang, C.-J. Hsieh, and J. Demmel, “100-epoch ImageNet
training with AlexNet in 24 minutes,” CoRR, vol. abs/1709.05011, 2017.
[31] Y. You, S. L. Song, H. Fu, A. Marquez, M. M. Dehnavi, K. Barker,
K. W. Cameron, A. P. Randles, and G. Yang, “MIC-SVM: Designing a
highly efficient support vector machine for advanced modern multi-core
and many-core architectures,” in 2014 IEEE 28th International Parallel
and Distributed Processing Symposium, 2014, pp. 809–818.
[32] C.-C. Chang and C.-J. Lin, “LIBSVM: A library for support vector
machines,” ACM Trans. Intell. Syst. Technol., vol. 2, no. 3, pp. 27:1–
27:27, May 2011.
[33] A. Athanasopoulos, A. Dimou, V. Mezaris, and I. Kompatsiaris, “GPU
acceleration for support vector machines,” in Procs. 12th Inter. Workshop
on Image Analysis for Multimedia Interactive Services (WIAMIS 2011),
Delft, Netherlands, 2011.
[34] R. Massobrio, S. Nesmachnow, and B. Dorronsoro, “Support vector
machine acceleration for Intel Xeon Phi manycore processors,” in Latin
American High Performance Computing Conference. Springer, 2017,
pp. 277–290.
[35] J. Platt, “Sequential minimal optimization: A fast algorithm for training
support vector machines,” Tech. Rep., April 1998.
[36] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin,
“Liblinear: A library for large linear classification,” J. Mach. Learn.
Res., vol. 9, pp. 1871–1874, Jun. 2008.
[37] W.-L. Chiang, M.-C. Lee, and C.-J. Lin, “Parallel dual coordinate
descent method for large-scale linear classification in multi-core en-
vironments,” in Proceedings of the 22Nd ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, 2016, pp. 1485–
1494.
[38] S. Rendle, D. Fetterly, E. J. Shekita, and B.-y. Su, “Robust large-
scale machine learning in the cloud,” in Proceedings of the 22Nd ACM
SIGKDD International Conference on Knowledge Discovery and Data
Mining, 2016, pp. 1125–1134.
