L3 Fusion: Fast Transformed Convolutions on CPUs by Gelashvili, Rati et al.
L3 Fusion: Fast Transformed Convolutions on CPUs
Rati Gelashvili ∗ Nir Shavit ∗ † Aleksandar Zlateski ‡ §
Abstract
Fast convolutions via transforms, either Winograd or FFT, have emerged as a preferred way of performing
the computation of convolutional layers, as they greatly reduce the number of required operations. Recent
work shows that, for many layer structures, a well-designed implementation of fast convolutions can achieve
high utilization on modern CPUs, significantly reducing the compute time. However, the generous amount of shared L3
cache present on modern CPUs is often neglected, and the algorithms are optimized solely for the private
L2 cache. In this paper we propose an efficient ‘L3 Fusion’ algorithm that is specifically designed for CPUs
with a significant amount of shared L3 cache. Using the hierarchical roofline model, we show that in many
cases, especially for layers with fewer channels, the ‘L3 fused’ approach can greatly outperform the standard
3-stage approach provided by big vendors such as Intel. We validate our theoretical findings by benchmarking our
‘L3 fused’ implementation against the publicly available state of the art.
1 Introduction
While Convolutional Neural Networks (ConvNets, or CNNs) were proposed, in their current form,
at the turn of the century [12], it took more than a decade for them to gain significant traction in the
scientific community. The superior performance of AlexNet [10] in the field of image recognition, together
with advances in the computational power of GPUs, triggered the current popularity of ConvNets. A whole new
field of ‘Deep Learning’ has emerged since, studying ConvNets and their applications in various domains.
ConvNets were initially mostly used for research within the academic community and industry research
and were running on expensive GPU clusters; however, they have since spread to nearly all industries and
are running ‘in production’ on a wide variety of devices, ranging from traditional GPUs and CPUs all the way to
mobile processors and specialized hardware. As they are computationally very expensive, it is very important
to have optimized algorithms and implementations across a large variety of ConvNet architectures and devices
in order to achieve satisfactory speeds and save energy.
In this paper we focus on modern CPUs, both server and desktop grade. Most of the current CPU-based
algorithms emerged with the introduction of the Intel Xeon Phi co-processor, which had many integrated x86
cores that provided computational power competitive with GPUs. It also provided on-chip high-bandwidth
memory (MCDRAM), but had a limited amount of L2 cache, and completely lacked L3 cache. In the meantime
both server and desktop grade CPUs have reached, and even exceeded, the computational power of the Intel
Xeon Phi processors (which quietly got discontinued [2]); they lack the high-bandwidth on-chip
memory, but instead provide a generous amount of L3 cache.
The most efficient available implementations use ‘fast convolution algorithms‘ – that is, they perform
convolutions through a basis transform, either FFT or Winograd. Such algorithms provide a great reduction
in the number of required floating point operations (FLOPs), but require much more sophisticated design
and implementation in order to effectively utilize the underlying hardware.
Nearly all current implementations [18, 1, 4, 3] of ‘fast convolutions’ do not take the L3 cache into account.
While they can still get high utilization in many scenarios on modern CPUs with a decent amount of L3 cache,
there are many cases where an L3-cache aware algorithm can significantly boost performance. This is
especially true for layers of the most popular ConvNets, such as ResNet [7], or VGG [15].
∗NeuralMagic Inc, Somerville, MA
†MIT, Cambridge, MA
‡Facebook AI Research, New York, NY
§Work was done when the author was with NeuralMagic
arXiv:1912.02165v1 [cs.DC] 4 Dec 2019
In this paper, we propose an L3-cache aware algorithm for the most computationally intensive layers, the
convolutional layers. While the overall idea is quite intuitive (keeping data reused among cores in L3 cache), the
novelty lies within the details of the algorithm.
We analyze our L3-cache aware algorithm using the roofline model in order to predict in which scenarios it
is expected to outperform currently available implementations. We benchmarked our L3 aware implementation
against the state of the art publicly available implementations, as well as our own baseline implementation.
The results conform to our expectations. As expected, the fused implementation is sometimes slower and
sometimes faster than the state of the art, while being consistently faster, by 50% on average, on layers with
a lower number of channels.
2 Background
2.1 Convolutional Layers through Transforms
Here we’ll focus on 2D ConvNets, and follow the notation introduced by [16]. The input and the output
of a convolutional layer are 4-dimensional tensors with shapes of B × C × D × W and B × C′ × D′ × W′,
respectively. Here B is the batch size, C (C′) is the number of input (output) channels, and D
(D′) and W (W′) are the spatial dimensions of each input (output) channel.
We’ll further assume isotropic kernels of size K_D × K_W = K². It should be straightforward for the
reader to generalize our approach to higher dimensional ConvNets and non-isotropic kernels. All the kernels
thus have a 4-dimensional shape of C′ × C × K × K.
We will assume 32-bit floating point numbers, as it is standard for all CPU–based implementations of
fast convolutional algorithms (using transforms). FFT operates on complex numbers, which simply requires
storing a pair of 32-bit floats.
Convolutional layers optionally have padding; it is typically used to make the input and output of the
layer have the same spatial dimensions. Our algorithm can work for any padding, and our implementation uses
implicit padding, which can be more efficient than explicitly padding the data.
A convolutional layer transforms an input tuple of C images into an output tuple of C′ images. A batch
of B inputs, yielding a batch of B outputs, is processed at a time via
$$I'_{b,c'} = \sum_{c=1}^{C} I_{b,c} * W_{c,c'} \qquad (1)$$
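When using ‘fast convolutions’ (Winograd or FFT), the output images are computed via
$$I'_{b,c'} = \sum_{c=1}^{C} \left[ \left(W_{c,c'} \times_{n=1}^{N} G_n\right) \odot \left(I_{b,c} \times_{n=1}^{N} B_n\right) \right] \times_{n=1}^{N} A_n^T = \left[ \sum_{c=1}^{C} \left(W_{c,c'} \times_{n=1}^{N} G_n\right) \odot \left(I_{b,c} \times_{n=1}^{N} B_n\right) \right] \times_{n=1}^{N} A_n^T \qquad (2)$$
Here, $\odot$ represents element-wise multiplication, and the operation $x \times_{n=1}^{N} Y_n$ is short for $x \times_1 Y_1 \times_2 Y_2 \cdots \times_N Y_N$,
where $\times_n$ represents tensor-matrix mode-n multiplication as defined in [9, 6]. $A_n$, $B_n$ and $G_n$ are transform
matrices along dimension n. For the Fourier transform the matrices are complex, and the tensor-matrix multiplication can
be implemented more efficiently using the FFT algorithm.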
For the 2D case, and isotropic transform sizes, the formula above reduces to
$$I'_{b,c'} = A \left[ \sum_{c=1}^{C} \left(G W_{c,c'} G^T\right) \odot \left(B I_{b,c} B^T\right) \right] A^T \qquad (3)$$
which was also proposed in [11].
Here, the choice of transform matrices assumes a particular size of I and I′. The Winograd transform
matrices produce numerically stable results only for relatively small sizes. The convolution of larger images
is performed using the overlap-add method (OLA) [14]. With OLA, the input images are divided into tiles
of size T, with an overlap of K − 1 along each dimension. Considering tiles at the same location from
all the input images, tiles of size T′ = T − K + 1 of the output images are computed using the formula above.
When FFT transforms are used, the transformed images will be conjugate anti-symmetric. This allows
for approximately 2x savings in both storage and compute [18].
Traditionally, the CPU implementations [3, 4, 8, 1] perform this computation in a serialized fashion.
First, all the inputs and kernels are transformed. Then, the element-wise computation is performed using
matrix multiplication; finally, the product is inverse transformed, yielding the result of the convolutional layer.
2.2 Cache hierarchy
In a typical modern CPU there are several levels of cache. On most Intel processors, for example, the
third level of cache, known as the L3 cache, is relatively large and shared among all of the on-chip computing
cores. Other cache levels, such as L1 and L2, are faster and private to a specific core.
Caching happens at a granularity of cache lines (typically 64 bytes) and if the cached data has a bad
alignment in memory, this results in the overhead of unnecessary data being brought to the cache along
with the cache lines. The best (cache-friendly) memory layout is typically storing (and accessing) data
consecutively, in cache line size increments, starting from a beginning of a cache line.
Note that the cache coherence protocol is proprietary to the hardware producer, so we cannot guarantee
the cache behavior. However when a contiguous chunk of data is frequently accessed (of size smaller than the
cache size), we can confidently expect these accesses to be cache hits most of the time for any reasonable
cache coherence protocol.
It is worth noting here that while algorithms executed on modern CPUs often implicitly benefit from the
existence of the L3 cache, algorithms that structure the computation to explicitly benefit from the shared
cache are not as common. This is because, as opposed to the private L2 cache, when multiple processors and the
resulting asynchronous nature of computation are involved, it is harder to reliably expect the desired data to be in
the shared cache. However, in our case, we can structure the transformed convolutions in such a way that all
processors repeatedly (in each computation task determined by the algorithm) access the same data, keeping
it sufficiently ‘hot’ to avoid eviction from the shared cache; in addition, only a limited amount of additional data
is accessed, which further avoids cache pollution.
2.3 Roofline Model
We use the Roofline Performance Model [17] to theoretically analyze our algorithms. This model provides a
simple yet powerful framework for estimating the limit of compute utilization of an algorithm based on
the memory movement and the number of operations performed.
For caches and memory, the compute-to-memory ratio (CMR) is defined as the ratio between the peak
theoretical compute performance (typically FLOPS, floating-point operations per second) and the memory/cache
bandwidth (bytes/second). Similarly, the arithmetic intensity (AI) of an algorithm is the number of compute
operations for each byte transferred to or from memory. When the arithmetic intensity of an algorithm
executed on some architecture is not higher than the compute-to-memory ratio of some memory level on that
architecture, the execution will necessarily be memory bound, i.e. bottlenecked on bringing the data in or
out of that memory level at some point in the execution; and the compute resources will have a utilization
upper bounded by AI/CMR.
Otherwise, if AI is greater than the architecture’s CMR, at a certain level of memory hierarchy, the
execution of the algorithm is compute bound, i.e. never limited by the memory bandwidth at that memory
level.
However, being actually compute bound depends on how the memory accesses are distributed: if the
algorithm performs all memory accesses before performing all computation, the average compute utilization
of the whole algorithm may be high, despite the first stage of its execution being extremely memory bound.
But when an algorithm has a reasonably uniform memory access distribution, it is likely to utilize a
fraction of the CPU’s available FLOPS which is close to its theoretical maximum (i.e. the minimum of compute
utilization among all memory levels).
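As a minimal sketch of the bound described above, the following Python helpers compute the per-level utilization limit min(1, AI/CMR) and take the minimum across memory levels. The function names and the per-level arithmetic intensities in the usage note are illustrative assumptions, not values from the paper.

```python
def utilization_bound(arithmetic_intensity, cmr):
    """Upper bound on compute utilization at one memory level: min(1, AI / CMR)."""
    return min(1.0, arithmetic_intensity / cmr)


def overall_bound(ai_by_level, cmr_by_level):
    """Minimum of the per-level bounds; both arguments are dicts keyed by level name."""
    return min(
        utilization_bound(ai_by_level[level], cmr_by_level[level])
        for level in cmr_by_level
    )
```

For instance, with hypothetical intensities `{"L3": 12, "DRAM": 16}` and CMRs `{"L3": 10, "DRAM": 35}`, the L3 level is compute bound while the main memory level caps utilization at 16/35.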
3 Fast Convolutions
Both Winograd and FFT fast convolution algorithms are parameterized by the tile size T they use. The
tile sizes for Winograd are usually small (4–6), as larger tiles yield numerically unstable results [11, 8].
All FFT tile sizes are generally stable; the sizes are then chosen such that the total number of operations is
minimized, and the compute fits in caches [18].
The output is computed tile-by-tile, where each tile has a shape T′ × T′. The output tiles don’t overlap,
but the pre-image necessary to compute a given output tile consists of C input tiles of shape T × T aligned
across input channels. Moreover, these C input tiles also form the exact pre-image for computing all C′
output tiles aligned with the given output tile across output channels.
To compute a single tile of size T′ = T − K + 1 of each of the C′ output channels the following steps
need to be performed.
1. C input tiles (tiles at the corresponding location in each of the C input channels) are transformed. This
yields C transformed tiles of size T², which are then interpreted as T² vectors of size C.
2. Each vector is multiplied by a corresponding matrix of transformed kernels. There are T² right-hand/kernel
matrices (one for each location in the input tile, corresponding one-to-one to the vectors)
obtained ahead of time by transforming the kernels (see footnote 1).
The output is T² vectors of size C′ that can be interpreted as C′ tiles of size T × T.
3. Each of these C′ tiles must finally be inverse transformed, resulting in C′ output tiles of shape T′ × T′.
These are the outputs of the convolutional layer at the corresponding locations.
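The three steps above can be sketched in pure Python for one tile location. This is an illustration only: it assumes the widely used Winograd F(2×2, 3×3) transform matrices (T = 4, K = 3, T′ = 2), which are not given in this paper, it computes the correlation form of convolution usual in ConvNets, and all function names are hypothetical.

```python
# Assumed Winograd F(2x2, 3x3) matrices: input (BT), kernel (G), output (AT) transforms.
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
G = [[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]]
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]


def matmul(X, Y):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]


def sandwich(M, X):
    """Return M X M^T."""
    Mt = [list(col) for col in zip(*M)]
    return matmul(matmul(M, X), Mt)


def transform_kernels(kernels):
    """Done ahead of time; kernels[c_out][c_in] is a 3x3 kernel."""
    return [[sandwich(G, k) for k in per_out] for per_out in kernels]


def fused_tile(input_tiles, transformed_kernels):
    """Compute C' output tiles (2x2) from C input tiles (4x4) at one location."""
    T, C = 4, len(input_tiles)
    U = [sandwich(BT, d) for d in input_tiles]           # step 1: forward transform
    out = []
    for V in transformed_kernels:                        # one output channel at a time
        M = [[sum(U[c][i][j] * V[c][i][j] for c in range(C))
              for j in range(T)] for i in range(T)]      # step 2: per-location products
        out.append(sandwich(AT, M))                      # step 3: inverse transform
    return out


def direct_tile(input_tiles, kernels):
    """Reference: direct sliding-window computation of the same 2x2 outputs."""
    C = len(input_tiles)
    return [[[sum(input_tiles[c][y + u][x + v] * per_out[c][u][v]
                  for c in range(C) for u in range(3) for v in range(3))
              for x in range(2)] for y in range(2)] for per_out in kernels]
```

Step 2 is written here as a sum over channels for each tile location, which is the same arithmetic as multiplying a vector of size C by a C × C′ matrix at each of the T² locations.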
The above computation works with all C input tiles and C ′ output tiles at once. This is the standard
way of operation of transformed convolutions. We discuss in Section 7 whether in some cases, it could be
beneficial to break the convolution into multiple convolutions operating on subsets of channels.
The whole convolutional layer is computed when all output tiles are computed. For this, the above 3-step
computation needs to happen N_tile times, where N_tile = B · ⌈(D − K + 1)/T⌉ · ⌈(W − K + 1)/T⌉ if we assume
no padding.
As noted above, the state-of-the-art implementations perform the computation in three fully separate
stages. They perform the first step for all the tiles of all batches; this creates T² matrices of size N_tile × C.
Then the second steps are performed by multiplying these T² matrices with the T² kernel (right-hand side)
matrices, yielding T² matrices of size N_tile × C′. Finally, all output tiles are computed by transforming the
result.
The second stage operates on T² matrix pairs of sizes N_tile × C and C × C′, which are typically large and
don’t fit in cache. Thus, one matrix multiplication is performed at a time, for a total of T² times. Note that it
is not possible to generate only a single matrix at a time, as each transform generates an element from T²
distinct matrices. Similarly, the output transforms cannot be performed until the elements at the same
location of all T² matrices have been computed.
On modern CPUs with large CMRs, stages 1 and 3 are memory bound as they perform relatively
few operations per transferred byte, while only the second stage could be compute bound.
This design is reasonable for architectures with limited L3 cache (or ones with high bandwidth
memory), as the T² kernel matrices cannot be stored in cache. However, in the case when the T² kernel
(right-hand-side) matrices can fit in shared cache, we can consider performing steps 1–3 for only a small subset
of tiles. This will allow for larger overall arithmetic intensity, and is the basis of our approach described in
the next section.
1 This is a typical assumption: when performing inference on a trained network, kernels don’t change; they can be transformed
and stored in a way that is most suitable for the convolution implementation being used. Even for training, there are approaches
that store and update transformed kernels, e.g. [13].
4 Algorithm Design
Our algorithm is based on a simple, yet crucial observation that the same T² right-hand (kernel) matrices are
used in the second step of the transformed computation, regardless of the output tile. While these matrices
may not fit in the smaller L2 private cache of a processor, typically they can comfortably fit (i.e. occupy at
most a constant fraction of the total cache size, leaving enough space for other intermediate data used by the
processors while executing the algorithm) in the larger shared L3 cache. Fetching memory from the L3 cache
is faster than from the main memory, opening an avenue to structure the computation differently from the
existing implementations.
Unlike the state of the art approaches that perform all N_tile instances of the i-th step in a single i-th
stage, our algorithm groups the steps for R tiles together into a task, creating a total of N_task = ⌈N_tile/R⌉ tasks.
Each task consists of transforming R groups of C input tiles (R instances of step 1) aligned across the input
channels, resulting in T² left-hand matrices of shape R × C, then multiplying these matrices by the right-hand
matrices and inverse transforming the results.
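To make the task count concrete, here is a small sketch that evaluates the tile count formula exactly as stated earlier (no padding) and the resulting number of tasks. The function names and the example layer are illustrative assumptions.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def num_tiles(B, D, W, K, T):
    """N_tile = B * ceil((D - K + 1) / T) * ceil((W - K + 1) / T), assuming no padding."""
    return B * ceil_div(D - K + 1, T) * ceil_div(W - K + 1, T)


def num_tasks(n_tile, R):
    """N_task = ceil(N_tile / R): each task handles up to R tile locations."""
    return ceil_div(n_tile, R)
```

For a hypothetical unpadded layer with B = 64, D = W = 56, K = 3 and T = 7, this gives 64 · 8 · 8 = 4096 tile locations, and grouping R = 24 of them per task yields 171 tasks, the last one being only partially filled.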
Notice that the ⌈N_tile/R⌉ tasks can be computed independently of each other, in any order or even in parallel.
A number of load balancing schemes can be used to schedule/execute the tasks on available processors. In
any case, every processor fully accesses each right-hand matrix while computing each task, ensuring that
these matrices remain extremely “hot” for caching purposes. Moreover, the right-hand matrices are only read
in each task, but never updated (i.e. we don’t have to worry about bus traffic for cache coherency). We store
the right-hand matrices in a cache-friendly memory layout. Hence, assuming a reasonable cache coherency
protocol (and that the size of the matrices is at most a constant fraction of the available L3 cache), we can
indeed expect right-hand matrices to be accessed from the shared L3 cache as opposed to from the main
memory.
4.1 Parameters and Constraints
4.1.1 Number of Channels and the Tile Size
When choosing the tile size T, we have to abide by constraints arising from the nature of the transformed
computation itself [8, 18]. Making the tile size larger decreases the total area of overlap between input tiles and
increases efficiency, but for larger tiles increasing the size further brings diminishing returns. FFT and
Winograd also have specific constraints (see footnote 2). Based on the existing work, we can say that a common T that works
well for Winograd is in the [5, 8] range (7 or 8 being better) and T of at least 16 works well for FFT [8, 18].
The L3 cache of modern CPUs has a typical size of 1-2 MB per core (e.g. on Intel processors 1.375 MB
for SkylakeX and approximately 2.5 MB on desktop Haswell architectures). The right-hand matrices require
CC′T² numbers to be stored; this adds up to 4CC′T² bytes for both Winograd and FFT (see footnote 3). FFT with T = 16 would
require 1 MB total for 32 input and output channels and 4 MB total for 64 input and output channels.
Winograd with T = 8 would require 4 MB even with 128 input and output channels. For more cores (e.g. 18),
we could go to executing our algorithm on up to 128 channels for FFT and 256 channels for Winograd.
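The storage figures above follow directly from the 4CC′T² byte count. A minimal sketch (function names are assumptions, and 1 MB is taken as 2^20 bytes) reproduces them and expresses the footprint as a fraction of the total L3 capacity:

```python
MIB = 1024 * 1024


def kernel_bytes(C, C_out, T, bytes_per_number=4):
    """Bytes needed for the T*T right-hand matrices of shape C x C': 4 * C * C' * T^2."""
    return bytes_per_number * C * C_out * T * T


def l3_fraction(C, C_out, T, cores, l3_bytes_per_core):
    """Fraction of the total shared L3 occupied by the right-hand matrices."""
    return kernel_bytes(C, C_out, T) / (cores * l3_bytes_per_core)


print(kernel_bytes(32, 32, 16) / MIB)    # FFT, T = 16, 32 channels
print(kernel_bytes(64, 64, 16) / MIB)    # FFT, T = 16, 64 channels
print(kernel_bytes(128, 128, 8) / MIB)   # Winograd, T = 8, 128 channels
```

What counts as a comfortable fraction of the L3 cache is left to the caller, since the text only requires it to be a constant fraction of the cache.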
Hence, we can set C, C′ and T such that the right-hand matrices fit in the L3 cache and the tiles make sense
from the perspective of the transformed computation. In the following, we focus our attention on this meaningful
setting of parameters and consider the constraints and implications for the remaining key parameter R.
2 FFT operates on complex numbers, while Winograd operates on reals but suffers from numerical precision issues for larger
tile sizes.
3 While the FFT needs to store complex numbers, which require a pair of floats per number, it only has to store half of the
matrices, due to the conjugate anti-symmetric nature of the FFT transforms.
The main, non-trivial question for the remainder of the paper is whether the parameter R can be chosen,
and the actual details of the algorithm designed, in such a way that it is more efficient than the 3-stage
approach. We answer this question in the affirmative from both a theoretical and a practical perspective.
4.1.2 Setting of R
The implementation of the forward transform can utilize streaming instructions for reading the input tiles,
and similarly, the inverse transform can utilize streaming writes for output tiles. Right-hand matrices can
be streamed from the L3 cache for multiplication. However, we need to ensure that during the matrix
multiplications, left-hand matrices and the results of the multiplication do not fall out of the L2 cache. More
precisely:
Lower Bound: R has to be large enough that the matrix multiplications have a large arithmetic intensity
to give a good compute utilization, given that the right-hand matrices are read from the L3 cache, while
input tiles are read from and output tiles are written to the main memory.
Upper Bound: R has to be such that the left-hand matrices together with the resulting matrices fit
into L2 cache, so that (a) left-hand matrices are in the L2 cache while the results of the multiplication are
computed and (b) the results are read from the L2 cache by the subsequent inverse transform.
Notice that violating the upper bound means that we may have to access main memory (or sometimes,
L3 cache) instead of L2 cache for crucial intermediate data, which is likely to greatly and adversely affect the
performance. Therefore, we treat the upper bound as a hard constraint. On the other hand, lower bound
is a softer constraint. There is no single threshold that has to be satisfied - instead there is a range where
increasing R improves the compute utilization of the matrix multiplications.
Generally we want R as large as possible, but not higher than the upper bound. We perform the roofline
analysis in Section 5. But first, we introduce a technique that is critical for the efficiency of our algorithm.
Without this technique, the upper bound constraint would force us to pick smaller R, which might limit us
in achieving a good arithmetic intensity. With this technique, we can fit more data in L2 cache, relaxing
upper bound constraint almost by a factor of two. This lets us pick a larger R and directly translates into
better compute utilization.
4.2 Shared Buffer
Recall that the left-hand matrices have shape R × C and the result matrices of the multiplication have shape
R × C′. There are T² pairs of these matrices. As the upper bound constraint in the previous section dictates,
we would like to keep all these matrices at (no lower than) the L2 cache of the processor performing the task.
In particular, we would like the left-hand matrices to live in the L2 cache from the moment when they are
generated by the forward transform to the moment when they are accessed for the matrix multiplication.
Similarly, we would like the matrix multiplication results to stay in the L2 cache until the
inverse transform accesses them.
The processor is performing T² matrix-matrix multiplications whereby the right-hand matrices are
streamed from the L3 cache. So, a standard way for us to satisfy the constraint is to pick R such that the
total amount of memory required for both left-hand and result matrices, T² · (4RC + 4RC′) = 4RT² · (C + C′)
bytes, fits in a constant fraction (typically 50%) of the processor’s L2 cache. The shared buffer allows us
to use significantly less space, T² · S_max + S_min to be precise, where S_max = max(4RC, 4RC′) and S_min =
min(4RC, 4RC′). S_max and S_min represent the maximum and minimum, respectively, between the
memory requirement of a left-hand matrix and the memory requirement of a result matrix.
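The two memory requirements can be compared with a short sketch. The function names are assumptions; the small cases use T = 2 (4 multiplications) with 4-byte numbers, and the last line uses a hypothetical configuration with R = 24, C = C′ = 64 and T = 7.

```python
def separate_bytes(R, C, C_out, T):
    """Left-hand and result matrices stored separately: 4 * R * T^2 * (C + C')."""
    return 4 * R * T * T * (C + C_out)


def shared_bytes(R, C, C_out, T):
    """Shared buffer: T^2 * S_max + S_min, with S = 4RC and 4RC' per matrix."""
    s_left, s_result = 4 * R * C, 4 * R * C_out
    return T * T * max(s_left, s_result) + min(s_left, s_result)


def savings(R, C, C_out, T):
    """Relative reduction in buffer size from sharing."""
    return 1 - shared_bytes(R, C, C_out, T) / separate_bytes(R, C, C_out, T)


print(savings(1, 8, 8, 2))     # equal matrix sizes of 32 bytes
print(savings(1, 6, 10, 2))    # 24-byte left-hand, 40-byte result matrices
print(savings(24, 64, 64, 7))  # hypothetical realistic configuration
```

When C = C′ the shared buffer needs T² + 1 matrix slots instead of 2T², so the saving approaches one half as T grows.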
Figure 1 illustrates the savings of memory in the L2 cache using the shared buffer. The left-hand matrices
are stored consecutively, aligned with the end of the buffer, leaving some extra unused space at the beginning
of the buffer. In example (a) S_max = S_min (i.e. C = C′), so the result of the first matrix multiplication is
written precisely to the extra space at the beginning of the shared buffer, while the result of the second matrix
multiplication is stored precisely in place of the first left-hand matrix, which is no longer needed. The key
aspect is that once the i-th multiplication happens, the i-th left-hand matrix is no longer needed, and its space
can be reused for the subsequent result matrices.
(a) S_min = S_max = 32 bytes (b) S_min = 24 bytes and S_max = 40 bytes
Figure 1: Simple examples with 4 matrix multiplications. (a) Each left-hand matrix and each result matrix
has a size of 32 bytes, occupying 8 slots (e.g. 4-byte floats). The memory storing left-hand matrices is colored
in shades of green, and the memory storing the results of multiplications is colored in shades of blue. The
buffer occupies 40 slots (160 bytes), providing a 37.5% improvement versus storing these matrices separately,
which would require 32 · 2 = 64 slots (256 bytes). (b) In this example, left-hand matrices occupy 24 bytes (6
slots) while results of multiplication occupy 40 bytes (10 slots) each. Here the shared buffer provides 28.125%
savings (46 slots vs 64 slots).
The scheme is general and works for arbitrary sizes. In fact, when the size of a result matrix is smaller than or
equal to the size of a left-hand matrix, at the end of all T² multiplications the shared buffer contains
all result matrices, followed by the last left-hand matrix. Figure 1-(b) provides a more general example with
this property.
In general, the results of the i-th multiplication may overwrite contents of up to the (i − 1)-st left-hand
matrix, but never the i-th left-hand matrix (or later ones), as matrix multiplication cannot be efficiently
performed in-place (see footnote 4). However, overwriting left-hand matrices used by completed (earlier) multiplications is
always safe.
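The safety property can be checked with a small layout simulation. This sketch assumes, as the figure examples suggest, that left-hand matrices are packed against the end of the buffer and that result matrices are written consecutively from its start; the function names and the slot-based units are illustrative.

```python
def shared_buffer_layout(n, left_size, result_size):
    """Return (total, left, result): buffer size and [start, end) slot ranges per matrix."""
    total = n * max(left_size, result_size) + min(left_size, result_size)
    left_start = total - n * left_size  # left-hand matrices are aligned with the end
    left = [(left_start + i * left_size, left_start + (i + 1) * left_size)
            for i in range(n)]
    result = [(i * result_size, (i + 1) * result_size) for i in range(n)]
    return total, left, result


def is_safe(left, result):
    """Result i must end before left-hand matrix i starts.

    Left-hand matrices are stored in increasing order, so this also protects
    every later left-hand matrix; only already used ones can be overwritten.
    """
    return all(result[i][1] <= left[i][0] for i in range(len(left)))


total, left, result = shared_buffer_layout(4, 6, 10)
print(total, left, result, is_safe(left, result))
```

With 4 multiplications, 6-slot left-hand matrices and 10-slot results, the buffer has 46 slots and the third result overlaps only the first two left-hand matrices, which are no longer needed at that point.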
5 Theoretical Analysis
In Section 4.1 we established that for reasonable tile sizes for the transforms, all right-hand matrices can
comfortably fit in the L3 cache of modern processors if we consider up to 128 or 256 input and output channels
for Winograd (64 or 128 for FFT), depending on the number of cores.
However, we still need to ensure that we can pick an R that (a) gives sufficient compute utilization for
our algorithm given all the required L3 and main memory accesses, and (b) simultaneously, allows the shared
buffer to fit in a constant fraction (around 50%; see footnote 5) of a processor’s L2 cache.
5.1 Arithmetic Intensity vs CMR
For the purposes of this section, we assume that the intermediate data while performing tasks do not spill
beyond the L2 cache. This is a hard requirement which we consider below.
The arithmetic complexity of each task is at least 2αRCC′T², which is the number of FLOPs required to
perform the matrix-matrix multiplications after the forward transform. Here α is 1 for Winograd and 2 for
FFT [18].
4For efficient implementation, an input of matrix multiplication may be read in any order or more than once, and cannot be
overwritten.
5Typical design choice.
The amount of memory that is read from the L3 cache is 4CC′T² bytes. This gives an arithmetic intensity of R/2
with respect to the L3 cache. While there’s no available specification of the L3 bandwidth on Intel processors, we know that the cache
operates at the cache-ring frequency, and that its bandwidth is generally constant regardless of the number of cores. The L3’s
CMR should therefore be obtained empirically, by measuring the throughput of the L3 cache and dividing
the theoretical peak FLOPS of the chip by the obtained number. For the two machines we used in the
experimental section, we get that the L3 CMR of the SkylakeX processor was around 10, and for the Kaby
Lake mobile processor (i7 MacBook Pro) it was around 4 (see footnote 6).
Hence, on the SkylakeX processor, we need R ≥ 20, and for the mobile i7 CPU we need R ≥ 8 in order
to aim for full compute utilization at the L3 level.
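The lower bounds on R quoted above follow from requiring the L3-level arithmetic intensity R/2 to reach the measured L3 CMR. A minimal sketch, with assumed function names and the Winograd case α = 1:

```python
import math


def l3_arithmetic_intensity(R, C, C_out, T):
    """FLOPs per byte read from L3: 2*R*C*C'*T^2 FLOPs over 4*C*C'*T^2 bytes, i.e. R / 2."""
    flops = 2 * R * C * C_out * T * T
    l3_bytes = 4 * C * C_out * T * T
    return flops / l3_bytes


def min_R_for_l3(cmr_l3):
    """Smallest integer R with R / 2 >= CMR of the L3 cache."""
    return math.ceil(2 * cmr_l3)


print(min_R_for_l3(10))  # SkylakeX, L3 CMR around 10
print(min_R_for_l3(4))   # mobile i7, L3 CMR around 4
```

The channel counts and tile size cancel out, so at the L3 level only R determines the arithmetic intensity.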
For the main memory, the CMR can be easily computed as the ratio of the processor’s peak FLOPS and
the memory bandwidth (number of channels times the frequency times 8 bytes per transfer), which was 35
for the SkylakeX and 13 for the i7. The size of the input is 4RT²C, the size of the output is 4RT²C′, and the
arithmetic complexity is at least 2RCC′T². Hence, when the input tiles are read from the main memory, and
the output tiles are written to the main memory, the arithmetic intensity would be CC′/(2 · (C + C′)) ≥ min(C, C′)/4.
Hence, when we have at least 60 input and output channels for SkylakeX architectures (and at least 24 for
i7), it is also possible to achieve full utilization at the main memory level. At least 32 or 64 channels is in
line with our previous constraints (i.e. we can afford more channels and still fit the right-hand matrices in
L3 cache).
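The ratio and its lower bound at the main memory level can be written out step by step from the input size, output size and FLOP count given above; the only extra fact used is C + C′ ≤ 2 max(C, C′):

```latex
\[
\frac{2RCC'T^2}{4RT^2C + 4RT^2C'}
  = \frac{CC'}{2\,(C + C')}
  = \frac{\min(C,C')\,\max(C,C')}{2\,(C + C')}
  \ge \frac{\min(C,C')\,\max(C,C')}{2 \cdot 2\max(C,C')}
  = \frac{\min(C,C')}{4}.
\]
```

Both R and T cancel, so at the main memory level the ratio depends only on the channel counts.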
5.2 Fitting into L2
Recall that the shared buffer reduces our memory requirement to 4RT² max(C, C′) + 4R min(C, C′) bytes.
This amount is upper bounded by 4R max(C, C′) · (T² + 1), which is nicer to work with (and
may also allow for a more natural implementation).
The L2 cache size is typically 256 KB for AVX2 (our i7) architectures, and 1 MB for AVX512
(SkylakeX) architectures. We would like the shared buffer to fit in half of the available L2 cache, and using
our upper bound we get that, for i7,
R max(C, C′) · (T² + 1) ≤ 256 · 1024/(4 · 2) = 32 · 1024
needs to hold. An analogous derivation for SkylakeX gives the following requirement:
R max(C, C′) · (T² + 1) ≤ 128 · 1024.
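The two inequalities can be turned into a largest admissible R for a given layer. This sketch assumes the 50% L2 budget and 4-byte numbers stated above; the function name and the 64-channel, T = 7 example are illustrative choices.

```python
def max_R(l2_bytes, C, C_out, T, fraction=0.5, bytes_per_number=4):
    """Largest R with R * max(C, C') * (T^2 + 1) <= fraction * l2_bytes / bytes_per_number."""
    budget_numbers = int(l2_bytes * fraction) // bytes_per_number
    return budget_numbers // (max(C, C_out) * (T * T + 1))


KIB = 1024
print(max_R(256 * KIB, 64, 64, 7))   # i7-like core with 256 KB of L2
print(max_R(1024 * KIB, 64, 64, 7))  # SkylakeX-like core with 1 MB of L2
```

Because the bound scales with 1/max(C, C′), doubling the number of channels halves the largest R that still fits in the L2 budget.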
6 Experiments
We executed the main experiments on an 18-core Intel 7980xe (2.6 GHz, 4 memory channels at 21.3 GB/s each), with
20 MB of shared L3 cache and 1 MB of L2 cache per core. Despite being a desktop grade machine, it provides native
AVX512 instruction set support.
We compare our implementation of L3-fused Winograd convolution to the state-of-the-art implementations
of ZNN [8] and DNNL [3] (formerly known as MKL-DNN), along with our own non-fused 3-stage
implementation. We run experiments for typical convolutional layers in the VGG and ResNet networks, 4
layers for each, ranging from 64 to 512 channels. The difference is that VGG’s layers have 4 times larger
spatial dimensions. In particular, the VGG layers that we experiment on are
• 64 channels and D = W = 224,
• 128 channels and D = W = 112,
• 256 channels and D = W = 56, and
• 512 channels and D = W = 28.
6The actual requirements were a bit lower, but we would like to take a conservative estimate.
[Figure 2 plots: two panels, ‘ResNet Layers’ and ‘VGG Layers’, showing Execution Time (ms) against Number of Channels (64, 128, 256, 512) for ZNN, DNNL, our non-fused, and L3 fused.]
Figure 2: Benchmark results on 18-core Intel 7980xe.
ResNet layers that we experiment on are:
• 64 channels and D = W = 56,
• 128 channels and D = W = 28,
• 256 channels and D = W = 14, and
• 512 channels and D = W = 7.
All these layers have kernel size 3× 3 and low and high padding of 1. We choose batch 64 which is typical
for benchmarks. Together, these layers cover typical combinations of possibilities for a range of channels and
spatial dimensions.
We let ZNN optimize its parameter space for each layer (in particular, we let it choose different tile size
and other internal parameters, such as row blocking, specifically, for each one of the 8 layer types tested).
While DNNL does not expose an interface for parameter search, we also benchmark layer types independently,
allowing the framework to make the best decisions for each one. We also pick the best times for each
layer obtained by our non-fused implementation among different configurations. Our baseline non-fused
implementation is not particularly efficient (nor is that the point of the paper), but it does get competitive
results to ZNN and DNNL for most of the tested layers.
On the other hand, we fix a reasonable configuration for the L3-fused algorithm, with R = 24 and T = 7
(i.e. 7×7 tile size), in line with our theoretical derivations, and run the algorithm with this fixed configuration
for all 8 tested layer types. This way, we demonstrate the robustness of our algorithm, and its superior
performance even without fine tuning parameters for each particular layer (which can only improve the
results further).
Figure 3: Benchmark results on 4-core MacBook Pro. The plot shows Execution Time (ms) against Number of Channels (32, 64, 128, 256) for our non-fused and L3 fused.
The results are shown in Figure 2 and are consistent with our theoretical predictions. The L3-fused algorithm
reliably and significantly outperforms the best of the 3 other implementations on all layers with 64 and 128
channels. On the 64-channel layer of ResNet (VGG), L3 fusion takes 3.16 (46.27) ms as opposed to the second
best time of 5.38 (73.04) ms, obtained by DNNL. The improvements are similar for 128 channels, while on 256 channels
L3-Fusion still achieves close to the best performance in both cases (actually the best performance on the
VGG layer).
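The 64-channel timings quoted above translate into speedups as follows. This is plain arithmetic on the reported numbers; the dictionary layout and the `speedup` helper are illustrative.

```python
# Timings in milliseconds for the 64-channel layers, copied from the text.
timings_ms = {
    "ResNet": {"L3 fused": 3.16, "DNNL": 5.38},
    "VGG": {"L3 fused": 46.27, "DNNL": 73.04},
}


def speedup(baseline_ms: float, fused_ms: float) -> float:
    """Ratio of the second-best time to the L3-fused time."""
    return baseline_ms / fused_ms


speedups = {net: speedup(t["DNNL"], t["L3 fused"])
            for net, t in timings_ms.items()}

for net, s in speedups.items():
    print(f"{net}: L3 fused is {s:.2f}x faster than DNNL")
```

That is, L3 fusion is roughly 1.70 times faster than the second-best implementation on the 64-channel ResNet layer and roughly 1.58 times faster on the 64-channel VGG layer.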
6.1 i7 experiment
We also perform experiments on a 4-core MacBook Pro with a 3.1 GHz Intel Core i7, two 1.6 GHz memory channels
at 12.8 GB/s each, 8 MB of shared L3 cache and 256 KB of per-core L2 cache. The AVX2 instruction set is natively
supported, but the AVX512 instruction set is not.
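These machine parameters can be plugged into a basic roofline estimate [17]. The sketch below is only indicative: the core count, clock and memory bandwidth come from the description above, while the peak compute rate assumes two 8-wide single-precision FMA units per core running at the nominal clock, which the text does not state. It also models only main memory, not the L2 and L3 levels of the hierarchical roofline.

```python
# From the text.
CORES = 4
CLOCK_HZ = 3.1e9
MEM_CHANNELS = 2
CHANNEL_BANDWIDTH = 12.8e9   # bytes per second, per channel

# Assumptions (not stated in the text).
FMA_UNITS_PER_CORE = 2       # assumed number of FMA units per core
FP32_LANES = 8               # AVX2: 256-bit registers / 32-bit floats
FLOPS_PER_FMA = 2            # one multiply and one add


def peak_flops() -> float:
    """Assumed peak single-precision FLOP/s of the whole chip."""
    return CORES * CLOCK_HZ * FMA_UNITS_PER_CORE * FP32_LANES * FLOPS_PER_FMA


def memory_bandwidth() -> float:
    """Aggregate main-memory bandwidth in bytes per second."""
    return MEM_CHANNELS * CHANNEL_BANDWIDTH


def attainable_flops(arithmetic_intensity: float) -> float:
    """Roofline bound for a kernel with the given FLOPs per byte moved."""
    return min(peak_flops(), arithmetic_intensity * memory_bandwidth())


ridge_point = peak_flops() / memory_bandwidth()
print(f"peak: {peak_flops() / 1e9:.1f} GFLOP/s")
print(f"bandwidth: {memory_bandwidth() / 1e9:.1f} GB/s")
print(f"ridge point: {ridge_point:.1f} FLOP/byte")
```

Under these assumptions a kernel needs about 15.5 FLOPs per byte of main-memory traffic before it stops being bandwidth bound, which is why keeping intermediate data in cache matters on such a machine.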
We compare the L3-fused algorithm with our non-fused baseline (DNNL and ZNN only support CPUs with the AVX512
instruction set), for tile size 7. Based on our calculations, we set the parameter R = 8 for the L3-fused algorithm.
Since we expect the L3-fused algorithm to perform best on i7-like architectures for
smaller numbers of channels, we consider the following convolutional layers:
• 32 channels and D = W = 112,
• 64 channels and D = W = 56,
• 128 channels and D = W = 28, and
• 256 channels and D = W = 14.
The results are presented in Figure 3 and validate the performance analysis we carried out for the i9 architecture. They
show that the L3 fusion approach is widely applicable and provides impressive performance for interesting
parameter regimes as long as a shared L3 cache is available.
7 Conclusions
While there is no ‘one size fits all’ approach, we advance the state of the art by providing the most efficient
algorithm for modern CPUs for executing many typical convolutional layers using transformed computation.
Instead of computing full intermediate results, which are large and get stored in main memory, our
algorithm crucially relies on the shared L3 cache, which on modern processors is quite large and is likely
to grow further in the future (e.g. see [5]). We show that the L3 cache pressure depends on the number
of channels; thus, as the number of channels increases, the L3-fused approach becomes inferior to the standard
non-fused one. However, the trade-off is expected to swing further towards our fused approach on upcoming
CPUs with larger shared caches.
In our work, we explained how to find a theoretically optimal value for the hyper-parameter R. This
parameter can also be tuned empirically: tuning R is not hard, and it can be done once, with the result stored in a
wisdom file.
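One possible shape for such one-off tuning is sketched below. It is not the authors' implementation: `benchmark` is a hypothetical callable that runs a layer once with a given R and returns the elapsed time in seconds, and the JSON file format and the function names `tune_r` and `lookup_or_tune` are assumptions made for illustration.

```python
import json
from pathlib import Path


def tune_r(benchmark, candidates, repeats=3):
    """Return the candidate R with the smallest measured time.

    `benchmark(r)` is a hypothetical callable returning elapsed seconds
    for one run of the layer with hyper-parameter R = r.
    """
    best_r, best_time = None, float("inf")
    for r in candidates:
        # Take the minimum over a few repeats to reduce timing noise.
        elapsed = min(benchmark(r) for _ in range(repeats))
        if elapsed < best_time:
            best_r, best_time = r, elapsed
    return best_r


def lookup_or_tune(wisdom_path, layer_key, benchmark, candidates):
    """Read R for `layer_key` from the wisdom file, tuning it if absent."""
    path = Path(wisdom_path)
    wisdom = json.loads(path.read_text()) if path.exists() else {}
    if layer_key not in wisdom:
        wisdom[layer_key] = tune_r(benchmark, candidates)
        path.write_text(json.dumps(wisdom, indent=2))
    return wisdom[layer_key]
```

A caller would invoke `lookup_or_tune("wisdom.json", "c64_d56", run_layer, [8, 16, 24, 32])`, where `run_layer` and the key naming are again hypothetical; only the first call for a given layer pays the tuning cost.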
An interesting future direction is to apply the same ideas to other problems or to other types of
hardware. Another question is whether in some cases it could be faster to compute a convolution with
C′ = c1 · C input and C′′ = c2 · C′ output channels by performing c1 · c2 convolutions, each with C input and
C′ output channels (and appropriately summing up the results), especially if L3 fusion can be very efficient
for each of these smaller convolutions. Of course, the trade-off here is the cost of performing input transforms
c1 times and output transforms c2 times.
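The algebraic identity behind this question can be checked on a toy example. The sketch below uses a direct spatial convolution on small integer tensors, not a transformed or L3-fused one, so it says nothing about performance; it only confirms that splitting the channels into blocks and summing the partial results reproduces the full convolution. All function names and sizes are illustrative.

```python
import random


def conv2d(x, w):
    """Direct 'valid' convolution (cross-correlation form).

    x: [C_in][H][W] nested lists, w: [C_out][C_in][K][K] nested lists.
    """
    c_in, height, width = len(x), len(x[0]), len(x[0][0])
    k = len(w[0][0])
    return [[[sum(w[o][c][i][j] * x[c][y + i][z + j]
                  for c in range(c_in) for i in range(k) for j in range(k))
              for z in range(width - k + 1)]
             for y in range(height - k + 1)]
            for o in range(len(w))]


def add(a, b):
    """Elementwise sum of two [C][H][W] nested lists."""
    return [[[p + q for p, q in zip(ra, rb)] for ra, rb in zip(ca, cb)]
            for ca, cb in zip(a, b)]


def conv2d_split(x, w, in_block, out_block):
    """Same result as conv2d, assembled from smaller convolutions.

    Returns (output, number_of_small_convolutions).
    """
    out, count = [], 0
    for o0 in range(0, len(w), out_block):
        acc = None
        for i0 in range(0, len(x), in_block):
            w_blk = [wo[i0:i0 + in_block] for wo in w[o0:o0 + out_block]]
            part = conv2d(x[i0:i0 + in_block], w_blk)
            # Partial results over input-channel blocks are summed.
            acc = part if acc is None else add(acc, part)
            count += 1
        out.extend(acc)
    return out, count


# Tiny illustrative sizes: C = 2, c1 = 2, c2 = 3.
C, c1, c2 = 2, 2, 3
C_in, C_out = c1 * C, c2 * c1 * C   # C' = 4 input, C'' = 12 output channels
rng = random.Random(0)
x = [[[rng.randint(-3, 3) for _ in range(5)] for _ in range(5)]
     for _ in range(C_in)]
w = [[[[rng.randint(-3, 3) for _ in range(3)] for _ in range(3)]
      for _ in range(C_in)] for _ in range(C_out)]

full = conv2d(x, w)
split, n_small = conv2d_split(x, w, in_block=C, out_block=C_in)
print(split == full, n_small)
```

Each small convolution here has C input and C′ output channels, and c1 · c2 of them are performed in total, matching the decomposition described above.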
References
[1] Falcon library: Fast image convolution in neural networks on Intel architecture. https://colfaxresearch.com/falcon-library/.
[2] Intel quietly kills off Xeon Phi. https://www.extremetech.com/extreme/290963-intel-quietly-kills-off-xeon-phi.
[3] Intel(R) Math Kernel Library for Deep Neural Networks. https://github.com/01org/mkl-dnn.
[4] LIBXSMM. https://github.com/hfp/libxsmm.
[5] Wikichip: Crystal Well. https://en.wikichip.org/wiki/intel/crystal_well.
[6] D. Budden, A. Matveev, S. Santurkar, S. R. Chaudhuri, and N. Shavit. Deep tensor convolution on
multicores. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages
615–624, 2017.
[7] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of
the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[8] Z. Jia, A. Zlateski, F. Durand, and K. Li. Optimizing n-dimensional, winograd-based convolution for
manycore cpus. In Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of
Parallel Programming, (PPoPP), pages 109–123, 2018.
[9] T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM review, 51(3):455–500,
2009.
[10] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural
networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
[11] A. Lavin and S. Gray. Fast algorithms for convolutional neural networks. In Proceedings of the 29th
IEEE Conference on Computer Vision and Pattern Recognition, (CVPR), pages 4013–4021, 2016.
[12] Y. LeCun, P. Haffner, L. Bottou, and Y. Bengio. Object recognition with gradient-based learning. In
Shape, contour and grouping in computer vision, pages 319–345. 1999.
[13] X. Liu, J. Pool, S. Han, and W. J. Dally. Efficient sparse-winograd convolutional neural networks. arXiv
preprint arXiv:1802.06367, 2018.
[14] L. R. Rabiner and B. Gold. Theory and application of digital signal processing. Prentice-Hall,
Englewood Cliffs, NJ, 1975.
[15] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition.
arXiv preprint arXiv:1409.1556, 2014.
[16] N. Vasilache, J. Johnson, M. Mathieu, S. Chintala, S. Piantino, and Y. LeCun. Fast convolutional nets
with fbfft: A gpu performance evaluation. arXiv preprint arXiv:1412.7580, 2014.
[17] S. Williams, D. Patterson, L. Oliker, J. Shalf, and K. Yelick. The roofline model: A pedagogical tool for
auto-tuning kernels on multicore architectures. In Hot Chips, volume 20, pages 24–26, 2008.
[18] A. Zlateski, Z. Jia, K. Li, and F. Durand. The anatomy of efficient fft and winograd convolutions on
modern cpus. In Proceedings of the 33rd ACM International Conference on Supercomputing, (ICS),
pages 414–424, 2019.
