Intelligent-Unrolling: Exploiting Regular Patterns in Irregular
  Applications by Liu, Changxi et al.
Changxi Liu
School of Computer Science and
Engineering, Beihang University
changxi.liu@buaa.edu.cn
Hailong Yang
School of Computer Science and
Engineering, Beihang University
hailong.yang@buaa.edu.cn
Xu Liu
Department of Computer Science,
College of William and Mary
xl10@cs.wm.edu
Zhongzhi Luan
School of Computer Science and
Engineering, Beihang University
07680@buaa.edu.cn
Depei Qian
School of Computer Science and
Engineering, Beihang University
depeiq@buaa.edu.cn
Abstract
Modern optimizing compilers are able to exploit memory access or
computation patterns to generate vectorized code. However, such
patterns in irregular applications are unknown until runtime because
they depend on the input. Thus, neither the compiler's static
optimization nor profile-guided optimization based on specific inputs
can predict the patterns for arbitrary inputs, which leads to
suboptimal code generation. To address this challenge, we develop
Intelligent-Unroll, a framework to automatically optimize irregular
applications with vectorization. Intelligent-Unroll allows users to
depict the computation task using a code seed, with the memory access
and computation patterns represented in a feature table and an
information-code tree, and generates highly efficient code.
Furthermore, Intelligent-Unroll employs several novel optimization
techniques to optimize reduction operations and gather/scatter
instructions. We evaluate Intelligent-Unroll with sparse matrix-vector
multiplication (SpMV) and graph applications. Experimental results
show that Intelligent-Unroll is able to generate more efficient
vectorized code than the state-of-the-art implementations.
Keywords irregular application, data access and instruction pattern,
code optimization
1 Introduction
With SIMD instructions adopted on modern CPU architectures, the
performance gap between CPU and memory has become even larger.
Compilers have developed powerful static analysis to accelerate
applications automatically by leveraging the SIMD units on the CPU.
However, this works well only for regular applications. In addition,
although complex instructions such as reduction, gather and scatter
are supported on CPU architectures to optimize irregular applications,
the performance obtained with compiler optimizations is often
suboptimal. In particular, when there are potential write conflicts,
compilers usually give up on SIMD instructions, trading performance
for correctness. As SIMD units become pervasive on modern CPU
architectures, leaving performance on the table for irregular
applications, which account for a large portion of scientific
applications, becomes unacceptable.
Regular applications can be optimized by the static analysis of
compilers because their memory access and instruction patterns are
known at compile time. For irregular applications, however, the memory
access and instruction patterns can only be analyzed during runtime.
Therefore, compilers fail to identify the performance opportunities
of irregular applications. For instance, on a SIMD architecture, the
compiler fails to optimize the program when confronting potential
write conflicts. However, if the runtime behavior of the data accesses
can be identified, then we can resolve the write conflicts for better
parallelization using the SIMD units. Another example of the
compiler's inability to optimize irregular applications can be found
at the instruction level. For the gather/scatter/reduction
instructions that are widely used in irregular applications, if we can
establish that the runtime data accesses are contiguous or fall in the
same vector lane, then we can replace these instructions with load and
permutation instructions for better performance.
However, there are several challenges in realizing the potential
performance opportunities of irregular applications that compilers
cannot provide. Firstly, different from regular applications, the
memory access and instruction patterns vary significantly across
different irregular applications. Therefore, a general approach is
needed to adapt to the various behaviors of irregular applications.
Secondly, naively unrolling the instructions of irregular
applications could lead to memory bloat that prevents further
performance optimization. It is mandatory to constrain the memory
occupancy when analyzing the runtime behaviors of irregular
applications. Thirdly, the optimization method for irregular
applications should be able to adapt to various underlying
architectures in order to improve its practical adoption.
arXiv:1910.13346v1 [cs.DC] 24 Oct 2019
To address the above challenges, we propose Intelligent-Unroll, a
framework for optimizing irregular applications on SIMD architectures
automatically. There are three important components in
Intelligent-Unroll: the code seed, the feature table, and the
information-code tree. The design of Intelligent-Unroll is easily
extensible by adding new features. Intelligent-Unroll has already
integrated several optimization techniques for reduction, gather and
scatter instructions for better performance. When evaluated with
representative workloads, Intelligent-Unroll is able to generate more
efficient code on various SIMD architectures than the
state-of-the-art implementations.
Specifically, this paper makes the following contributions:
• We propose Intelligent-Unroll, a framework that identifies the
regular patterns within irregular applications, and automatically
optimizes the instructions and data synergistically by generating
more efficient code.
• We propose several techniques such as code seed, feature table and
information-code tree to identify the opportunity to replace the
reduction instructions with load instructions, and the gather
instructions with an instruction group of load, shuffle and select
instructions for better performance.
• We evaluate with representative workloads such as SpMV and PageRank
on KNL and Intel Xeon CPUs. The experimental results demonstrate
that the code automatically generated by Intelligent-Unroll achieves
better performance than the state-of-the-art implementations.
The remainder of this paper is organized as follows. Section 2
presents the background of irregular applications and the
corresponding optimization methods. Section 3 describes the motivation
of our work. Section 4 presents the design overview of
Intelligent-Unroll. Section 5 and Section 6 describe the
implementation details of the optimizations on reduction and gather
operators. Section 7 presents the evaluation results of SpMV and
PageRank compared to the state-of-the-art implementations. Section 8
presents the related work in the field, and Section 9 concludes this
paper.
2 Background
2.1 Understanding Irregular Applications
Irregular applications are common in both traditional research fields
such as high performance computing and emerging research fields such
as big data analysis and deep learning, which exhibit a constant
demand for higher performance. The difference between irregular and
regular applications is whether the patterns of data access and
instruction can be known before runtime. For irregular applications,
these patterns are strongly correlated with the input data and can
only be known during runtime. Such uncertainty introduces difficulties
for compiler optimization, such as irregular memory accesses, unbiased
branches and write conflicts.
For irregular applications, two important concepts describe their data
access and instruction patterns: access arrays and data arrays [13].
Algorithm 1 presents two code examples of irregular applications. We
can see that the access arrays contain the indirect access or branch
execution sequence (line 2 and line 6), whereas the data arrays are
mostly accessed indirectly through the access arrays (line 3). Another
example of irregular applications is the inference process of sparse
neural networks [20, 25]: although the data arrays can be updated
during inference, the access arrays are immutable or updated
infrequently. The above observations inspire us to design a mechanism
for uncovering the potential performance of irregular applications and
applying the corresponding optimizations automatically.
Algorithm 1 The code examples of irregular applications
1: function Irregular Memory Access
2:   idx ← Load access_array[...]
3:   data ← Load data_array[idx]
4: function Unbiased Branches
5:   cond ← Load access_array[...]
6:   if cond then
7:     ...
2.2 Optimizing Irregular Applications
The performance gap between CPU and memory is still increasing.
Although multi-level memory hierarchies are introduced to hide memory
access latency, they still cannot catch up with the instruction-level
parallelism developed in hardware, such as SIMD, multi-stage pipelines
and out-of-order execution. For regular applications, compilers can
generate efficient instructions such as AVX512 through static analysis
of the program patterns. However, with irregular applications,
compiler optimization is quite restricted because the data access and
instruction patterns can only be determined during runtime. For
instance, on SIMD architectures, compilers perform almost no
vectorization of irregular applications if there are potential memory
write conflicts, in order to ensure correctness. This conservative
optimization strategy of existing compilers wastes the opportunities
to exploit the regular patterns within irregular applications for
performance optimization.
Similar to regular applications, the optimization of irregular
applications also focuses on the temporal and spatial reuse of data,
as well as parallel efficiency. Plenty of research works have been
proposed to adapt irregular applications to underlying architectures.
However, most of these works require tremendous engineering efforts
and cannot be easily ported to other architectures. Such ad-hoc
optimizations are unsustainable as new architectures and applications
are developed at an unprecedented rate, especially in emerging domains
such as deep learning. In addition, the optimization of irregular
applications has also been studied in domain-specific compilers such
as Halide [26], Tensor Comprehensions [28] and TVM [7, 12]. These
studies provide an efficient way to generate high-performance code for
specific application domains. These domain-specific compilers motivate
our work to design a compilation framework for irregular applications
that can analyze the data access and instruction patterns to generate
efficient code automatically. We choose LLVM [19] as our compilation
backend, because the JIT APIs in LLVM allow us to analyze the
execution patterns and generate optimized code at runtime.
[Figure 1: (a) regular code: for (i = 0, offset = 2; i < 8; i += offset)
Load A[i]; (b) irregular code: B = {0,0,2,5}; for (i = 0; i < 4; i++)
Load A[B[i]]. In both panels, Method 1 issues one Gatherv4 instruction,
while Method 2 issues Loadv4 A and Loadv4 A + 4 followed by a shuffle.]
Figure 1. The memory access optimization of regular application (a),
and irregular application (b).
3 Motivation
The memory access patterns of regular applications are already
optimized by compilers using static analysis [1, 24]. However, the
memory access patterns of irregular applications are dictated by the
data being processed, which can only be known during runtime.
Therefore, existing compilers fail to optimize the performance of such
irregular applications.
The memory access pattern usually has a significant impact on
performance. However, the existing compiler optimizations sometimes
lead to suboptimal performance for irregular applications. For
instance, when loading data from non-contiguous memory addresses,
compilers always generate a gather instruction for the memory load.
However, as shown in Figure 1, replacing the gather instruction
(Method 1) with vload instructions (Method 2) achieves better
performance. For the regular application shown in Figure 1(a),
compilers can automatically perform this optimization through static
analysis. For the irregular application shown in Figure 1(b), since
the memory access pattern can only be recognized during runtime,
compilers generate inefficient code that loads data from memory using
the gather instruction.
Moreover, existing compilers are unable to generate efficient code for
the calculations of irregular applications. For instance, to utilize
the vector units on SIMD architectures, the calculation dependencies
need to be identified for correct vectorization. For the regular
application shown in Figure 2(a), compilers can identify the
calculation dependencies with static analysis and then generate
efficient code. For instance, operation 1, operation 2, operation 3
and operation 4 are independent of each other, and compilers can
leverage this information (Method 2) to optimize performance. However,
when dealing with the irregular application in Figure 2(b), compilers
have to assume that the calculations depend on each other to ensure
correctness. In contrast, the optimization (Method 2) in Figure 2(b)
shows that operation 1 and operation 2 can be processed in parallel,
and then operation 3 and operation 4 can be processed in parallel,
which leads to better performance. Unfortunately, such an optimization
opportunity in irregular applications cannot be identified by
compilers using static analysis.
[Figure 2: (a) regular code: for (i = 0; i < 4; i++) A[i+2] = OP(A[i]);
(b) irregular code: B = {3,2,6,5}; for (i = 0; i < 4; i++)
A[B[i]] = OP(A[i]). Each panel contrasts Method 1 with Method 2 in
terms of code, registers and memory for operations 1 to 4.]
Figure 2. The calculation optimization of regular application (a),
and irregular application (b).
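The grouping described for Figure 2(b) can be illustrated with a small runtime analysis. The sketch below is not the framework's implementation: the function names are hypothetical, and it conservatively treats every read-after-write, write-after-read and write-after-write pair as a dependency. It assigns each operation A[B[i]] = OP(A[i]) to a level so that all operations in one level can execute together.

```python
def parallel_groups(write_idx):
    """Group operations A[write_idx[i]] = OP(A[i]) into conflict-free levels."""
    n = len(write_idx)
    level = [0] * n
    for j in range(n):
        for i in range(j):
            raw = write_idx[i] == j             # j reads the slot i wrote
            war = write_idx[j] == i             # j overwrites the slot i read
            waw = write_idx[j] == write_idx[i]  # both write the same slot
            if raw or war or waw:
                level[j] = max(level[j], level[i] + 1)
    groups = {}
    for op_id, lv in enumerate(level):
        groups.setdefault(lv, []).append(op_id)
    return [groups[lv] for lv in sorted(groups)]


def run_grouped(a, write_idx, op):
    """Execute level by level; inside a level all reads precede all writes."""
    a = list(a)
    for group in parallel_groups(write_idx):
        results = [op(a[i]) for i in group]      # vector-style read + compute
        for i, r in zip(group, results):
            a[write_idx[i]] = r                  # vector-style write
    return a


def run_sequential(a, write_idx, op):
    """Reference semantics: the original scalar loop."""
    a = list(a)
    for i, w in enumerate(write_idx):
        a[w] = op(a[i])
    return a


# B = {3,2,6,5} from Figure 2(b): operations 1 and 2 form the first group,
# operations 3 and 4 the second (0-based ids 0,1 and 2,3).
groups = parallel_groups([3, 2, 6, 5])
```

For B = {3,2,6,5} the analysis yields two groups, matching the two parallel steps described in the text; the same index array is unknown to a static compiler, which is why the analysis has to run on the actual data.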
The above observations indicate that there is a large space for
performance optimization of irregular applications that cannot be
reached by compilers using static analysis. Such performance
opportunities, which involve both memory accesses and calculation
instructions, can only be identified during runtime.
However, naively unrolling the instructions of irregular applications
and then applying optimizations could easily generate a formidable
amount of code, which leads to the instruction bloat problem. In
addition, if we use conditional statements to select the optimal
instructions, the application performance could degrade significantly
due to the branch mispredictions they cause. Moreover, manually
writing specific code for each condition is also impractical, as it
requires tremendous engineering effort. For instance, if the numbers
of cases of the conditions to be optimized are (k1, k2, k3, ...), then
the number of code variants to be written is (k1 × k2 × k3 × ...).
To overcome the above problems, we propose Intelligent-Unroll, a
framework that allows users to provide a code seed to describe the
calculation process of the program. Intelligent-Unroll then
automatically generates efficient instructions for the program.
Specifically, Intelligent-Unroll can identify the regular instruction
patterns and optimize them with efficient instructions. To accomplish
the above goals,
Intelligent-Unroll provides corresponding techniques to tackle
the following challenges:
• How to leverage the code seed to describe diverse data access and
instruction patterns?
• How to adapt instructions to the behaviors of data accesses for
better performance?
• How to optimize the instruction and data access synergistically
without violating the correctness?
4 Intelligent-Unroll: Overview
Intelligent-Unroll is designed to identify the regular data access and
instruction patterns hidden deeply within irregular applications. The
goal of Intelligent-Unroll is to automatically optimize the
instructions and data synergistically for the identified performance
opportunities.
The design overview of Intelligent-Unroll is shown in Figure 3. The
users only need to describe the calculation process using a lambda
expression with its input data, and then Intelligent-Unroll interprets
the calculation expression and automatically generates an efficient
implementation for a particular architecture. The data of the
computation task is classified into mutable data and immutable data.
The immutable data, which is unchanged during the execution of the
task, is used to generate information for the optimization process.
For the optimization process, Intelligent-Unroll first interprets the
lambda expression and generates the code seed. The instruction
patterns contained in the code seed as well as the immutable data are
used by the Information Producer (Figure 3(a)) to generate the Feature
Table (Figure 3(b)), which includes the information required for
further optimization.
The Code Seed describes the calculation process without any concern
for optimization. Based on the Code Seed, the Information Producer
extracts the calculation patterns to generate the Feature Table, and
the Code Optimizer and Data Transfer modules use the Code Seed to
generate optimized code. Each column of the Feature Table corresponds
to the calculation process of one iteration, and each row corresponds
to one operation of the Code Seed (Figure 3(b)). Each element in the
Feature Table describes the instruction feature at the current
iteration. Each column of the Feature Table is denoted as ops_k, where
k is the index of the column. The Feature Table helps us handle
various patterns in irregular applications. We can merge instructions
to optimize the execution based on the information provided by the
Feature Table.
The Code Optimizer and Data Transfer modules in the Information
Producer then process the Feature Table to generate the Intermediate
Representation (IR) code that is independent of the underlying
architecture. Eventually, Intelligent-Unroll lowers the code
implementation to LLVM to generate the machine instructions for the
target architecture.
The design of the Code Optimizer and Data Transfer modules is shown in
Figure 3(c). Firstly, the hash value of each column in the Feature
Table is generated. Columns with the same hash value exhibit the same
calculation pattern. Intelligent-Unroll merges the columns with the
same hash value to generate a hash map. This hash map combines the
instructions with the same calculation pattern, and thus decreases the
memory occupancy during instruction unrolling.
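The column merging step can be pictured with a dictionary keyed by the column pattern. This is a minimal sketch under stated assumptions: the feature values are placeholder integers, `merge_columns` is a hypothetical name, and Python's built-in tuple hashing stands in for whatever hash function the framework uses.

```python
def merge_columns(feature_table):
    """Map each distinct column pattern to the iterations that share it.

    feature_table[k] is the column of iteration k: one feature per
    operation of the code seed. The dict hashes the tuple internally,
    so equal patterns land in the same entry.
    """
    merged = {}
    for k, column in enumerate(feature_table):
        merged.setdefault(tuple(column), []).append(k)
    return merged


# Six iterations, three distinct patterns (placeholder feature values).
table = [(1, 0), (2, 1), (1, 0), (1, 0), (2, 1), (4, 2)]
merged = merge_columns(table)
# Code is generated once per key instead of once per iteration.
num_variants = len(merged)
```

Generating one code variant per key rather than per column is what keeps the unrolled code small when many iterations share a pattern.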
After combining instructions, Intelligent-Unroll continues to process
the hash map to merge instructions with the same write location.
Figure 4(a) shows an example of two instruction groups writing to the
same location. Without merging the instructions, two reduction
operations in addition to two read and write operations to the Write
Addr are required, which wastes computation resources and memory
bandwidth. Figure 4(b) shows the calculation pattern after merging the
instructions. We can see that only one reduction operation is
required. Although in this case we introduce one extra vector
operation, it is far more efficient than a reduction operation.
Eventually, the optimized instructions are generated by the
Optimization Pass and Rearrange Optimization Info modules, the details
of which are described in Section 5 and Section 6.
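The saving from merging can be checked on a simplified case. The sketch below assumes, for illustration only, that every lane of each instruction group targets a single write address and that the reduction is an integer sum; the function names are hypothetical.

```python
def scatter_unmerged(out, groups, addr):
    """One horizontal reduction and one read-modify-write of out[addr] per group."""
    reductions = 0
    for vec in groups:
        out[addr] += sum(vec)       # reduction across the lanes, then update
        reductions += 1
    return reductions


def scatter_merged(out, groups, addr):
    """Add the groups lane by lane first, then reduce and write once."""
    acc = list(groups[0])
    for vec in groups[1:]:
        acc = [a + b for a, b in zip(acc, vec)]   # cheap vertical vector add
    out[addr] += sum(acc)           # single horizontal reduction
    return 1


groups = [[1, 2, 3, 4], [5, 6, 7, 8]]
out_a, out_b = [0] * 4, [0] * 4
count_a = scatter_unmerged(out_a, groups, 2)
count_b = scatter_merged(out_b, groups, 2)
```

Both variants leave the same value at the write address, but the merged one performs a single reduction and a single update of memory.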
5 Reduction Instruction
The reduction instruction is frequently used in programs. However, the
reduction instruction encounters the instruction dependency problem
when parallelized on SIMD architectures. Traditional compilers fall
back to SISD instructions because they fail to identify the dependency
using static analysis. The pseudo-code shown in Figure 5 serves as an
example. Naively applying vectorization could lead to incorrect
results, for example when two or more operations write to the same
location in one SIMD instruction.
Intelligent-Unroll can analyze the write locations and rearrange the
calculation to avoid write conflicts. However, changing the original
calculation order may jeopardize the correctness of the program;
therefore, we need to make sure that correctness is not affected by
the calculation rearrangement. The analysis of the calculation
rearrangement in terms of program correctness is as follows.
The reduction operator is both associative and commutative. We denote
the reduction operator as ∗, and thus an example of a reduction
operation can be expressed as res = p1 ∗ p2 ∗ p3 ∗ p4. The expression
can be transformed to res = (p1 ∗ p2) ∗ (p3 ∗ p4) based on the
associative property. Therefore, we can calculate res1 = p1 ∗ p2 and
res2 = p3 ∗ p4 in parallel, and then calculate res = res1 ∗ res2. In
other words, for a reduction operation we can compute the partial
results in parallel and then reduce the partial results to derive the
final result.
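The regrouping argument above can be written as a pairwise reduction. This is a small illustrative sketch (the name `tree_reduce` is an assumption, and the input must be non-empty): each pass combines neighbouring partial results, so the passes are independent within themselves and their number grows with the logarithm of the input length.

```python
import operator


def tree_reduce(values, op):
    """Reduce pairwise: each pass halves the number of partial results."""
    partial = list(values)
    steps = 0
    while len(partial) > 1:
        nxt = [op(partial[i], partial[i + 1])
               for i in range(0, len(partial) - 1, 2)]
        if len(partial) % 2:            # an odd element is carried over
            nxt.append(partial[-1])
        partial = nxt
        steps += 1
    return partial[0], steps


# res = (p1 * p2) * (p3 * p4) with * chosen as addition
res, steps = tree_reduce([1, 2, 3, 4], operator.add)
```

The result equals the left-to-right reduction only because the operator is associative; for floating-point addition the regrouping can change rounding.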
[Figure 3: (a) The lambda expression and its data (mutable and
immutable) are parsed into the Code Seed, which the Info Producer uses
to fill the feature table. (b) Feature table: the rows are the
operations of the Code Seed (vop0, vop1, gather, ..., Add, scatter),
the columns ops1 ... opsn are the cycles, and the entries are info11
... infomn. (c) Code Optimizer and Data Transfer: a hash function
assigns a hash value to each column; columns with the same calculation
pattern are merged, then entries with the same write location are
merged; the optimization info (gatherINFO, AddINFO, scatterINFO) is
generated and rearranged, and the optimization pass produces
gatherOPT, AddOPT and scatterOPT, which are lowered through LLVM into
a deployable module.]
Figure 3. The design overview of Intelligent-Unroll, which includes
(a) information producer, (b) feature table and (c) code optimizer
and data transfer.
[Figure 4: two instruction groups (Data Array and Data Array2, each
with the access array I0 I1 I1 I0) write to the same write addresses
W0 and W1. (a) Each group performs its own reduction and scatter.
(b) The two data arrays are first combined by a vector operation, and
a single reduction and scatter follows.]
Figure 4. An example of merging instruction groups that write to the
same location: (a) before merging the instructions, (b) after merging
the instructions.
5.1 Generating Information
Instead of generating code according to the distribution of write
locations, we generate various reduction operators according to the
number of reduction operations required. On a SIMD architecture whose
vector length is N, we need at most log2(N) reduction instructions to
complete a SIMD reduction operator. We define a flag for the reduction
operator, which ranges over 0, 1, 2, ..., log2(N). For example, when
the flag of the reduction operator is M, we need M reduction
instructions to complete the SIMD reduction operator.
In addition to the flag, we also need other information. When the flag
is M, M vectors are required, each of dimension N with elements of bit
width log2(N). This information represents the source locations of the
data to be reduced. As shown in Figure 5(a), R0 and R3 target the same
write location (I0), as do R1 and R2 (I1), so R3 must be reduced with
R0 and R2 with R1. Therefore, the shuffle addresses are 3 and 2, and
R3 and R2 are moved to the first and second locations of the shuffle
data. We can then reduce the shuffle data with the rest of the data to
derive the final results. When the flag is N, we can also choose the
reduction operator supported by the architecture if it is available.
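One way to derive a flag and shuffle addresses from a vector of write indices is sketched below. The pairing scheme is an assumption made for illustration (the lanes sharing a write index are folded in halves), not necessarily the scheme used by Intelligent-Unroll, and the function names are hypothetical. For the access array of Figure 5(a) it reproduces the shuffle addresses 3 and 2.

```python
def reduction_info(write_idx):
    """Return (flag, perm_addrs) for one SIMD vector of write indices.

    perm_addrs[s][d] is the source lane shuffled onto lane d in step s,
    or None when lane d receives nothing in that step.
    """
    lanes = {}
    for lane, w in enumerate(write_idx):
        lanes.setdefault(w, []).append(lane)
    live = list(lanes.values())
    perm_addrs = []
    while any(len(g) > 1 for g in live):
        perm = [None] * len(write_idx)
        nxt = []
        for g in live:
            half = (len(g) + 1) // 2
            for k in range(len(g) - half):
                perm[g[k]] = g[half + k]   # fold the second half onto the first
            nxt.append(g[:half])
        live = nxt
        perm_addrs.append(perm)
    return len(perm_addrs), perm_addrs


def apply_reduction(values, write_idx, op):
    """Emulate the shuffle-and-reduce steps; return {write index: result}."""
    _, perm_addrs = reduction_info(write_idx)
    v = list(values)
    for perm in perm_addrs:
        shuffled = [v[s] if s is not None else None for s in perm]
        v = [op(a, b) if b is not None else a for a, b in zip(v, shuffled)]
    first_lane = {}
    for lane, w in enumerate(write_idx):
        first_lane.setdefault(w, lane)     # the result lives in the first lane
    return {w: v[lane] for w, lane in first_lane.items()}


# Access array I0 I1 I1 I0 of Figure 5(a), encoded as 10 20 20 10.
flag, perms = reduction_info([10, 20, 20, 10])
sums = apply_reduction([1, 2, 3, 4], [10, 20, 20, 10], lambda a, b: a + b)
```

With this folding, a vector in which all N lanes share one write index needs log2(N) steps when N is a power of two, and a vector without duplicates needs none, which is consistent with the range of the flag given above.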
5.2 Identifying Code Generation Pattern
The commonly used reduction operators include add and multiply. Other
operators such as subtraction and division can be transformed into add
or multiply reductions by negating the operand (or, for division,
taking its reciprocal).
The generated code seed does not consider write conflicts; the
optimization pass module processes them afterwards. Intelligent-Unroll
identifies the source instruction that provides the variable written
by each scatter instruction. If the operation type of the source
instruction belongs to the reduction operators, the reduction
processing module is activated to insert several reduction operations
before the scatter instruction. Intelligent-Unroll generates the
reduction instructions according to the information in the
corresponding column of the Feature Table.
[Figure 5: (a) Data array R0 R1 R2 R3 with access array I0 I1 I1 I0.
The permutation address 3 2 - - (2 bits per element) moves R3 and R2
into the permutation array, the reduction array holds C0 C1, and the
results are written to the write addresses W0 W1. (b) The original
code
  Op       Destination  Sources
  Add      Res          Res, v2
  Scatter  DesAddr      Res
becomes
  Op         Destination  Sources
  Reduction  v2Reduce     v2
  Add        Res          Res, v2Reduce
  Scatter    DesAddr      Res
where Add, Mult, ... belong to the reduction operators.]
Figure 5. An example of reduction operator (a) and, (b) corresponding
code generation pattern.
Table 1. The comparison of the instructions before and after the
optimization of reduction operator.
            Calculation  Reduction  Permutation
original    N            N          0
optimized   1            M          M
Table 2. The comparison of the data size before and after the
optimization of reduction operator.
            vload                                                vstore
            Write Index     Write Data     Additional Data      Write Data
original    N * Bit(Index)  N * Bit(Data)  –                    N * Bit(Data)
optimized   M * Bit(Index)  M * Bit(Data)  M * Bit(Info)        M * Bit(Data)
As shown in Figure 5(b), Res, which is the value written by the
Scatter instruction, is provided by an Add operation, which belongs to
the reduction operators. Triggered by this condition,
Intelligent-Unroll inserts a reduction operation before the Add
instruction and then redirects the result to the Add instruction,
which corresponds to operations 1 and 2 in Figure 5.
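The rewrite of Figure 5(b) can be sketched as a pass over a list of instructions. The tuple representation (op, destination, sources) and the name `insert_reductions` are assumptions for illustration; the operand names follow the figure.

```python
REDUCTION_OPS = {"Add", "Mult"}


def insert_reductions(code):
    """Insert a Reduction before each reduction op that feeds a Scatter.

    code is a list of (op, dest, sources) tuples in program order.
    """
    # Variables that some Scatter writes to memory.
    scattered = {s for op, _, srcs in code if op == "Scatter" for s in srcs}
    out = []
    for op, dest, srcs in code:
        if op in REDUCTION_OPS and dest in scattered:
            new_srcs = []
            for s in srcs:
                if s == dest:                 # running accumulator: unchanged
                    new_srcs.append(s)
                else:                         # incoming vector: reduce it first
                    reduced = s + "Reduce"
                    out.append(("Reduction", reduced, (s,)))
                    new_srcs.append(reduced)
            srcs = tuple(new_srcs)
        out.append((op, dest, srcs))
    return out


original = [
    ("Add", "Res", ("Res", "v2")),
    ("Scatter", "DesAddr", ("Res",)),
]
optimized = insert_reductions(original)
```

Instructions whose result never reaches a Scatter are left untouched, so the pass only pays for reductions where a write conflict is possible.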
5.3 Instruction and Memory Efficiency
Intelligent-Unroll generates optimized code for the original program.
Table 1 provides a comparison of the instructions before and after the
optimization. With Intelligent-Unroll, we can reduce the number of
calculations on the reduction data from N to 1, and the number of
reduction operations from N to M, where M is less than or equal to
log2(N). Although Intelligent-Unroll introduces additional operations
such as permutation, it can still accelerate the calculation process
if executing M shuffle operations is faster than the sum of (N − 1)
calculations and (N − M) reduction operations.
[Figure 6: (a) The access array D0 D4 D5 D1 gathers A E F B from the
data array A B C D E F G H. Two loads fetch A B C D and E F G H; the
permutation address 0 0 1 1 (2 bits per element) yields A A B B and
E E F F; the select mask 0 1 1 0 (1 bit per element) yields A E F B.
(b) The original code
  Op      Destination  Sources
  Gather  Res          ...
becomes
  Op           Destination  Sources
  Load         Load1        ...
  Permutation  Perm1        Load1
  Load         Load2        ...
  Permutation  Perm2        Load2
  Select       Res          Perm1, Perm2]
Figure 6. An example of gather operator (a) and, (b) corresponding
code generation pattern.
[Figure 7: bar chart. x axis: number of vload instructions (only 1
load, less than 2 load, ..., less than 8 load); y axis: ratio of
datasets (0.0 to 1.0); legend: 0-25%, 25%-50%, 50%-75%, 75%-100%.]
Figure 7. The distribution of gather instructions that can be replaced
by an instruction group of vload and permutation.
Intelligent-Unroll also changes the memory access pattern. From
Table 2 we can see that it avoids the redundant memory loads and
stores of the write data, whose size is (N − M) × Bit(Data). In
addition, Intelligent-Unroll also eliminates unnecessary loads of the
index of the write address, whose size is (N − M) × Bit(Index).
However, Intelligent-Unroll also introduces extra overhead. The
additional data used by the shuffle instructions is M × log2(N) bits.
Therefore, the performance of memory access can be optimized if the
size of the additional data is less than the sum of the write data
size after optimization.
6 Gather and Scatter Instructions
6.1 Understanding the Opportunity
Gather and scatter instructions are also frequently used in programs
on SIMD architectures. We observe that replacing the gather
instruction with a group of vload and permutation instructions
achieves better performance in several cases. A similar performance
improvement is also observed when replacing the scatter instruction
with a group of permutation and store instructions. Since the methods
of optimizing gather and scatter instructions are similar, we only
present the optimization method for the gather instruction in the
following.
Unlike the reduction instruction, the sparsity pattern of the data
affects the performance opportunity when optimizing the gather
operator. For instance, if the sparsity of the data is entirely
random, there is hardly a chance to achieve better performance.
Fortunately, most sparse data exhibit a regular distribution to some
extent. Figure 7 shows the percentage of sparse datasets that achieve
better performance when replacing the gather instructions with vload
instructions. The sparse datasets include 2,700 matrices from the
SuiteSparse Matrix Collection [11]. The x axis in the figure indicates
the number of vload instructions, and the y axis indicates the
percentage of all datasets. The legend of Figure 7 represents the
percentage of the gather instructions within the execution on a
particular dataset.
From Figure 7 we can see that the datasets in which more than 25% of
the gather instructions can be replaced by one vload instruction
account for 18.4% of all datasets, whereas 46.9% of the datasets
contain more than 25% of gather instructions that can be replaced with
no more than two vload instructions. Moreover, 55.0% of the datasets
contain more than 75% of gather instructions that can be replaced with
four vload instructions. It is clear that there is a large performance
space in optimizing the gather instructions of irregular applications
with sparse data.
6.2 Generating Information
Similar to the optimization of the reduction operator, we use a flag
to denote the number of vload instructions, and the largest value of
the flag is the vector length of the architecture. As with the
reduction operator, the optimization of gather instructions also needs
additional information; the bit width of each element in the address
vector and the length of the vector are the same as for the reduction
operator. The difference from the reduction operator is that we use
only one Permutation Address regardless of the value of the flag. To
determine which permutation instruction each element of the address
vector belongs to, we use additional mask vectors, whose number is
(flag − 1). Several begin addresses, whose number equals the flag, are
also required to guide the vload instructions.
Figure 6 gives an example of optimizing gather instructions.
Figure 6(a) is an example of the gather operator, where the vector
length is four and the bit width of each element of the shuffle vector
is two. In this example, we use two vload instructions to replace one
gather instruction. Therefore, the value of the flag is two, and the
number of mask vectors is one. First, we load the data ABCD and EFGH
into registers using the begin addresses D0 and D4. Then, based on the
Permutation Address, we obtain AABB and EEFF from ABCD and EFGH with
the permutation instruction. After that, we obtain AEFB from AABB and
EEFF with the mask 0110 using the select instruction.
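The example of Figure 6(a) can be reproduced with a short emulation. The sketch below rests on assumptions: begin addresses are chosen greedily at the smallest index not yet covered, the gather has exactly n lanes, loads past the end of the data are padded, and the function names are hypothetical. It derives the flag, the begin addresses, the single permutation address and the (flag − 1) masks, then replays the loads, permutations and selects.

```python
def gather_info(indices, n):
    """Return (flag, begins, perm, masks) for one gather with n lanes.

    begins: start address of each vload (n contiguous elements each)
    perm:   single permutation address, lane -> offset inside its window
    masks:  flag - 1 select masks; masks[k][lane] == 1 picks vload k + 1
    """
    begins = []
    for idx in sorted(set(indices)):
        if not begins or idx >= begins[-1] + n:
            begins.append(idx)                 # open a new window at idx
    # The window of a lane is the last one that starts at or before its index.
    window = [max(k for k, b in enumerate(begins) if b <= idx)
              for idx in indices]
    perm = [idx - begins[w] for idx, w in zip(indices, window)]
    masks = [[1 if w == k else 0 for w in window]
             for k in range(1, len(begins))]
    return len(begins), begins, perm, masks


def emulate_gather(data, indices, n):
    """Replace the gather by vload + permutation + select and return the lanes."""
    _, begins, perm, masks = gather_info(indices, n)
    loads = []
    for b in begins:                                     # vload
        vec = list(data[b:b + n])
        loads.append(vec + [None] * (n - len(vec)))      # pad past the end
    permuted = [[vec[p] for p in perm] for vec in loads]  # permutation
    result = permuted[0]
    for mask, vec in zip(masks, permuted[1:]):           # select
        result = [v if m else r for r, v, m in zip(result, vec, mask)]
    return result


data = list("ABCDEFGH")
info = gather_info([0, 4, 5, 1], 4)       # access array D0 D4 D5 D1
lanes = emulate_gather(data, [0, 4, 5, 1], 4)
```

For the access array D0 D4 D5 D1 this yields the begin addresses 0 and 4, the permutation address 0 0 1 1 and the mask 0 1 1 0, the same values as in the figure. The flag returned here is the quantity a cost check would compare against the vector length before deciding to replace the gather.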
6.3 Identifying Code Generation Pattern
To optimize the gather instructions, we replace the gather
instructions with vload, permutation and select instructions.
When scanning the code, we consult the column of feature
Table 3. e comparison of the data size before and aer
the optimization of gather operator.
Index Data Additional Info
original N * Bit(Index) N * Bit(Data) –
optimized M * Bit(Index) M * N * Bit(Data) N ∗ loд2N + (M − 1) ∗ N
Algorithm 2 e code snippet of SpMV in CSR format
1: for i ← 0, m do
2: for j ← row ptr [i], row ptr [i + 1] do
3: y[i] ← y[i] + value[j] × x [col ptr [j]]
Algorithm 3 The code snippet of PageRank
for i ← 0, nedges do
    sum[n2[i]] ← sum[n2[i]] + rank[n1[i]] / nneighbor[n1[i]]
table to determine whether there is a performance benefit in
replacing the gather instruction with the instruction group
(e.g., vload, permutation and select). Then, Intelligent-Unroll
performs the code transformation to generate the optimized code.
Figure 6(b) shows an example of the code generation for the gather
operator. An instruction group consisting of multiple vload,
permutation and select instructions is used to replace the original
gather instruction. If the flag value equals one, only vload and
permutation instructions are required.
6.4 Memory and Efficiency
As shown in Table 3, after our optimization of the gather operator,
the number of index entries that no longer need to be loaded is
N − M. However, our optimization introduces (M − 1) × N extra data
elements to be loaded, as well as N × log2 N + (M − 1) × N bits to
record the additional information. In addition to the memory load
overhead, our optimization also requires M instruction groups of
vload, permutation and select instructions.
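To make the Table 3 formulas concrete, the sketch below evaluates them for one configuration. The helper name `gather_footprint_bits` and the example sizes (N = 8, M = 2, 32-bit indices, 64-bit data) are assumptions chosen for illustration, not values taken from the evaluation.

```python
def gather_footprint_bits(n, m, index_bits, data_bits):
    """Evaluate the Table 3 sizes (in bits) for the original and optimized gather.

    n is the vector length N (assumed to be a power of two so log2 N is exact),
    m is the number of vload groups M.
    """
    if n <= 0 or n & (n - 1):
        raise ValueError("n must be a power of two")
    log2_n = n.bit_length() - 1
    original = {"index": n * index_bits, "data": n * data_bits, "additional": 0}
    optimized = {
        "index": m * index_bits,
        "data": m * n * data_bits,
        "additional": n * log2_n + (m - 1) * n,
    }
    return original, optimized


original, optimized = gather_footprint_bits(n=8, m=2, index_bits=32, data_bits=64)
# Index bits no longer loaded: (N - M) * Bit(Index)
index_saved = original["index"] - optimized["index"]
# Extra data bits loaded: (M - 1) * N * Bit(Data)
extra_data = optimized["data"] - original["data"]
```

For this configuration the index traffic drops by 192 bits, the additional information costs 32 bits, and the extra data loaded amounts to 512 bits.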
Fortunately, on the cache hierarchy of modern processors,
the number of cache lines consumed by our method is the
same as for the original gather instruction. In addition, the size
of the extra data introduced by our method is always smaller
than the size of index data eliminated. Since our method
is effective only when the performance improvement from the
optimized gather operator outweighs the overhead due to
the extra data, we apply the optimization only when the flags
indicate there are performance benefits.
7 Evaluation
7.1 Experiment Setup
We evaluate Intelligent-Unroll on two representative benchmarks,
Sparse Matrix-Vector Multiply (SpMV) and PageRank.
The code snippets of SpMV and PageRank are shown in Algorithm 2
and Algorithm 3 respectively. We choose these two benchmarks due
to their unique memory and calculation patterns. From Algorithm 2,
we can see that SpMV always writes to the same memory location.
Whereas
Table 4. The platform and benchmark evaluation approach. All experiments are done with a single thread.

Platform: Intel Phi 7210 (64 cores @ 1.3 GHz, 2.66 DP TFlops, 16 GB MCDRAM, 400 GB/s bandwidth, 384 GB DDR4, 102.4 Gbit/s bandwidth)
  SpMV evaluation approach:
    (1) The CSR-based SpMV compiled by ICC.
    (2) The CSR-based SpMV with Intel MKL version 2019 Update 3.
    (3) CSR5-based SpMV [21].
    (4) The code generated by Intelligent-Unroll.
  PageRank evaluation approach:
    (1) PageRank compiled by ICC.
    (2) The method proposed by Peng Jiang [14].
    (3) The code generated by Intelligent-Unroll.

Platform: Intel Xeon E5-2620 v3 (6 cores @ 2.4 GHz, 230.40 DP GFlops, 4 × DDR4, 59 GB/s bandwidth)
  SpMV evaluation approach:
    (1) The CSR-based SpMV compiled by ICC.
    (2) The CSR-based SpMV with Intel MKL version 2019 Update 3.
    (3) CSR5-based SpMV [21].
    (4) The code generated by Intelligent-Unroll.
  PageRank evaluation approach:
    (1) PageRank compiled by ICC.
    (2) The code generated by Intelligent-Unroll.
Table 5. The datasets used by SpMV and PageRank.
Benchmark  Dataset        row×col      nnz    nnz/row
SpMV       Dense          2K×2K        4.0M   2K
SpMV       FEM Ship       141K×141K    7.8M   55
SpMV       dc2            117K×117K    766K   7
SpMV       mip1           66K×66K      10.4M  155
SpMV       Webbase1M      1M×1M        3.1M   3
SpMV       Wind Tunnel    218K×218K    11.6M  53
SpMV       CirCuit        171K×171K    959K   5
SpMV       QCD            49K×49K      1.9M   39
PageRank   amazon0312     401K×401K    3.2M   8K
PageRank   higgs-twitter  457K×451K    15M    33K
PageRank   soc-pokec      1.6M×1.6M    31M    19.3
for PageRank in Algorithm 3, it exhibits a random memory
write pattern. In addition, the calculation pattern of SpMV
is represented by explicit reduction operations, whereas the
reduction operations in PageRank are implicit.
The experiment platforms are an Intel Xeon Phi CPU (KNL)
and an Intel Xeon CPU. The details of the platforms and evaluation
approaches are shown in Table 4. The CPU machine is
installed with 64-bit Ubuntu v16.04, whereas the KNL machine
is installed with CentOS 7.4. icc v19.0.3 and LLVM
v8.0.0 are installed on both machines. For SpMV, we compare
to the implementations using CSR5 [21] and MKL in
addition to the default compiler optimization. For PageRank,
we compare to the implementation using the conflict-free
method [14] on KNL in addition to the default compiler
optimization. We omit the results of the conflict-free method on
CPU since it does not support the CPU architecture. The default
compiler optimization of SpMV and PageRank uses icc (-O3
-Xhost), which serves as our baseline. For each run, we execute
the benchmark 1,000 times and measure the average
execution time. Every experiment is repeated 10 times
and the best result is reported.
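The measurement protocol described above (average over 1,000 executions, best of 10 repetitions) can be sketched as follows. The function name `best_of_averages` and its parameters are hypothetical; the sketch illustrates the protocol only and is not the harness used for the reported numbers.

```python
import time


def best_of_averages(kernel, inner_runs=1000, outer_repeats=10):
    """Return the best (smallest) average execution time in seconds.

    Each repetition executes the kernel inner_runs times and records the
    average time per execution; the minimum over all repetitions is returned.
    """
    best = float("inf")
    for _ in range(outer_repeats):
        start = time.perf_counter()
        for _ in range(inner_runs):
            kernel()
        average = (time.perf_counter() - start) / inner_runs
        best = min(best, average)
    return best


# Usage with a stand-in kernel; a real run would call the benchmark instead.
elapsed = best_of_averages(lambda: sum(range(100)), inner_runs=100, outer_repeats=3)
```

Reporting the best of several averaged runs reduces the influence of transient system noise on the comparison.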
We select eight datasets from the University of Florida
Sparse Matrix Collection to evaluate SpMV. The datasets
include regular matrices such as Dense and QCD, as well
as irregular matrices such as mip1 and Webbase-1M. The
datasets for evaluating PageRank are adopted from [14]. The
details of the evaluation datasets are shown in Table 5.
7.2 Performance Opportunity Analysis
In Table 6, we present the percentage of gather/scatter/reduction
instructions that can be replaced by load/store (L/S) and
vector (Op) instructions for the two benchmarks under different
datasets. The second column in Table 6 indicates the
Algorithm 4 The PageRank defined in Intelligent-Unroll
1: input :
2: int * n1, int * n2, double * rank, double * nneighbor
3: output :
4: double * sum
5: lambda i :
6: sum[n2[i]] ← sum[n2[i]] + rank[n1[i]] × nneighbor[n1[i]]
number of load/store/vector instructions that should be used
to replace the original gather/scatter/reduction instruction.
We do not include the results of the scatter instruction in
SpMV, since they can be optimized by the static analysis
of the compiler. A higher value of L/S means a higher cost
of replacing the gather/scatter instruction. Op = 0
means all reduction instructions can be replaced with vector
instructions, whereas Op = 3 means using the reduction
instruction supported by the underlying architecture achieves
better performance.
SpMV running on the Dense dataset illustrates a perfect
case for instruction optimization in Table 6, where each
of its gather instructions can be replaced with only one load
instruction. In addition, we can optimize the reduction
operations (Op = 3) with the reduction instruction provided by
the underlying architecture. There are also some cases where
there is hardly any performance opportunity with
Intelligent-Unroll, such as Webbase-1M and CirCuit, whose percentage
with L/S = 1 is less than 6%. Compared to SpMV, the datasets of
PageRank are more irregular. With L/S = 1, the percentage of
replaceable instructions is less than 51%. And even with L/S = 8, the
percentage is no more than 44.8% (e.g., higgs-twitter dataset).
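The L/S distribution reported in Table 6 can be approximated for any index array with a short script. This is a sketch under stated assumptions: the helper names `loads_needed` and `ls_histogram` are hypothetical, indices are grouped in consecutive chunks of the vector length, and a load window is assumed to start at the smallest index not yet covered.

```python
from collections import Counter


def loads_needed(indices, vlen):
    """Count the contiguous vlen-wide loads that cover one group of gather indices."""
    count, window_end = 0, None
    for idx in sorted(set(indices)):
        if window_end is None or idx >= window_end:
            count += 1
            window_end = idx + vlen
    return count


def ls_histogram(index_array, vlen):
    """Percentage of vlen-sized gather groups for each L/S value."""
    groups = [index_array[i:i + vlen] for i in range(0, len(index_array), vlen)]
    if not groups:
        return {}
    counts = Counter(loads_needed(group, vlen) for group in groups)
    return {ls: 100.0 * c / len(groups) for ls, c in sorted(counts.items())}


# A dense row has consecutive column indices, so every group needs one load.
dense_like = ls_histogram(list(range(16)), vlen=8)
# Widely scattered indices need one load per element.
scattered = ls_histogram([0, 100, 200, 300, 400, 500, 600, 700], vlen=8)
```

Under these assumptions a dense matrix yields 100% at L/S = 1, matching the Dense column of Table 6, while fully scattered indices fall into L/S = 8.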
7.3 PageRank
The code snippet of PageRank shown in Algorithm 3 can be
defined using Intelligent-Unroll as Algorithm 4. The keywords
input (lines 1-2) and output (lines 3-4) define the inputs and
outputs of the PageRank algorithm respectively. The lambda
expression specifies the calculation details (lines 5-6). Based
on Algorithm 4, we can generate an implementation from
Intelligent-Unroll for the PageRank algorithm.
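As a reference for what the lambda in Algorithm 4 computes, the sketch below applies it once per edge index in plain scalar Python. The function name `pagerank_accumulate` and the `num_nodes` parameter are hypothetical; the multiplication follows Algorithm 4 as written.

```python
def pagerank_accumulate(n1, n2, rank, nneighbor, num_nodes):
    """Scalar reference of the Algorithm 4 lambda, applied to every edge i.

    n1[i] and n2[i] are the source and destination nodes of edge i.
    """
    total = [0.0] * num_nodes
    for i in range(len(n1)):
        # sum[n2[i]] <- sum[n2[i]] + rank[n1[i]] x nneighbor[n1[i]]
        total[n2[i]] += rank[n1[i]] * nneighbor[n1[i]]
    return total


# Three edges: 0 -> 1, 0 -> 2, 1 -> 2. Node 2 is written twice, which is the
# implicit reduction that a vectorized version has to handle.
result = pagerank_accumulate(
    n1=[0, 0, 1],
    n2=[1, 2, 2],
    rank=[0.5, 0.25, 0.25],
    nneighbor=[2.0, 4.0, 1.0],
    num_nodes=3,
)
```

Any vectorized variant can be checked against this loop, since it fixes the result that the random writes through n2 must produce.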
Table 7 shows the performance comparison of PageRank
implemented using the methods of Intelligent-Unroll,
conflict-free and default compiler optimization on KNL and
CPU. We can see that the implementation optimized by
our method achieves better performance across almost all
datasets on both KNL and CPU. Our method improves the
Table 6. The percentage of the gather/scatter/reduction instructions that can be optimized by the load/store operation
instructions for both SpMV and PageRank benchmarks across different datasets. The results are evaluated on the CPU processor
with vector length of 8.
Benchmark: SpMV (Dense through QCD), PageRank (amazon0312 through soc-pokec)
Dataset    Dense  FEM Ship  dc2    mip1   Webbase1M  Wind Tunnel  CirCuit  QCD    amazon0312  higgs-twitter  soc-pokec
Gather & Scatter
L/S = 1    100%   15.1%     14.8%  92.5%  5.6%       61.7%        2.6%     40.3%  50.2%       50.9%          50.2%
L/S = 2    0%     84.9%     9.4%   1.5%   53.0%      37.8%        22.5%    45.8%  1.3%        0%             0.5%
L/S = 3    0%     0%        15.7%  1.2%   18.1%      0.5%         28.9%    13.9%  5.0%        0%             0.9%
L/S = 4    0%     0%        23.6%  1.3%   11.0%      0%           22.9%    0%     10.0%       0.1%           1.5%
L/S = 5    0%     0%        20.7%  1.7%   5.4%       0%           14.6%    0%     12.1%       0.2%           2.4%
L/S = 6    0%     0%        10.1%  1.3%   3.0%       0%           6.5%     0%     11.1%       0.7%           4.1%
L/S = 7    0%     0%        4.4%   0.1%   1.6%       0%           1.6%     0%     7.2%        4.2%           9.0%
L/S = 8    0%     0%        1.3%   0.4%   2.3%       0%           0.4%     0%     3.1%        44.8%          31.4%
Reduction
Op = 0     0%     0%        6.5%   0.6%   31.8%      1.0%         18.4%    2.6%   92.0%       100%           100%
Op = 1     0%     2.6%      5.8%   0.7%   32.5%      2.8%         18.4%    2.6%   8.0%        0%             0%
Op = 2     0%     4.7%      9.8%   1.3%   8.1%       3.7%         32.5%    5.1%   0%          0%             0%
Op = 3     100%   92.7%     77.9%  97.4%  27.6%      92.5%        30.7%    89.7%  0%          0%             0%
Table 7. The performance of PageRank across different datasets on KNL and CPU.
[Bar charts, one panel per dataset (amazon0312, higgs-twitter, soc-pokec). Y-axis: GFlops (0 to 1); x-axis: KNL and CPU; series: ICC, Conflict-Free, Our Method.]
performance of PageRank by 4.8% on average (11.6% at maximum)
compared to the baseline on CPU, and by 30.2% and
146.0% on average (68.5% and 158.8% at maximum) compared
to the conflict-free and baseline methods respectively
on KNL.
On KNL, with higgs-twitter and soc-pokec datasets, our
method achieves similar performance to the conflict-free method,
both of which are better than the baseline. This is because icc
cannot optimize PageRank due to the potential write conflicts.
However, on the amazon0312 dataset, our method outperforms
the rest by a large margin. This is because the percentage of
instructions with Op = 1 on the amazon0312 dataset is more than
8%, and these instructions are randomly distributed during the
execution. Such random distribution degrades the effectiveness of
branch prediction in the conflict-free method. In Intelligent-Unroll,
however, the code is directly generated for each branch condition
without predicting during runtime. Therefore, our method
outperforms the conflict-free method on the amazon0312 dataset.
On CPU, with amazon0312 and higgs-twitter datasets, the
performance using Intelligent-Unroll is better than the default
compiler optimization. However, on the soc-pokec dataset, the
code generated by Intelligent-Unroll is slower than the code
optimized by icc. This is because the vector length is quite
limited (e.g., 8 in single precision) on CPU, which offsets the
performance benefit of replacing the reduction instructions
with vector instructions.
The reason why our method achieves better performance
than the conflict-free method on KNL is twofold:
1) our method generates the code for each data
access and instruction pattern of PageRank. Therefore, it
Algorithm 5 The SpMV defined in Intelligent-Unroll
1: input :
2: int * row_ptr, int * col_ptr, double * x, double * value
3: output :
4: double * y
5: lambda i :
6: y[row_ptr[i]] ← y[row_ptr[i]] + value[i] × x[col_ptr[i]]
avoids pattern prediction during runtime and thus improves
performance; 2) in addition to using SIMD instructions, our
method also replaces the gather/scatter instructions with
load/store instructions. As shown in Table 6, the percentage
of instructions with L/S = 1 across all PageRank datasets is larger
than 50%, which presents significant opportunity for performance
improvement of PageRank.
7.4 SpMV
The baseline SpMV implementation uses CSR format,
because it decreases memory usage and provides more opportunity
for compiler optimization. However, in Intelligent-Unroll,
we use COO instead of CSR, which fits well with
our optimization method. Algorithm 5 defines SpMV using
Intelligent-Unroll (lines 5-6). We can see that the definition
using Intelligent-Unroll is more concise than the original
definition in Algorithm 2. Intelligent-Unroll automatically
optimizes the data access and instructions instead of relying
on manual optimization. After defining the calculation,
users only need to specify the input and output (lines 1-4) in
Intelligent-Unroll. Table 8 shows the performance comparison
among the methods using default compiler optimization,
MKL, CSR5 and our method on both CPU and KNL.
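The relationship between the CSR loop of Algorithm 2 and the COO-style lambda of Algorithm 5 can be checked with a small scalar sketch. The function names are hypothetical, and the sketch assumes that in the COO form the row array holds one row index per nonzero, which is how Algorithm 5 indexes it.

```python
def spmv_csr(row_ptr, col_ptr, value, x):
    """Algorithm 2: CSR SpMV, where row_ptr holds the start offset of each row."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(row_ptr) - 1):
        for j in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += value[j] * x[col_ptr[j]]
    return y


def csr_rows_to_coo(row_ptr):
    """Expand CSR row offsets into one row index per nonzero (COO rows)."""
    rows = []
    for i in range(len(row_ptr) - 1):
        rows.extend([i] * (row_ptr[i + 1] - row_ptr[i]))
    return rows


def spmv_coo(row_idx, col_ptr, value, x, num_rows):
    """Algorithm 5 style: a single loop over nonzeros with an indexed write to y."""
    y = [0.0] * num_rows
    for i in range(len(value)):
        y[row_idx[i]] += value[i] * x[col_ptr[i]]
    return y


# 3x3 example matrix [[1, 0, 2], [0, 3, 0], [4, 0, 5]] multiplied by x = [1, 2, 3].
row_ptr = [0, 2, 3, 5]
col_ptr = [0, 2, 1, 0, 2]
value = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [1.0, 2.0, 3.0]
y_csr = spmv_csr(row_ptr, col_ptr, value, x)
y_coo = spmv_coo(csr_rows_to_coo(row_ptr), col_ptr, value, x, num_rows=3)
```

The single flat loop in the COO form exposes every nonzero to the same lambda, at the cost of storing one row index per nonzero instead of one offset per row.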
Table 8. The performance of SpMV across different datasets on KNL and CPU.
[Bar charts, one panel per dataset (Dense, FEM Ship, dc2, mip1, Webbase1M, Wind Tunnel, CirCuit, QCD). Y-axis: GFlops (0.0 to 2.0); x-axis: KNL and CPU; series: ICC, MKL, CSR5, Our Method.]
On KNL, our method achieves the best performance on Dense
and mip1 datasets, whereas on FEM Ship, Wind Tunnel and
QCD datasets, the MKL implementation achieves the best
performance. The CSR5 implementation achieves the best
performance on dc2 and CirCuit datasets. The reason why our
method achieves better performance than other methods
is similar to the PageRank benchmark: our method is able to
avoid branch prediction during runtime and improve the memory
accesses with load/store instructions. However, on the datasets
where the MKL implementation is better, the reason can be
attributed to the split of the writes to the same memory location
across different calculation patterns in our method, which
increases the load/write instructions to the output vector y.
On datasets where CSR5 achieves the best performance, it is
because the data structure of the input matrices is friendly to
the CSR5 format and the corresponding calculation pattern, which
has not been integrated in the optimization pass of
Intelligent-Unroll yet.
On CPU, our method achieves the best performance on
Dense, mip1, Wind Tunnel and CirCuit datasets. On the
datasets where our method fails to achieve the best
performance, the reason can be attributed to the limited vector
length on CPU: the small number of conditions diminishes the
advantage that Intelligent-Unroll gains by avoiding branch
prediction during runtime. In sum, compared to the baseline,
MKL and CSR5, our method improves the performance of SpMV
by 54.8%, 24.9% and 35.7% on average (151.0%, 116.9% and
112.0% at maximum) respectively on KNL, and by 35.9%, 10.1%
and 40.5% on average (68.2%, 48.3% and 72.5% at maximum) on CPU.
8 Related Work
Designing efficient sparse data formats - Many sparse
data formats have been proposed, targeting different sparsity
patterns as well as architecture diversity. For instance,
block-based formats are widely adopted due to their cache-friendly
design [2, 4, 5]. CSR5 [21] and CVR [29] proposed new
sparse data formats for SpMV, which focus on optimizing
the instruction parallelism and load balance. Liu et al. [22]
proposed ELLPACK to accelerate the SpMV kernel on the Intel KNL
processor. Choi et al. [8] proposed to use small sub-blocks,
each of which is represented as a dense matrix, to optimize
SpMV on GPUs.
Improving the temporal and spatial data reuse - Since
there are many sparse data formats available, determining
the appropriate sparse format for an irregular application is
not trivial. Friese et al. [9] and Xie et al. [30] proposed
different performance models to determine the optimal sparse
data format. In essence, their works optimize irregular
applications by improving the temporal and spatial data reuse
with the appropriate sparse format. There are also
many works exploring optimizations on distributed
memory architectures [3, 10]. Several loop unrolling strategies
[15, 23, 27] have been proposed in the literature. However, these
works mainly focus on selecting the optimal tile size and unroll
factor when unrolling the loop, and fail to exploit the
performance opportunity of optimizing the instructions.
Optimizing parallelization strategies - Different
parallelization strategies were proposed when optimizing
irregular applications on specific architectures [16, 18]. Jiang
et al. [14] optimized irregular applications by parallelizing
the computation using the powerful SIMD units. Buono
et al. [6] proposed optimizations of sparse linear algebra
tailored for large-scale graph analytics. Kulkarni et al. [17]
proposed a tool called ParaMeter to profile parallelism
information of irregular programs.
9 Conclusion
In this paper, we address the limitation of traditional compilers,
which are unable to exploit the performance opportunity for
optimizing irregular applications due to their static analysis.
We propose Intelligent-Unroll, which identifies the regular
patterns within irregular applications and automatically
optimizes the data access and instructions to generate more
efficient code. The experimental results with
representative benchmarks on both CPU and KNL processors
demonstrate the effectiveness of our approach in optimizing
irregular applications for better performance compared
to the state-of-the-art implementations.
References
[1] Andrew Anderson, Avinash Malik, and David Gregg. 2016. Auto-
matic vectorization of interleaved data revisited. ACM Transactions
on Architecture and Code Optimization (TACO) 12, 4 (2016), 50.
[2] Arash Ashari, Naser Sedaghati, John Eisenlohr, and P Sadayappan.
2014. An efficient two-dimensional blocking strategy for sparse
matrix-vector multiplication on GPUs. In Proceedings of the 28th ACM
international conference on Supercomputing. ACM, 273–282.
[3] Ayon Basumallik and Rudolf Eigenmann. 2006. Optimizing irregular
shared-memory applications for distributed-memory systems. In Pro-
ceedings of the eleventh ACM SIGPLAN symposium on Principles and
practice of parallel programming. ACM, 119–128.
[4] Aydin Buluç, Jeremy T Fineman, Matteo Frigo, John R Gilbert, and
Charles E Leiserson. 2009. Parallel sparse matrix-vector and
matrix-transpose-vector multiplication using compressed sparse blocks. In
Proceedings of the twenty-first annual symposium on Parallelism in
algorithms and architectures. ACM, 233–244.
[5] Aydin Buluc, Samuel Williams, Leonid Oliker, and James Demmel.
2011. Reduced-bandwidth multithreaded algorithms for sparse matrix-
vector multiplication. In 2011 IEEE International Parallel & Distributed
Processing Symposium. IEEE, 721–733.
[6] Daniele Buono, John A Gunnels, Xinyu Que, Fabio Checconi, Fabrizio
Petrini, Tai-Ching Tuan, and Chris Long. 2015. Optimizing sparse
linear algebra for large-scale graph analytics. Computer 48, 8 (2015),
26–34.
[7] Tianqi Chen, Thierry Moreau, Ziheng Jiang, Haichen Shen, Eddie Q
Yan, Leyuan Wang, Yuwei Hu, Luis Ceze, Carlos Guestrin, and Arvind
Krishnamurthy. 2018. TVM: end-to-end optimization stack for deep
learning. arXiv preprint arXiv:1802.04799 (2018), 1–15.
[8] Jee W Choi, Amik Singh, and Richard W Vuduc. 2010. Model-driven
autotuning of sparse matrix-vector multiply on GPUs. In ACM sigplan
notices, Vol. 45. ACM, 115–126.
[9] Luca Daniel, Ong Chin Siong, Low Sok Chay, Kwok Hong Lee, and
Jacob White. 2004. A multiparameter moment-matching model-
reduction approach for generating geometrically parameterized inter-
connect performance models. IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems 23, 5 (2004), 678–693.
[10] Raja Das, Mustafa Uysal, Joel Saltz, and Yuan-Shin Hwang. 1994.
Communication optimizations for irregular scientific computations
on distributed memory architectures. Journal of parallel and distributed
computing 22, 3 (1994), 462–478.
[11] Timothy A Davis and Yifan Hu. 2011. The University of Florida sparse
matrix collection. ACM Transactions on Mathematical Software (TOMS)
38, 1 (2011), 1.
[12] Zachary DeVito, Niels Joubert, Francisco Palacios, Stephen Oakley,
Montserrat Medina, Mike Barrientos, Erich Elsen, Frank Ham, Alex
Aiken, Karthik Duraisamy, et al. 2011. Liszt: a domain specific
language for building portable mesh-based PDE solvers. In Proceedings
of 2011 International Conference for High Performance Computing,
Networking, Storage and Analysis. ACM, 9.
[13] Chen Ding and Ken Kennedy. 1999. Improving cache performance in
dynamic applications through data and computation reorganization
at run time. In ACM SIGPLAN Notices, Vol. 34. ACM, 229–241.
[14] Peng Jiang and Gagan Agrawal. 2018. Conflict-free vectorization
of associative irregular applications with recent SIMD architectural
advances. In Proceedings of the 2018 International Symposium on Code
Generation and Optimization. ACM, 175–187.
[15] Toru Kisuki, Peter MW Knijnenburg, and Michael FP O’Boyle. 2000.
Combined selection of tile sizes and unroll factors using iterative
compilation. In Proceedings 2000 International Conference on Parallel
Architectures and Compilation Techniques (Cat. No. PR00622). IEEE,
237–246.
[16] Milind Kulkarni, Martin Burtscher, Calin Casçaval, and Keshav
Pingali. 2009. Lonestar: A suite of parallel irregular programs. In 2009
IEEE International Symposium on Performance Analysis of Systems and
Software. IEEE, 65–76.
[17] Milind Kulkarni, Martin Burtscher, Rajeshkar Inkulu, Keshav Pingali,
and Calin Casçaval. 2009. How much parallelism is there in irregular
applications?. In ACM sigplan notices, Vol. 44. ACM, 3–14.
[18] Milind Kulkarni, Patrick Carribault, Keshav Pingali, Ganesh Rama-
narayanan, Bruce Walter, Kavita Bala, and L Paul Chew. 2008. Sched-
uling strategies for optimistic parallel execution of irregular programs.
In Proceedings of the twentieth annual symposium on Parallelism in
algorithms and architectures. ACM, 217–228.
[19] Chris Lattner and Vikram Adve. 2004. LLVM: A compilation
framework for lifelong program analysis & transformation. In Proceedings
of the international symposium on Code generation and optimization:
feedback-directed and runtime optimization. IEEE Computer Society,
75.
[20] Baoyuan Liu, Min Wang, Hassan Foroosh, Marshall Tappen, and
Marianna Pensky. 2015. Sparse Convolutional Neural Networks. In The
IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[21] Weifeng Liu and Brian Vinter. 2015. CSR5: An efficient storage format
for cross-platform sparse matrix-vector multiplication. In Proceedings
of the 29th ACM on International Conference on Supercomputing. ACM,
339–350.
[22] Xing Liu, Mikhail Smelyanskiy, Edmond Chow, and Pradeep Dubey.
2013. Efficient sparse matrix-vector multiplication on x86-based
many-core processors. In Proceedings of the 27th international ACM
conference on International conference on supercomputing. ACM, 273–282.
[23] John Mellor-Crummey and John Garvin. 2004. Optimizing sparse
matrix–vector product computations using unroll and jam. The
International Journal of High Performance Computing Applications 18, 2
(2004), 225–236.
[24] Dorit Nuzman, Ira Rosen, and Ayal Zaks. 2006. Auto-vectorization
of interleaved data for SIMD. ACM SIGPLAN Notices 41, 6 (2006),
132–143.
[25] Jongsoo Park, Sheng Li, Wei Wen, Ping Tak Peter Tang, Hai Li, Yi-
ran Chen, and Pradeep Dubey. 2016. Faster cnns with direct sparse
convolutions and guided pruning. arXiv preprint arXiv:1608.01409
(2016).
[26] Jonathan Ragan-Kelley, Connelly Barnes, Andrew Adams, Sylvain
Paris, Frédo Durand, and Saman Amarasinghe. 2013. Halide: a
language and compiler for optimizing parallelism, locality, and
recomputation in image processing pipelines. In ACM SIGPLAN Notices, Vol. 48.
ACM, 519–530.
[27] Mark Stephenson and Saman Amarasinghe. 2005. Predicting unroll
factors using supervised classification. In Proceedings of the
international symposium on Code generation and optimization. IEEE Computer
Society, 123–134.
[28] Nicolas Vasilache, Oleksandr Zinenko, Theodoros Theodoridis, Priya
Goyal, Zachary DeVito, William S Moses, Sven Verdoolaege, Andrew
Adams, and Albert Cohen. 2018. Tensor comprehensions:
Framework-agnostic high-performance machine learning abstractions. arXiv
preprint arXiv:1802.04730 (2018).
[29] Biwei Xie, Jianfeng Zhan, Xu Liu, Wanling Gao, Zhen Jia, Xiwen He,
and Lixin Zhang. 2018. CVR: Efficient vectorization of SpMV on x86
processors. In Proceedings of the 2018 International Symposium on Code
Generation and Optimization. ACM, 149–162.
[30] Zhen Xie, Guangming Tan, Weifeng Liu, and Ninghui Sun. 2019. IA-
SpGEMM: an input-aware auto-tuning framework for parallel sparse
matrix-matrix multiplication. In Proceedings of the ACM International
Conference on Supercomputing. ACM, 94–105.
