Hardware-aware graph coloring techniques for efficient sparse linear algebra by Thies, Jonas & Alappat, Christie Louis
Jonas Thies (German Aerospace Center) 
Christie Louis Alappat, Georg Hager and Gerhard Wellein 
(University of Erlangen-Nuremberg) 
 















• ARM processor (energy efficient) 
• 48 cores 
• about 1 TB/s memory bandwidth 
• NUMA 
• 512 bit SIMD 
 





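#pragma omp parallel for 
for(int row=0; row<nrows; ++row) 
{ 
    double temp=0;                      // per-row accumulator, private to each thread 
    for (int j=rowptr[row]; j<rowptr[row+1]; ++j) 
    { 
        temp += A[j] * x[col[j]];       // x is read indirectly through col[j] 
    } 
    b[row] += temp;                     // each row writes only its own b[row] 
} 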
Irregular data accesses 









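// Symmetric SpMV: only the upper triangle of A is stored. 
// Note: if the diagonal entry is stored in the row, it must be handled 
// separately, otherwise it is added to b[row] twice. 
for(int row=0; row<nrows; ++row) 
{ 
    double temp=0; 
    for (int j=rowptr[row]; j<rowptr[row+1]; ++j) 
    { 
        temp += A[j] * x[col[j]];       // upper-triangle contribution to b[row] 
        b[col[j]] += A[j] * x[row];     // mirrored contribution: scattered write to b 
    } 
    b[row] += temp; 
} 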











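// Gauss-Seidel sweep; assumes the diagonal entry is stored first in each row. 
for(int row=0; row<nrows; ++row) 
{ 
    double temp=b[row]; 
    for (int j=rowptr[row]+1; j<rowptr[row+1]; ++j) 
    { 
        temp -= A[j] * x[col[j]];       // reads x entries updated earlier in this sweep 
    } 
    double diag = A[rowptr[row]]; 
    x[row]  = temp/diag; 
} 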




Sparse Matrix Operations – Data Dependencies 
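The dependency in the symmetric kernel can be made concrete with a small check. The sketch below assumes the same CRS arrays as the kernels above (`rowptr`, `col`, upper triangle only); the helper names `writes_to` and `rows_conflict` are hypothetical. Two rows conflict when they update a common entry of `b`, which can only happen if the rows are at most two edges apart in the matrix graph, so distance-2 independent rows are safe to run concurrently.

```c
/* Does processing `row` in the symmetric kernel write to b[target]? */
static int writes_to(const int *rowptr, const int *col, int row, int target)
{
    if (target == row) return 1;                    /* b[row] += temp */
    for (int j = rowptr[row]; j < rowptr[row + 1]; ++j)
        if (col[j] == target) return 1;             /* b[col[j]] += ... */
    return 0;
}

/* Two rows conflict if their sets of written b entries overlap. */
static int rows_conflict(const int *rowptr, const int *col, int r1, int r2)
{
    if (writes_to(rowptr, col, r1, r2)) return 1;
    for (int j = rowptr[r2]; j < rowptr[r2 + 1]; ++j)
        if (writes_to(rowptr, col, r1, col[j])) return 1;
    return 0;
}
```

For the upper triangle of a 4x4 tridiagonal matrix, rows 0 and 1 conflict (both update `b[1]`), while rows 0 and 2 do not.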







Multicoloring [6] (ABMC): does it have an impact on performance? 
M. T. Jones and P. E. Plassmann, Scalable iterative solution of sparse linear systems. URL: https://doi.org/10.1016/0167-8191(94)90004-3 
T. Iwashita, H. Nakashima, and Y. Takahashi, Algebraic block multi-color ordering method for parallel multi-threaded sparse triangular solver in ICCG method. URL: https://doi.org/10.1109/IPDPS.2012.51 







Is it good or bad? 






Ivy Bridge E5-2660 v2 @ 2.2 GHz 
Performance Engineering – Roofline Model 
 
// SpMV 
for(int row=0; row<nrows; ++row) 
{ 
    double temp=0; 
    for (int j=rowptr[row]; j<rowptr[row+1]; ++j) 
    { 
        temp += A[j] * x[col[j]];       // 2 flops per stored nonzero 
    } 
    b[row] += temp; 
} 














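// SymmSpMV (upper triangle stored) 
for(int row=0; row<nrows; ++row) 
{ 
    double temp=0; 
    for (int j=rowptr[row]; j<rowptr[row+1]; ++j) 
    { 
        temp += A[j] * x[col[j]];       // 4 flops per stored nonzero in total 
        b[col[j]] += A[j] * x[row]; 
    } 
    b[row] += temp; 
} 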














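$P_{\mathrm{SpMV}} = I_{\mathrm{SpMV}} \cdot b_s$ (measure $\alpha_{\mathrm{SpMV}}$) 
$P_{\mathrm{SymmSpMV}} = I_{\mathrm{SymmSpMV}} \cdot b_s$ 
Here $P$ is the predicted performance, $I$ the computational intensity (flops per byte) and $b_s$ the memory bandwidth. 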
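As a rough illustration of the roofline estimate, the sketch below computes intensity and predicted performance for both kernels. The traffic model is an assumption of this sketch and not taken from the slides: 8-byte matrix values, 4-byte column indices, a factor `alpha` for the traffic caused by indirect vector accesses, and a per-row overhead divided by the average number of stored nonzeros per row (`nnzr`). All function names are hypothetical.

```c
/* Flops per byte for CRS SpMV: 2 flops per nonzero.
 * Assumed bytes per nonzero: 8 (value) + 4 (index) + 8*alpha (x)
 * plus 20 bytes per row (4 for rowptr, 16 to read and write b). */
static double spmv_intensity(double alpha, double nnzr)
{
    return 2.0 / (8.0 + 4.0 + 8.0 * alpha + 20.0 / nnzr);
}

/* Flops per byte for SymmSpMV: 4 flops per stored nonzero.
 * Assumed bytes per stored nonzero: 8 + 4 + 24*alpha (x read, b read and
 * written) plus 4 bytes of rowptr per row. */
static double symm_spmv_intensity(double alpha, double nnzr_symm)
{
    return 4.0 / (8.0 + 4.0 + 24.0 * alpha + 4.0 / nnzr_symm);
}

/* Memory-bound roofline prediction: P = I * b_s.
 * With b_s in GB/s the result is in GFlop/s. */
static double roofline(double intensity, double bs)
{
    return intensity * bs;
}
```

In this model a smaller `alpha` (better data locality) directly raises the predicted performance, and the SymmSpMV term is three times as sensitive to `alpha` as the SpMV term.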





Recursive Algebraic Coloring Engine 
 
RACE uses a recursive level-based method. 
 
Objectives motivated by hardware efficiency: 
• Preserve data locality (lower α factor). 
• Generate sufficient parallelism to support the underlying hardware. 
• Reduce synchronization overheads. 











Step 1: Level Construction – BFS 
Example of 2D-7 Point stencil 
Stencil, Sparse Matrix, Graph 
Original graph, Permuted graph 
* Store level_ptr 
7. C. Y. Lee, An algorithm for path connections and its applications. URL: https://doi.org/10.1109/TEC.1961.5219222 
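A minimal sketch of BFS level construction is given here for orientation. It assumes a connected graph whose full symmetric sparsity pattern is stored in CRS form (`rowptr`, `col`); the function name `bfs_levels` and the array `perm` are hypothetical, while `level_ptr` follows the slide. It is not RACE's actual implementation.

```c
#include <stdlib.h>

/* Breadth-first search from `root`.
 * perm[i]      : original vertex placed at position i of the permuted graph
 * level_ptr[l] : first position of level l in perm (needs nrows+1 entries)
 * Returns the number of levels, or -1 if memory allocation fails. */
static int bfs_levels(int nrows, const int *rowptr, const int *col,
                      int root, int *perm, int *level_ptr)
{
    int *visited = calloc((size_t)nrows, sizeof(int));
    if (!visited) return -1;
    int head = 0, tail = 0, nlevels = 0;

    perm[tail++] = root;
    visited[root] = 1;
    level_ptr[0] = 0;

    while (head < tail) {
        int level_end = tail;            /* [head, level_end) is the current level */
        for (; head < level_end; ++head) {
            int v = perm[head];
            for (int j = rowptr[v]; j < rowptr[v + 1]; ++j) {
                int w = col[j];
                if (!visited[w]) {       /* unvisited neighbors form the next level */
                    visited[w] = 1;
                    perm[tail++] = w;
                }
            }
        }
        level_ptr[++nlevels] = level_end;
    }
    free(visited);
    return nlevels;
}
```

Vertices of one level are stored contiguously in `perm`, so applying `perm` as a symmetric permutation yields a matrix in which every level is a contiguous block of rows.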
Step 2: Distance-k coloring 
Example of 2D-7 Point stencil for D2 coloring 
Levels $L(n)$ and $L(n+k+i)$ are distance-k independent $\forall\, i \ge 1$. 
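One way to turn this property into a two-color scheme is to aggregate k consecutive levels into a level group and color the groups alternately. The sketch below illustrates that idea under this assumption; the names `level_color` and `coloring_is_valid` are hypothetical. The check confirms that any two levels with the same color but in different groups are more than k levels apart.

```c
/* Color of a level when k consecutive levels form one level group and
 * groups alternate between color 0 and color 1. */
static int level_color(int level, int k)
{
    return (level / k) % 2;
}

/* Verify the scheme: two levels that share a color but lie in different
 * groups must be at least k+1 levels apart (distance-k independent). */
static int coloring_is_valid(int nlevels, int k)
{
    for (int a = 0; a < nlevels; ++a) {
        for (int b = a + 1; b < nlevels; ++b) {
            int same_group = (a / k) == (b / k);
            int same_color = level_color(a, k) == level_color(b, k);
            if (!same_group && same_color && (b - a) <= k)
                return 0;
        }
    }
    return 1;
}
```

For D2 coloring (k = 2), levels 0 and 1 share color 0, levels 2 and 3 get color 1, and levels 4 and 5 get color 0 again, three levels away from level 1.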
Example of  2D-7 Point stencil for D2 coloring 
But … $n_r(T_0) = 3$ while $n_r(T_4) = 13$: load imbalance 
Step 3: Load-balancing 
Exploit only the parallelism required by the hardware 
Equal amount of work 
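The two goals above can be illustrated with a simple greedy pass over the levels. This is only a sketch of the idea and not the load-balancing scheme used by RACE: consecutive levels are merged into a group until the group holds at least k levels and at least the average number of rows per requested group. The function name `balance_levels` and the array `group_ptr` are hypothetical; `level_ptr` is the array stored in Step 1.

```c
/* Merge consecutive levels into level groups.
 * group_ptr[g] .. group_ptr[g+1]-1 are the levels of group g
 * (group_ptr needs nlevels+1 entries). Returns the number of groups. */
static int balance_levels(int nlevels, const int *level_ptr, int k,
                          int ngroups_wanted, int *group_ptr)
{
    double target = (double)level_ptr[nlevels] / ngroups_wanted;
    int ngroups = 0, start = 0;
    group_ptr[0] = 0;

    for (int l = 0; l < nlevels; ++l) {
        int nlev = l + 1 - start;                         /* levels in open group */
        int rows = level_ptr[l + 1] - level_ptr[start];   /* rows in open group */
        if (nlev >= k && rows >= target) {
            group_ptr[++ngroups] = l + 1;                 /* close the group */
            start = l + 1;
        }
    }
    if (start < nlevels) {                                /* leftover levels */
        if (nlevels - start >= k || ngroups == 0)
            group_ptr[++ngroups] = nlevels;               /* own group */
        else
            group_ptr[ngroups] = nlevels;                 /* merge into last group */
    }
    return ngroups;
}
```

Requesting fewer groups yields larger groups with more rows each, which corresponds to asking only for as much parallelism as the hardware needs.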









Example of 2D-7 Point stencil for D2 coloring, 5 threads 
More levels 
since 𝑛𝑟 on 











Parallelism limited by … 
Example of 2D-7 Point stencil for D2 coloring for 5 threads 
Need more parallelism 
… graph based on load-balancing 
Recursion 
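To show how a two-colored set of level groups might be used, the sketch below runs the symmetric kernel one color at a time, with the groups of a color processed in parallel. It covers a single (non-recursive) stage only and makes several assumptions: the matrix is already permuted into level order, only the upper triangle is stored with the diagonal entry first in each row, and `group_row_ptr[g]` (a hypothetical array) is the first row of level group g. The function name `symm_spmv_colored` is also hypothetical.

```c
/* SymmSpMV over alternately colored level groups.
 * Groups of the same color write to disjoint parts of b, so they can run
 * concurrently; the end of each parallel loop is the only synchronization. */
static void symm_spmv_colored(int ngroups, const int *group_row_ptr,
                              const int *rowptr, const int *col,
                              const double *A, const double *x, double *b)
{
    for (int color = 0; color < 2; ++color) {
        #pragma omp parallel for schedule(static, 1)
        for (int g = color; g < ngroups; g += 2) {
            for (int row = group_row_ptr[g]; row < group_row_ptr[g + 1]; ++row) {
                double temp = A[rowptr[row]] * x[row];   /* diagonal, counted once */
                for (int j = rowptr[row] + 1; j < rowptr[row + 1]; ++j) {
                    temp += A[j] * x[col[j]];
                    b[col[j]] += A[j] * x[row];          /* mirrored update */
                }
                b[row] += temp;
            }
        }
    }
}
```

Without OpenMP enabled the pragma is ignored and the routine runs serially with the same result.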






Ivy Bridge E5-2660 v2 @ 2.2 GHz 
Matrix source: University of Florida [5] collection 









Performance Comparison with RLM and MKL – Skylake (20 threads) 
Matrix source: University of Florida [5] collection 









Performance Comparison with MC and ABMC – Skylake (20 threads) 
Small matrices 
• ABMC: on par 
• MC, MKL: 2.25x 
Large matrices 
• ABMC: 1.8x 
• MC, MKL: 2.25x 
Final remarks 
 
• RACE also improves convergence of e.g. Gauss-Seidel and Kaczmarz compared to MC and ABMC (ongoing study) 
 
• Light-weight software for setup and kernel application: https://bitbucket.org/ChristieAlappat/RACE-AD 
 
• Several awards for Christie: SPPEXA best M.Sc. thesis 2017, SC18 ACM student research competition, 2nd place in 2019 
 
• Preprint “Recursive Algebraic Coloring Technique for Hardware-Efficient Symmetric Sparse Matrix-Vector Multiplication” (available on request) 
