GEneral Matrix Multiply (GEMM) is the most fundamental computational kernel routine in the BLAS library. To achieve high performance, in-memory data must be prefetched into fast on-chip caches before they are used. Two techniques, software prefetching and data packing, have been used to effectively exploit the capability of on-chip least recent used (LRU) caches, which are popular in traditional high-performance processors used in high-end servers and supercomputers. However, the market has recently witnessed a new diversity in processor design, resulting in high-performance processors equipped with shared caches with non-LRU replacement policies. This poses a challenge to the development of high-performance GEMM in a multithreaded context. As several threads try to load data into a shared cache simultaneously, interthread cache conflicts will increase significantly. We present a Shared Cache Partitioning (SCP) method to eliminate interthread cache conflicts in the GEMM routines, by partitioning a shared cache into physically disjoint sets and assigning different sets to different threads. We have implemented SCP in the OpenBLAS library and evaluated it on Phytium 2000+, a 64-core AArch64 processor with private LRU L1 caches and shared pseudorandom L2 caches (per four-core cluster). Our evaluation shows that SCP has effectively reduced the conflict misses in both L1 and L2 caches in a highly optimized GEMM implementation, resulting in an improvement of its performance by 2.75% to 6.91%. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org. Among the three levels of BLAS routines, level 3 provides the most opportunities for optimization because it has the highest computational complexity (of O (n 3 )). Among all level 3 operations, GEneral Matrix Multiply (GEMM) is of the most interest because the other level 3 operations can be defined in terms of GEMM and some level 1 and level 2 operations [10] . In the past, much effort has been spent on optimizing GEMM for different architectures [5, 19, 23, 28, 29] .
Following the trend in recent decades, modern architecture design is introducing more and more cores on a single processor chip. To reduce the cost of on-chip cache memories and coherence networks, shared caches may become common in future many-core processors. In addition, non-LRU replacement policies, for example, pseudo-random, may also be used to further reduce the cache design complexity. Indeed, some high-performance processors based on the ARM technology, e.g., the Phytium processor series [1] , have already adopted this design. As a result, developing a better solution to reduce the interthread cache conflicts for shared non-LRU caches is important.
In this article, we present a Shared Cache Partitioning (SCP) method to reduce interthread cache conflicts on architectures with shared non-LRU caches. The key insight is to partition a shared cache into physically disjoint sets and assign different sets to different threads. This can be achieved by exploiting the memory mapping mechanism in a set-associative cache.
To the best of our knowledge, SCP represents the first work to address the interthread cache conflicts problem for GEMM on architectures with shared non-LRU caches. As other level 3 routines, such as TRSM, TRMM, SYMM, and HEMM, call GEMM internally, they should also benefit from SCP; i.e., the SCP methodology can apply transparently to these BLAS-3 routines as well.
The main contributions of this article are as follows:
• We propose SCP, a method for reducing the interthread cache conflicts for GEMM on architectures with shared non-LRU caches.
• We present a quantitative analysis of the negative effect of interthread cache conflicts on the GEMM performance on architectures with shared non-LRU caches.
• We have implemented SCP in the OpenBLAS library and evaluated it on Phytium 2000+, an emerging high-performance processor based on ARM's AArch64 architecture. Phytium 2000+ has 64 cores, private LRU L1 caches, and shared L2 caches with a pseudo-random replacement policy. By reducing the interthread cache conflicts effectively, SCP can improve the performance of a highly optimized GEMM implementation consistently under various parallelization configurations, by 2.75% to 6.91%.
The rest of the article is organized as follows. Section 2 gives some background on GEMM and discusses the cache conflicts incurred. Section 3 introduces our SCP method. Section 4 presents and analyzes our experimental results. Section 5 reviews the related work. Finally, Section 6 concludes.
BACKGROUND 2.1 The Memory Hierarchy Review
Modern architectures take advantage of fast on-chip caches to overcome the "memory wall." As a compromise between direct-mapped caches and full-associative caches, set-associative caches have been the dominant choice for real-world architectures. To understand how cache sharing causes interthread cache conflicts and hurts GEMM performance, we review some details about the working mechanism of set-associative caches and the memory hierarchy of modern processors.
A set-associative cache can be defined by a 4-tuple (c, l, n, nt ), where c is the cache capacity, l is the cache line size, n is the number of ways of associativity, and nt indicates the number of cores sharing the cache. The whole cache memory is divided into ns = c/l/n sets, identified by indices in S = [0, ns), with each set containing n cache lines. Generally, ns is a power-of-two. An indexing function φ : A → S is responsible for mapping the addresses in the address space A (either virtual or physical) to the cache sets in S. The data at addr ∈ A must be fetched into the cache set φ(addr ) ∈ S if requested. A classic implementation of the indexing function is shown below, with and & representing logical-shift-right and bitwise-and operations, respectively: which represents an extraction of the log 2 (ns) bits starting from the log 2 (l )-th position in addr 's binary form. It is easy to prove that any continuous memory block that is no larger than c/n cannot conflict with itself. We therefore define a quantity, way-capacity, denoted as wc = c/n, to represent this size bound for conflict-free memory blocks. The main memory, N L levels of caches, and NT processor cores form a memory hierarchy and are generally organized in terms of an N L + 2 level regular tree structure. The processor cores at level 0 are leaves, the caches at levels 1 -N L are intermediate nodes, and the main memory at level N L + 1 is the tree root. Any node in the memory hierarchy can be identified by its layer and its unique index at the layer (layer, index ). The number of nodes at layer l is NT /nt l .
Throughout this article, our example platform will be a four-core processor with N L = 2 levels of caches. Figure 1 shows its memory hierarchy and Table 1 lists its architectural parameters. Each core has thirty-two 128-bit (16B) vector registers, each capable of storing two double-precision floating-point numbers. For convenience, a processor core at level 0 is represented by its registers, which can be viewed as a special level 0 cache L0 = (512B, 16B, 32, 1) with a programmable replacement policy.
Structure of GEMM
GEMM performs a matrix-multiply-accumulate operation C = βC + αAB, where A, B, and C are matrices of sizes M × K, K × N , and M × N , respectively, and α, β are scalars. While this operation is algorithmically simple, so that a three-deep loop nest suffices to accomplish it, a highperformance implementation can be quite sophisticated due to the presence of multilevel memory hierarchies on modern processors. Figure 2 shows the structure of GEMM from the OpenBLAS library [39] . Each loop (in the original three-deep loop nest) is tiled, resulting in a total of six loops (referred to as layers [1] [2] [3] [4] [5] [6] .
In this blocked algorithm, the jj, kk, and ii loops along the N , K, and M matrix dimensions are tiled by sizes N c , K c , and M c at layers 1, 2, and 3, respectively. At layers 4 and 5, the j and i loops along the N and M dimensions are further tiled by sizes N r and M r , respectively. As a result, the innermost loop at layer 6 goes over the K dimension for a total of K c times, with each iteration performing a rank-1 update on the M r × N r submatrix of C, as shown at layer 7. In OpenBLAS, matrix C is scaled by β before the loops, so the multiply-accumulate operation at layer 7 does not have to scale C again. Tiling factors N c , K c , M c , N r , and M r are carefully selected so that matrices at each level fit into a certain level in the memory hierarchy, with the following constraints:
where es denotes the size of a matrix element, e.g., 8B for a double-precision floating-point number, and c l (0 l 3) denotes the capacity of the register file or cache at level l. By Equation (1) Goto [7] factors out the innermost three loops at layers 4 to 6 for computing C 2 = C 2 + αA 2 B 2 as an architecture-dependent kernel, known as GEBP (GEneral multiply of a Block of A and a Panel of B). It is worth noting that before GEBP is called (at layer 3),
[:] and B 1 are packed into A 2 and B 2 in a special continuous layout, so consecutive memory access is ensured within the GEBP kernel. GotoBLAS [8] and its successor, OpenBLAS [39] , implement GEMM programs based on this factorization, with GEBP highly optimized for the target processor.
As stated in Section 1, a GEMM implementation is made up by a kernel routine and an overall workload-partitioning strategy. In Figure 2 , GEBP serves as the kernel routine, and the workloadpartitioning strategy is defined by the three outer loops at layers 1 to 3. Developers can decide how to partition the workload by choosing tile factors M c , N c , and K c , and how to schedule the GEBP tasks by interchanging loop orders at layers 1 to 3. Either the jj loop at layer 1 or the ii loop at layer 3, or both, can be parallelized in a multithreaded context. Specifically, the implementation in Figure 2 chooses an jj-kk-ii loop order and parallelizes the ii loop at layer 3. Different threads work on different portions of A 1 but share the same B 1 .
As a result, there are multiple A 2 instances (one per thread) and a single B 2 instance (shared by all threads). To understand how the ii loop at layer 3 is parallelized on the four-core processor, Figure 3 shows the data accessing patterns for a single thread T 1 . While all the color-shaded submatrices are accessed by T 1 , only the submatrices highlighted by the z-curves (a subset of color-shaded ones) are packed by T 1 . In other words, each thread packs its own A 2 instance but all the threads pack collaboratively the unique B 2 instance.
Cache Conflicts in GEMM
For set-associative caches, cache misses can be categorized into capacity and conflict misses. While Equations (1) to (4) aim to avoid capacity misses, conflict misses cannot be eliminated completely.
In Figure 2 , the packed matrix A 2 will be read N c /N r times within a single GEBP execution and is expected to reside in the L2 cache during all the N c /N r iterations at layer 4. The packed matrix B 4 is similar in that it is expected to reside in the L1 cache during all the M c /M r iterations at layer 5. In general, even with carefully selected tile factors, some matrix elements can still be evicted. For example, some data structures other than the packed matrices or program code (in case of a unified cache) may be fetched into the cache set in which these matrix elements reside. Such cache conflicts are known as intrathread cache conflicts because they occur within the execution of a single thread. To deal with intrathread cache conflicts, the GEBP kernel repeatedly prefetches data from A 4 and B 4 to the L1 cache during every iteration at layer 6 before it actually reads them 
just in case the data may have been evicted accidentally from the L1 cache. As GEBP accesses the matrix data much more frequently than other data structures, this repeated prefetching strategy can reduce the penalty caused by intrathread cache conflicts. In contrary to intrathread cache conflicts, there exists another kind of cache conflicts, called interthread cache conflicts, on architectures with shared caches. Suppose two working threads, T 0 and T 1 , are running independently on the four-core processor in Figure 1 to accomplish a GEMM operation. At time t s , T 0 prefetches some data at address addr 0 to the cache set φ 2 (addr 0 ) in the shared L2 cache. The data will be actually read at time t e . Then at some time t m (t s < t m < t e ), T 1 prefetches data at address addr 1 to the L2 cache. If φ 2 (addr 1 ) = φ 2 (addr 0 ), then the data at addr 1 and addr 0 will be fetched into the same cache set. As the shared L2 cache has a non-LRU policy, the data at addr 0 may suffer an eviction before it is actually read; i.e., a conflict miss occurs.
In GEMM, every cache is utilized aggressively so that almost all of its cache space is expected to be occupied by packed matrices. Prefetch instructions from different threads are executed repeatedly, interleaved in an unpredictable manner. So interthread cache conflicts described above can occur frequently for shared non-LRU caches, thus making the technique dealing with intrathread cache conflicts ineffective. The situation worsens with more threads sharing the same cache.
SCP: THE SHARED CACHE PARTITIONING METHOD
This section introduces our SCP method for reducing interthread cache conflicts. The key insight is that interthread cache conflicts can be eliminated if packed matrices used by different threads are fetched to different cache sets in a shared L2 cache. In Section 3.1, we illustrate SCP by an example. In Section 3.2, we give the algorithms in our SCP method.
An Example
Suppose we want to implement a double-precision GEMM (DGEMM) on our four-core example processor, where es = 8B. We now determine the tiling factors used based on the constraints in Equations (1) through (4). Many solutions for (M r , N r ) can satisfy Equation (1), including (4, 8) , (8, 4) , (6, 8) , and (8, 6). A tuning process returns M r = 4 and N r = 8 as the best. By Equa-
To leave space for other data structures and the program code, it is reasonable to shrink M c slightly, leaving, for example, 192, 208, and 224 as good candidates. Again, a tuning process yields M c = 192. As there is no L3 cache, Equation (4) can be ignored and N c can be given a large value. Here, we choose N c = 1,024NT = 4,096. Given these tiling factors, the sizes of packed matrices can be calculated. Table 2 lists the sizes of various packed matrices used.
Each A 2 is packed into a single continuous buffer. from different threads will reside in strictly disjoint cache sets, enabling the interthread cache conflicts caused by A 2 to be eliminated completely! Figures 4(a) and 4(b) essentially represent two different partitioning styles for a shared cache, referred to as way-partitioning and set-partitioning, respectively. To obtain the data layout in setpartitioning, A 2 can no longer be stored in a single continuous buffer. Instead, A 2 will be distributed to 12 continuous memory segments, each of size 32KB. The distance from one segment to the next is wc 2 = 128KB. While avoiding interthread cache conflicts, set-partitioning introduces some complexity to GEMM implementations because the GEBP kernel and the matrix packing routines now must work with segmented memory buffers instead of continuous ones.
What about B 2 , the other packed matrix used in GEBP? Unlike A 2 , which is thread private, B 2 is shared among all the threads and the packing of B 2 is done collaboratively by all the threads. There are two choices (with different tradeoffs): (1) privatize B 2 and apply set-partitioning and (2) fall back to way-partitioning for B 2 (as done traditionally). B 2 can be made thread private if every thread makes its own pack of the whole B 1 matrix. The first choice achieves a full per-thread data isolation at the expense of redundant packing overhead. The second choice avoids the extra overhead in both time and space but potentially increases the interthread cache conflicts caused by the shared B 2 matrix. In this work, these two choices are evaluated and compared in Section 4.
Algorithms
Given the tiling factors M r , N r , K c , M c , N c , and architectural details of the memory hierarchy, SCP will systematically determine the memory layout for the packed matrices A 2 and B 2 in GEMM. The memory layout of the other packed matrices at lower layers is determined once the memory layout of A 2 and B 2 is determined. We assume that the shared matrix B 2 is packed in the conventional way-partitioning style. If desired, B 2 can be privatized and handled in the same way as A 2 .
We make a few standard assumptions, which usually hold for the architecture considered:
• The architecture has two or three levels of caches.
• All caches are inclusive set-associative caches.
• The caches at the same level are homogeneous.
• A cache's way-capacity is a multiple of the sum of way-capacities of its children, i.e., (
We use a memory descriptor D M t to specify the memory space occupied by a packed matrix M used by a thread t. The memory descriptor is essentially a subspace of the whole address space, i.e., D M t ⊂ A. Generally, D M t consists of a sequence of N M t disjoint memory segments, Require:
if nt l /nt l −1 > 1 and L l is non-LRU then 6: aliдn ← lcm(aliдn, wc l /nt l ) 
A segment is represented by its start and end addresses,
, where s i < e i ∈ A. With way-partitioning, the memory descriptor contains only one segment. With set-partitioning, the memory descriptor contains a sequence of equally strided segments.
SCP runs in three phases, as shown in Algorithms 1 to 3, respectively. The first phase allocates the memory buffers for all A 2 and B 2 instances used in GEMM. There are two buffers, D p and D s , for storing private and shared matrices, respectively. The second phase computes the memory descriptors for A 2 instances by partitioning D p into NT thread-private buffers D 2) . Because the memory space in D p will be set-partitioned, extra efforts are made to ensure a proper alignment for D p . Then aliдn is found (lines 3-8) and size p is enlarged to a multiple of aliдn (line 9). Specifically, aliдn is initialized with the L1 cache line size l 1 . Then for each level l (line 4) with shared non-LRU caches (line 5), aliдn is updated by computing the least common multiple of the earlier aliдn and wc l /nt l (line 6). The insight on using wc l /nt l is that each way of the level l cache should be equally partitioned among the nt l threads in set-partitioning. Finally, memory is allocated (lines 10-11) and buffers D p and D s are created (lines 12-13).
In Algorithm 2 (Phase 2), the main component is a recursive procedure subspace (lines 5-26). This procedure traverses the memory hierarchy to compute for each node, including the processor cores and main memory, a subspace of D p . There are three input parameters, l and idx for identifying the node on which it is running, and a subspace 
Require: L
return 9: end if 10: nchildren ← nt l /nt l −1
11:
for i = 0 to nchildren − 1 do 12:
len ← (ub − lb)/nchildren 17:
else 19: len ← ns l /nchildren 20:
21:
end if 23 :
end for 26: end procedure for computing D temp . If the node is main memory or an LRU cache (line 13), D idx l is partitioned in the conventional way-partitioning style. Otherwise, the else branch (line 18) performs a setpartitioning on D idx l . Initially, the main memory is the whole D p (line 1) and nt N L+1 is set to NT (line 2) because the main memory is shared by all the threads. Then the subspace procedure starts from the main memory (line 3) and traverses the memory hierarchy in a depth-first-search order.
In Algorithm 3 (Phase 3), the computation of D
t is quite simple. This is done by dividing D s into NT partitions, one for each thread. First, the range of D s is obtained (lines 1-2) and the capacity of D s is equally divided among the NT threads (line 3). Then D s is partitioned in a waypartitioning style and each thread t obtains the buffer space for its part in B 2 (lines 4-6).
PERFORMANCE EVALUATION
We have implemented SCP in the OpenBLAS [39] library and evaluated it on a Phytium 2000+ processor. The Phytium 2000+ processor is an emerging high-performance 64-core processor based on ARM's AArch64 architecture. The 64 cores are organized into 16 clusters with each cluster containing four cores. The structure of a cluster is almost the same as the four-core processor in Figure 1 except that the L2 caches of all 16 clusters connect to the main memory with hardware coherence. Table 1 lists the architectural parameters for the cores and caches in this processor and Table 3 shows the software environment used. The L2 caches in the Phytium 2000+ processor are Require:
end for physically indexed. In our implementation, physically continuous memory buffers (allocated by the HUGETLBFS filesystem provided by the Linux kernel) are used for storing matrices in packed form. Therefore, for a given shared L2 cache, the matrix data accessed by different threads are guaranteed to be free of conflict cache misses. We restrict our evaluation to DGEMM, as in prior work [23, 30, 36] , for two reasons. First, the basic idea behind SCP applies to other variants of GEMM such as SGEMM, CGEMM, and ZGEMM. Second, the LINPACK benchmark, which is used to build the TOP500 list of the world's most powerful supercomputers [2] , relies on the DGEMM variant.
In our evaluation, we use the tiling factors computed as in Section 3.1:
, as this combination causes GEMM to deliver the best performance when SCP is not used. The degree of parallelism is controlled by two parameters, the number of active clusters NC and the number of active threads per cluster NT C . The total number of threads is NT = NC × NT C . For the Phytium 2000+ processor, we have NC ∈ [1, 16] and NT C ∈ [1, 4] . Generally, the DGEMM performance is measured in f lops. As NT varies in our evaluation, we use another metric derived from f lops, the average thread efficiency, to describe our performance results consistently. The average thread efficiency is a normalized value computed as E avд = f lops/(NT × f lops), where f lops stands for the theoretical performance peak of a single core.
The matrix sizes selected range from 3,072 to 4,096 when NT ∈ {4, 8} and to 6,144 when NT ∈ {16, 32, 64}. These sizes are large enough to make GEMM run stably at peak performance given the total number of threads participating in the computation.
We show how cache sharing can hurt GEMM performance (Section 4.1), demonstrate how SCP can reduce the penalty caused by cache sharing (Section 4.2), conduct a quantitative analysis of the cache miss rates (Section 4.3), and discuss the privatization of the shared matrix B 2 (Section 4.4).
Penalty of Cache Sharing
In a cluster, the L2 cache is shared by its four cores. The impact of cache sharing can be demonstrated by comparing the DGEMM performance results achieved with the same NT but different NT C s. Figure 5 shows the results. Each curve represents a parallelization configuration. Different subfigures are configured with different NT s, and different curves in the same subfigure are configured with different NT C s. As there is only one configuration for NT = 64 (NC = 16 and NT C = 4), no comparison is available. Thus, the results for NT = 64 and NT = 32 are combined in Figure 5 (d). For all NT s, NT C = 1 achieves the best performance, followed by NT C = 2, and NT C = 4 comes at the lowest efficiency. Table 4 summarizes the average value of E avд across all the matrix sizes. Vertically, the results in the same column show how E avд varies with NT C . For all NT s, NT C = 1 outperforms NT C = 4 by a large margin. The performance gap is 7.4% when NT = 4 and roughly 10% when NT > 4. Horizontally, the results in the same row reflect the scalability of DGEMM with certain NT C s. With NT C = 1, the performance scales linearly with the number of threads. With NT C = 2, E avд suffers a 1.86% loss when NT grows from 4 to 32. The situation is much worse for NT C = 4, in which E avд drops by 14.19% from NT = 4 to NT = 64.
This shows clearly that cache sharing has a considerable impact on the DGEMM performance. 
Effectiveness of SCP
We evaluate SCP using the same method as in Section 4.1. SCP is not evaluated with NT C = 1 because there is no need to partition the cache if NT C = 1. Figure 6 shows the results. For comparison purposes, the results without SCP (denoted as Base) are also presented. Under all the parallelization configurations, SCP performs better than Base consistently across all the matrix sizes evaluated. Table 5 summarizes the performance improvement of SCP over Base. With NT C = 2, SCP performs slightly better than Base by 1.7% to 1.8%. The performance gap becomes larger with NT C = 4, ranging from 2.75% to 6.91%. The largest performance gain, 6.91%, which is a considerable improvement, is observed under the maximum parallelism with NT = 64.
In Figures 5 and 6 , the performance drops slightly as the matrix size increases when NT ≥ 16. Let us explain the reason behind this. There is a synchronization point at the end of each iteration of the loop at layer 2 ( Figure 2 ). This synchronization is an all-to-all communication, whose overhead is proportional to NT 2 . This overhead is small when NT is 4 or 8 but increases quickly as NT increases. In addition, the performance also drops when matrix sizes are multiples of 1,024, as GEMM suffers from more memory contention at these matrix sizes. To understand this better, we have profiled the GEMM execution for the matrix sizes 3,968 and 4,096 under NT C = 4 and NC = 4. The measured bandwidth for packing A (B) is 1.44 (0.89) GB/s for the matrix size 3,968 but drops significantly to 0.81 (0.44) GB/s for the matrix size 4,096. As the memory banks in the Phytium 2000+ processor are accessed in a nonbalanced manner, the tiling factors M c , N c , and K c selected make it more likely for GEMM to suffer from memory contention at matrix sizes that are multiples of 1,024.
Cache Miss Rate Analysis
To understand why SCP outperforms Base and why SCP is effective in reducing interthread cache conflicts, we use the PAPI profiling tool [26] to analyze the cache behavior of GEMM. A total of four cache-related hardware events are counted, which are listed and briefly described in Table 6 . Based on these hardware events, the miss rates of L1 and L2 caches can be estimated as L1M L1M +L1H and L2M L2M +L2H , respectively. In this analysis, the hardware events are measured in a per-thread manner and all the results are the average measurements over all the threads. Figure 7 shows the miss rates of L1 and L2 data caches. We only give the results with NT C = 4 as SCP achieves the largest improvement with full cache sharing. Table 7 summarizes their average reductions across the matrix sizes used. Under all the parallelization configurations, both L1 and L2 cache miss rates are reduced under SCP, by roughly 13% for L1 and nearly 10% for L2.
For the Phytium 2000+ processor whose L1 caches are private, only L2 caches are specially handled by SCP. However, the miss rates for both L1 and L2 caches are reduced effectively by SCP. While a reduction in the L2 miss rate is expected, why is the L1 miss rate also reduced? In this processor, every L2 cache is inclusive. The conflict misses for the L2 cache not only cause the data to be evicted from L2 but also invalidate the evicted data's copy in all its upper-level L1 caches, resulting in more L1 cache misses. In general, the conflict misses for an inclusive cache can affect the performance of all its upper-level caches. As a result, improving the miss rate of a shared L2 cache also improves the miss rates of all its upper-level L1 caches. Our results show that SCP can effectively reduce the conflict misses in shared L2 caches. In addition, the performance benefit thus obtained also propagates to their upper-level caches too.
Privatization of Shared Matrix
In the previous evaluations, SCP only set-partitions the shared caches for A 2 matrices, which are private to threads. The shared B 2 matrix is still packed in the conventional way-partitioning style. As mentioned in Section 3.1, set-partitioning can also be applied to B 2 after B 2 is privatized. In this section, we discuss and evaluate this privatization alternative.
Privatization incurs more packing overhead because the packing of B 2 must be done redundantly. Figure 8 shows three packing strategies for B 2 in a parallelization configuration specified by NT C = 4 and NC = 4. In each case, the entire packing workload of B 2 is divided into 16 tasks. So only the packing workload of one thread, T 1 , which is highlighted by the z-curves (as shown), is illustrated. In the conventional packing approach (Figure 8(a) ), each thread takes a single task. In full privatization (Figure 8(b) ), all tasks are replicated to all the threads. As a result, the packing overhead grows in proportion to NT , which is unacceptable if GEMM is highly parallelized, e.g., when NT = 64. In partial privatization (Figure 8(c) ), which is an alternative to full privatization, privatization only occurs inside a cluster so that the tasks for one cluster need not be replicated in the other clusters. In this case, the extra packing overhead is bounded by NT C . For example, on Phytium 2000+, the overhead of packing B 2 in partial privatization is limited to 4 times of that in the conventional packing approach. Figure 9 shows the results for partial privatization with NT C = 4 and NC = 4. The results for the other configurations are similar. For comparison purposes, SCP with and without privatization, denoted as SCP-P and SCP, respectively, are both given in Figure 9 . Figure 9(a) shows the L1 and L2 cache miss rates, which are slightly lower with privatization (SCP-P) than without (SCP). Figure 9 (b) shows the average thread efficiency. Despite its lower L1 and L2 cache miss rates, privatization (SCP-P) suffers a performance loss (relative to SCP) due to the extra packing overhead introduced by privatization. Figure 9 (c) compares the three types of major overheads in percentage terms over the total execution time, including (1) packing of A 2 , (2) packing of B 2 , and (3) synchronization. The solid and dashed horizontal lines represent the average values of the total overheads across the matrix sizes under SCP and SCP-P, respectively. While behaving similarly as SCP in (1) and (3), SCP-P is worse than SCP in (2) . To understand this further, Table 8 gives a breakdown of each's execution time in terms of (1) through (3) and the computation time of GEBP. As the overhead for packing B 2 is multiplied by a factor of NT C = 4, the total overhead has thus increased from 2.35% (under SCP) to 4.72% (under SCP-P).
Our results demonstrate that privatizing the shared matrix B 2 can further reduce interthread cache conflicts. However, the improvement in cache performance is small. As B 2 only occupies a small portion of the shared L2 cache during GEMM execution, this marginal improvement is not sufficient to balance the extra packing overhead introduced by privatization. 
RELATED WORK
The idea that all level 3 BLAS operations can be built on top of a high-performance GEMM implementation was first proposed in [10, 11] . The GotoBLAS library [8] and its successor Open-BLAS [39] are instantiated based on this insight. So optimizing GEMM has always been the central task in developing dense linear algebra software. The GEMM optimization has two aspects. One is developing fast computation kernels and the other is choosing a proper overall workloadpartitioning strategy to optimize memory access. There are several approaches for obtaining optimized GEMM kernels, yielding different tradeoffs between performance and portability. In GotoBLAS [8] , OpenBLAS [39] , and BLIS [27] , the kernel component of GEMM is written by domain experts in assembly. ATLAS [31] adopts the auto-tuning method to automatically generate kernels with different parameters in C and find the best-performing one by running them on the actual computing system. POET [34] [35] [36] and AUGEM [30] use a directive-based programming approach. Given a GEMM kernel in C, POET inserts annotations into the C code to direct source-to-source compiler transformations. AUGEM uses a template-based method to match predefined patterns in the C code and transforms the matched C code sequence into an optimized assembly code sequence. Both the auto-tuning and directive-based programming approaches rely on the compiler to transform kernels in C to machine instructions. Kernel performance can be improved with compiler techniques, including SIMD vectorization [6, 16, 22, 40, 41] , polyhedral optimization [3, 13] , and loop tiling [15, 24, 32] . Recently, Poca [25] has leveraged a wide range of architecture-specific abstractions available in the LLVM compiler infrastructure and proposed a series of domain-specific yet architecture-independent optimizations to obtain high-performance kernels in a portable way.
The overall workload-partitioning strategy used in GEMM is mainly concerned with choosing tiling factors, loop orders, and parallelization techniques. The tiling factors M r , N r , M c , N c , and K c are essential to GEMM performance. ATLAS [31] relies on auto-tuning to determine optimal values for these factors. Analytic techniques [12, 20, 37] can also be used instead of auto-tuning. These analytic methods generally take into consideration the cache capacity and associativity but assume an LRU replacement policy. As far as we know, this article is the first to address the problem of cache sharing with non-LRU replacement policies, and SCP serves as the first work to solve this problem. In [7] , a detailed discussion on choosing different loop orders is given. GEMM is usually parallelized only at layer 3 (the ii loop), but all loops along the M and N matrix dimensions are potentially parallelizable. BLIS [23] allows developers to specify a sophisticated configuration so that more than one loop at layers 1, 3, 4, and 5 can be simultaneously parallelized. This nested parallelization is well suited to complex architecture features like multisockets and hyperthreading.
As GEMM is highly optimized and parallelized, other higher-level dense linear algebra operations, such as Cholesky, LU, and QR factorization, can enjoy its performance benefits by building themselves on top of GEMM. In other words, this programming pattern in dense linear algebra software harnesses a fork-and-join-style parallelism. To reduce synchronization overhead, the PLASMA project [4, 14, 33 ] makes use of a task-driven parallelization paradigm, in which dense linear algorithms are formulated in terms of directed acyclic graphs (DAGs) of fine-grained tasks. The tasks are dynamically scheduled based on their data dependencies (represented by the graph edges). The basic idea behind SCP also applies to such DAG-based parallelization. Input matrices of different tasks can be packed into different sets of a shared cache. As a result, the tasks can be scheduled based on the cache sets in which the input matrices reside, so that the tasks whose input matrices reside in the same cache set are scheduled to the same core.
SCP is a cache partitioning technique. While SCP is specific to GEMM running on architectures with non-LRU caches, there are general cache partitioning techniques [9, 18, 21] that focus on estimating the interference across the multiple applications sharing the same cache and improving the overall system performance. In these research efforts, either hardware modifications are required or some sophisticated cache replacement policies are needed. In contrast, SCP is a pure software approach that works for commercial processors.
CONCLUSION
In this article, we present a shared cache partitioning method, SCP, to reduce interthread cache conflicts in GEMM programs on architectures with shared non-LRU caches. The basic idea is to partition a shared cache into physically disjoint sets and assign different sets to different threads. Our experimental results show that SCP can effectively reduce the interthread cache misses for a shared cache (and its upper-level caches), resulting in a considerable improvement in the GEMM performance in a multithreaded context.
