Adapting source code to the specifics of its host hardware is one way to implement software optimization. It allows the software to benefit from processor features that are primarily designed to improve system performance. To reach such a software/hardware fit without narrowing the scope of the optimization to a few executions, one needs relevant performance models of the considered hardware. This paper proposes a new method to optimize software kernels by considering their data-access mode. The proposed method builds a data-cache-miss model of a given application based on its specific memory-access pattern. We apply our method to evaluate some custom implementations of matrix data layouts. To validate the functional correctness of the generated models, we propose a reference algorithm that simulates a kernel's exploration of its data. Experimental results show that the proposed data alignment reduces the number of cache misses by up to 50% and decreases the execution time by up to 30%. Finally, we show the necessity of integrating the impact of the Translation Lookaside Buffers (TLB) and the memory prefetcher into our performance models.
INTRODUCTION
Over the last few decades, each new generation of hardware (HW) platform has surpassed the previous one by introducing a brand-new approach to memory storage or computation. This has given High-Performance-Computing (HPC) developers a very large set of potential software (SW) optimization methods applicable to their code. However, the considered HW platforms are very heterogeneous in terms of Instruction Set Architecture (ISA), registers and memory hierarchy. Consequently, the most commonly used SW optimizations are specific to the family of the given host HW. As an example, for a simple matrix multiplication application, [12] provides a set of SW optimizations that may be implemented and the corresponding gain. We note from this example that code efficiency is reached at the expense of portability, i.e. through code specialization and the use of hardware-specific instructions. Thus, porting a given source code to a new family of HW and benefiting from its performance may require a significant engineering effort.
RSP'19, October 17-18, 2019, New York, NY, USA. © 2019 Association for Computing Machinery. ACM ISBN 978-1-4503-6847-6/19/10. https://doi.org/10.1145/3339985.3358498
The objective of our study is to take a first step toward the automatic adaptation of a given SW kernel to its specific host HW. In this paper, we present a custom method to estimate the data-cache-miss ratio depending on HW parameters related to the memory hierarchy (e.g. cache-line size, number of cache lines). The generated estimation also depends on some OS-related parameters such as the memory-allocation policy. The proposed method is specifically designed to be embedded within a SW platform that automatically detects the adequate code optimizations to apply to a given source code, the objective being to make the code fit its host HW specifically. To illustrate our method, we consider the example of matrices accessed through memory patterns similar to those of simple matrix multiplication and convolution algorithms [14]. We consider matrix lines and columns that have been dynamically allocated (within the heap of the process). We assume that the data caches implement the "least-recently used" (LRU) cache-replacement policy. No assumption is made on the cache-write policy. Using the generated models, we pick the ideal data-layout implementation for the considered kernel. This reduces the number of cache misses by up to 50% and decreases the execution time by up to 30%. It is noteworthy that our objective is not to propose a new matrix multiplication or convolution algorithm. Other specialized implementations and libraries ([17], [9] and [16]) exist and are proven to reach peak performance. However, we pick this example as a visual way to illustrate our general-purpose method.
The rest of the paper is organized as follows. Section 2 discusses the state of the art. Section 3 provides background on the data-layout families used to store matrices. Section 4 details the proposed method to build a parametrized data-cache-miss model. Section 5 evaluates the performance benefit of the matrix data layouts. Finally, Section 6 concludes the paper.
STATE OF THE ART
Code optimization is a potentially endless topic in modern computer science. Various research solutions have been widely studied, ranging from just-in-time compilation [13] to polyhedral compilation [2] and source-to-source transformation [7]. We group some of these solutions in Table 1 and classify them according to their leverage point within an application's life cycle. We note from this table that the ecosystem of SW optimization is largely detached from HW considerations: no particular attention is given to the specifics of the host hardware.
Table 1: SW optimization solutions classified by their leverage point within an application's life cycle (surviving labels: Algorithm, Source Code, Compilation (static)).
Meanwhile, code specialization to a specific host hardware is still a barely explored way of code optimization. The evidence lies in the fact that most existing hardware-performance models, which are crucial for such approaches, are hardly exploitable in automatic SW optimization. The roofline model [18] is representative of this trend. Indeed, this model is a pair of constant thresholds that the performance of a kernel cannot exceed. Thus, it cannot be used to spot the hardware parameters (such as the cache sizes, associativity or replacement policy) that may most influence the kernel's performance. Our aim in this paper is to propose a method to build cache-miss models that exhibit a higher correlation between source-code performance and host-hardware characteristics. To the best of our knowledge, the data-cache-miss modeling methods closest to our work are proposed in [19] and [1]. These methods estimate the data-cache-miss model of a program by analyzing its data-reuse distance (defined as the number of distinct data elements accessed between two consecutive references to the same element [19]). Unlike our method, these methods produce models that do not depend on the memory-access pattern of the algorithm but on the input data used during the test execution. Thus, the models generated by these methods are only valid for a specific set of input data.
BACKGROUND: MATRIX DATA-LAYOUTS
When dealing with memory for performance optimization, two closely intertwined aspects need to be considered: the layout used to store the data in memory and the pattern followed to access the addresses.
In this section, we give an overview of the data-layout architectures that we have considered. By data layout we refer to the geometrical shape followed by the data addresses. The data reorganizations that we deploy operate at the virtual-address level. We also assume memory-access patterns similar to those of common matrix multiplication [5] and convolution [14].
Flattening 2D Structures Within 1D Array
A first way to access a cell $(x,y)$ of a dynamically declared one-dimensional line-major matrix array is to use the equation $@_{(x,y)} = @_{base} + (xN + y)D$, where $N$ is the number of elements per line and $D$ the size of one element in bytes. Thanks to its simplicity, this method accesses a cell (at address $@_{(x,y)}$) in roughly one computation and only one memory access (assuming that the initial address $@_{base}$ of the array and the values of $x$ and $y$ are stored in processor registers). Furthermore, keeping cells that belong to contiguous lines within the same block of addresses contributes to cache and page locality. Indeed, the residual of a matrix line (the extra bytes fetched into the caches along with it) has a high probability of being used immediately afterward if it contains data of the following matrix line. It also leads to a high prediction-hit ratio for the prefetcher by creating a high regularity within the accessed addresses. Given the growing impact of the memory wall, this high locality and relatively reduced number of memory accesses represent an interesting performance advantage. The main limitation of this one-dimensional data-layout family is its lack of scalability with respect to the number of concurrent threads. The relative proximity between addresses belonging to independent lines increases the probability that they fall within the same cache line. Accessing these addresses concurrently would thus trigger false sharing [10], which may increase the access time by up to 100 CPU cycles. Most methods proposed to reduce false sharing ([15] and [4]) are based on memory alignment. Thus, they cannot be applied to the current data layout without prohibitively increasing its memory-access cost.
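As a minimal sketch of the address computation above, the following Python snippet evaluates $@_{(x,y)} = @_{base} + (xN + y)D$ and checks that the last cell of one line and the first cell of the next are adjacent in memory. The function name `cell_address` and all numeric values (base address, 8-byte elements, 64-byte cache lines) are illustrative assumptions, not values taken from the paper.

```python
def cell_address(base, x, y, n_cols, elem_size):
    """Byte address of cell (x, y) in a line-major 1D array.

    base, n_cols and elem_size play the roles of @base, N and D.
    """
    return base + (x * n_cols + y) * elem_size


# Hypothetical values chosen for illustration only.
BASE = 0x1000      # assumed start address of the array
N = 10             # elements per matrix line
D = 8              # bytes per element (e.g. a double)
CACHE_LINE = 64    # assumed cache-line size in bytes

end_of_line_0 = cell_address(BASE, 0, N - 1, N, D)
start_of_line_1 = cell_address(BASE, 1, 0, N, D)

# Consecutive lines are contiguous: the gap is exactly one element.
gap = start_of_line_1 - end_of_line_0

# Both cells fall in the same cache line here. This helps locality for a
# single thread, but it is also the situation that can lead to false
# sharing when two threads each own one of the two lines.
same_cache_line = (end_of_line_0 // CACHE_LINE) == (start_of_line_1 // CACHE_LINE)
```

With these assumed values, the two cells are one element apart and share a cache line, which illustrates both the locality benefit and the false-sharing risk discussed above.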
Multidimensional Matrix Storage
A second way to store a dynamically declared matrix is to use a multidimensional array. Each cell of the array is a pointer to either a payload-data array (Figure 1a) or another array of pointers (Figure 1). Each array (in each dimension) is dynamically allocated independently of the others. Similar principles are used by the Java implementation of N-dimensional data layouts. The main advantage of multidimensional matrix storage is its modularity. The data may be split with respect to any dimension (projection onto hyperplanes) in order to fit an ideal workload distribution among threads or to suit the hardware and OS specifications (cache-line size, virtual page size or process-fork buffer). Figure 1 shows how adapting the data layout to the access pattern helps to significantly reduce the number of cache misses. We may thus improve scalability with respect to the workload, the number of threads¹ and the hardware dimensions.
In the rest of this paper and for the sake of clarity, we will always consider multidimensional matrices as line-major. We will also assume a line-major exploration.
PROPOSED DATA-CACHE-MISS MODEL
In this section, we propose a method to accurately model the number of data-cache misses triggered while accessing each family of data layout presented in Section 3. The objective is to highlight the parameters (such as data alignment or block subdivision) that may have a significant impact on data-cache misses (when dealing with access patterns similar to matrix multiplication and convolution).
Data-Cache-Miss Modeling on a 1D Memory Block
The data of the considered kernels are accessed sequentially. Consequently, the proposed method first considers a simple 1D array of $N$ elements located at address $a_0$, where each element is $D$ bytes in size. The data are fetched into a cache made of $C_{total}$ lines, each of size $C$ bytes. Throughout our modeling process, we assume that all these constants belong to $\mathbb{N}^*$ (the set of strictly positive natural numbers). We also assume that the array has not been accessed yet (hence it is not present in the considered cache).
¹ It would, for instance, help implement the previously cited solutions to false sharing.
The number $n_0$ of cache misses triggered while accessing the array sequentially is given by Equation 1 (the notation $a_0[C]$ denotes the remainder of the Euclidean division of $a_0$ by $C$).
$$n_0 = 1 + \left\lceil \frac{N D - (C - a_0[C])}{C} \right\rceil \quad (1)$$
Accessing one byte at an address $a_0$ fetches the corresponding data at position $a_0[C]$ within a cache line, the rest of the cache line being populated with the data at the addresses surrounding $a_0$. Consequently, at the initial data access (address $a_0$), $C - a_0[C]$ bytes of the array are fetched into the cache. The next data access to trigger a cache miss (at address $a_0 + C - a_0[C]$) is thus aligned on the cache-line size $C$ (because $a_0 + C - a_0[C] \equiv 0 \pmod{C}$). From then on, the data are fetched in chunks of $C$ bytes. The number of subsequent fetches is found by dividing the number of bytes remaining after the first access ($N D - (C - a_0[C])$) by the size of one chunk ($C$). We then take the ceiling of the result to cover the case where the last chunk of the array is smaller than $C$.
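The reasoning above (one initial miss, then one miss per chunk of $C$ bytes) suggests the closed form $n_0 = 1 + \lceil (N D - (C - a_0[C])) / C \rceil$. The sketch below implements this form and compares it with a brute-force count of the distinct cache lines touched by the array. It assumes a cold cache that is large enough for no line to be evicted during the scan; the function names are hypothetical and are not part of the paper.

```python
def misses_1d(a0, n, d, c):
    """Closed-form miss count for a cold sequential scan of a 1D array.

    a0: start address, n: number of elements, d: bytes per element,
    c: cache-line size in bytes. All are positive integers.
    """
    first_fetch = c - (a0 % c)        # array bytes brought in by the first miss
    remaining = n * d - first_fetch   # bytes still to fetch (may be negative)
    # Integer ceiling of remaining / c, written without floating point.
    return 1 + -(-remaining // c)


def misses_1d_reference(a0, n, d, c):
    """Reference count: distinct cache lines touched, byte by byte."""
    return len({addr // c for addr in range(a0, a0 + n * d)})
```

The reference function is deliberately naive: it only serves to check the closed form on small inputs, including arrays shorter than one cache line, for which the ceiling term evaluates to zero.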
In the context of simple matrix multiplication or convolution, a line $j$ of the matrix is generally traversed right after the previous line $j-1$. Given the cache behavior described above, an initial part of line $j$, of size $L_j$ bytes, is already loaded when we start accessing it. Equation 2 shows how we have introduced this new parameter into the proposed cache-miss model in order to evaluate $n_j$ (the number of cache misses triggered while exploring the $j$-th matrix line). This is achieved by reducing the total number of bytes $N D$ by $L_j$. We also increment the initial address $a_j$ of the $j$-th line of the matrix by the same value.
Finally, and for the same time-proximity reason, it is important to know how many residual bytes² are loaded along with each line of a matrix. In Equation 3, we model this residual $r_j$ for a given line $j$ using the same kind of reasoning on cache fetching and positioning.
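The two quantities described above can be sketched as follows. This is an illustration derived from the prose (shift the start address by $L_j$, shrink the byte count by $L_j$, and measure what is left of the last cache line), not a transcription of Equations 2 and 3. The function names and the choice of returning zero misses for a fully preloaded line are assumptions.

```python
def misses_line(a_j, n, d, c, l_j):
    """Misses while scanning line j when its first l_j bytes are cached.

    a_j: start address of the line, n: elements per line,
    d: bytes per element, c: cache-line size, l_j: preloaded bytes.
    """
    remaining_bytes = n * d - l_j
    if remaining_bytes <= 0:
        return 0                     # assumption: a fully preloaded line costs nothing
    start = a_j + l_j                # first address that is not cached yet
    first_fetch = c - (start % c)    # bytes brought in by the first miss
    return 1 + -(-(remaining_bytes - first_fetch) // c)


def residual(a_j, n, d, c):
    """Bytes located after the end of line j inside its last cache line."""
    end = a_j + n * d                # first address after the line
    return (-end) % c                # distance to the next cache-line boundary
```

With `l_j = 0` the first function falls back to the cold-cache case, so the 1D model is recovered as a special case.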
Extension of the Data-Cache-Miss Model to 2D Memory Blocks
In subsection 3.1, we considered the case where a matrix is stored within a 1D data layout. In that case, the number $L_j$ of bytes already located in the cache when we start exploring line $j$ is exactly the number of residual bytes $r_{j-1}$ loaded while fetching line $j-1$.
Equation 4 gives the total number $n^*$ of cache misses triggered during a multiplication of matrices stored within a 1D data layout. We obtain this formula by first replacing $L_j$ with the expression of $r_{j-1}$ in Equation 2 (using the initial value $L_0 = 0$). We then replace the address $a_j$ with $a_0 + jND$. Finally, we use Weyl's criterion applied to rational numbers in order to sum the cache misses for each column.
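The substitution steps above can be checked numerically. The sketch below does not reproduce the closed form of Equation 4; it simply accumulates the per-line misses with $L_0 = 0$, $L_j = r_{j-1}$ and $a_j = a_0 + jND$. The function name, the parameter `n_lines` and the assumption that no cache line is evicted during the traversal are illustrative choices.

```python
def total_misses_flat(a0, n_lines, n, d, c):
    """Sum of the per-line misses n_j for a flat, line-major matrix.

    a0: base address, n_lines: number of matrix lines,
    n: elements per line, d: bytes per element, c: cache-line size.
    """
    line_bytes = n * d
    total = 0
    preloaded = 0                             # L_0 = 0
    for j in range(n_lines):
        a_j = a0 + j * line_bytes             # a_j = a_0 + j*N*D
        todo = line_bytes - preloaded         # bytes of line j not cached yet
        if todo > 0:
            start = a_j + preloaded
            first_fetch = c - (start % c)
            total += 1 + -(-(todo - first_fetch) // c)
        # The residual r_j of this line becomes L_{j+1} for the next one.
        preloaded = (-(a_j + line_bytes)) % c
    return total
```

Because the lines are contiguous, the result should match the number of distinct cache lines covered by the whole block, which gives a simple way to sanity-check the per-line bookkeeping.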
In this section, we consider matrices stored within 2D arrays (as shown in subsection 3.2); the number $L_j$ of pre-fetched bytes for a line $j$ is no longer the number of residual bytes $r_{j-1}$³. In order to represent $L_j$, we need to model, for the given memory allocator, the distance between two consecutively allocated lines of the matrix. In this context, we have considered ptmalloc (version 2.19), the heap allocator based on Doug Lea's malloc algorithm that has become the default Linux GLIBC implementation [11]. As shown in Figure 2, a memory block (also known as basic block) allocated using the malloc function of ptmalloc has a size that is a multiple of $B$, where $B$ is usually equal to 8 or 16 depending on the processor architecture. The returned block contains a reserved section (tail) of size $T = 8$ bytes at its end.
It is also allocated at an address that is a multiple of $B$ (however, the payload data may be shifted within the basic block). Consequently, the distance $D_j$ between two arrays of size $N D$ allocated consecutively is $D_j = kB - N D - S$, where $k$ is a positive integer (the block size being a multiple of $B$) and $S$ is the shift of each array within its basic block. In Equation 5, we estimate the number of pre-fetched bytes $L_j$ for an array $j$ of a matrix. It is obtained by subtracting the distance $D_j$ from the number of residual bytes $r_{j-1}$ of the previous array $j-1$.
² Residual is the number of bytes that do not belong to a given line but that are fetched along with it into the cache.
³ Even though we still have $\forall j \in [1, N-1], L_j \le r_{j-1}$.
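A rough sketch of this allocator model is given below. It evaluates $D_j = kB - ND - S$ and then subtracts $D_j$ from $r_{j-1}$ as described in the text. Two points are assumptions made for the illustration and are not stated in the paper: $k$ is taken as the smallest integer for which the block holds the shift, the payload and the tail, and the result is clamped at zero. The function names are hypothetical, and the sketch does not call or model the real glibc allocator.

```python
def block_gap(n, d, b, t=8, s=0):
    """Distance D_j = k*B - N*D - S between two consecutively allocated lines.

    n: elements per line, d: bytes per element, b: block granularity B,
    t: tail size T, s: shift S of the payload inside its basic block.
    Assumption: k is the smallest integer such that k*B >= S + N*D + T.
    """
    payload = n * d
    k = -(-(s + payload + t) // b)   # integer ceiling of (S + N*D + T) / B
    return k * b - payload - s


def preloaded_bytes(r_prev, d_j):
    """L_j: residual of the previous line minus the gap D_j.

    Assumption: clamped at zero, since a negative byte count is meaningless.
    """
    return max(0, r_prev - d_j)
```

Under these assumptions, a line of ten 8-byte elements with $B = 16$ and $S = 0$ occupies a 96-byte block, so the gap is 16 bytes and only residuals larger than 16 bytes contribute to $L_j$.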
RESULTS AND DISCUSSION
In this section, we evaluate the correctness and the accuracy of the generated performance models. In this paper, accuracy refers to how precisely a model allows us to spot the parameters (HW, SW and OS related) that may significantly influence the performance of the corresponding kernel. Indeed, our objective is to find such parameters for a given source code in order to tune them. Any further meaning of "accuracy" is not relevant to our approach to SW optimization. Consequently, we compare none of the cache-miss models that our method generates with experimental evaluations.
Experimental setup
All the presented performance results are obtained following the same experimental protocol. Each considered point is assessed (experimental run) 10 times⁴. The corresponding values
