The performance of OpenMP applications executed on multisocket multicore processors can be limited by the memory interface. In a multisocket environment, each multicore processor can suffer performance degradation in memory-bound parallel regions whose threads share the same Last Level Cache (LLC). We propose a characterization of the performance of parallel regions to estimate cache misses and execution time.
Introduction
Performance analysis on shared memory systems must consider multicore multisocket environments, with different sharing levels of resources in the memory hierarchy. To take advantage of shared memory systems, the high performance computing community has developed the OpenMP Application Program Interface (OpenMP), which defines a portable model for shared-memory parallel programming. However, depending on memory utilization, the memory interface can become a bottleneck. It is possible to group threads so that they benefit from shared memory or, conversely, to distribute them across the memory hierarchy or restrict their number to avoid degradation due to memory contention.
To this aim, we propose a performance model based on characteristics of the multicore multisocket architectures and the application memory pattern. The model estimates the runtime of an application for a full set of different configurations in a system regarding the thread distribution among cores (affinity) and number of threads. The model is evaluated using runtime measurements on a partial execution of the application in order to extract the application characteristics.
To develop our approach we have made the following assumptions: 1) the application is iterative and all iterations have a uniform workload; 2) the workload is evenly distributed among threads; 3) performance degradation is mainly caused by memory contention at the LLC; and 4) all processors in the socket are homogeneous. The input parameters of the model are based on measurements from a single-socket execution.
Taking into account these assumptions, our contributions are the following:
• A performance model to estimate the LLC misses for different affinities at the level of individual parallel regions.
• A performance model to estimate the execution time for a parallel region, considering an empirical value to adjust the parallelism degree at the memory interface level and data access pattern.
The experimental results show that, by using the estimated values to select the best configuration, a speedup of up to 2.74 is achieved over the execution with all threads.
This paper is structured as follows. Section 2 introduces related work about analytical performance modeling. Section 3 introduces our performance model for estimating total cache misses (TCM) at the last level cache (LLC) and estimated execution time. The model is validated in Section 4, where it is evaluated using the SP and MG benchmarks for two different architectures. Section 5 summarizes our conclusions and describes our ongoing work.
Related work
There are several approaches to estimating the performance of shared memory systems. Tudor [7] presents a performance analysis for shared memory systems and a performance model, validated with the NAS Parallel Benchmarks. This model considers the idleness of threads, which appears in context switching, especially when more than one thread per core is executed. Our model focuses instead on cache behavior, because memory contention is the main cause of performance degradation in memory-bound HPC applications.
We use the idea of performance degradation in the context of parallel executions. Along these lines, [2] presents the impact of cache sharing, with an analysis based on the characterization of applications on isolated threads, and Zhuravlev [9] presents two scheduling algorithms that distribute threads based on miss rate characterization. Dwyer et al. [3] present a practical method for estimating performance degradation on multicore processors. Their analysis is based on machine learning algorithms and data mining for attribute selection of native processor events. We also obtain information from hardware performance counters, but instead of relying on a knowledge database built in a postprocessing analysis, we use empirical data from a reduced sample that can be collected at runtime.
Regarding the hardware, the Roofline model [8] is a visual computational model that helps identify application characteristics such as memory-bound limitations. It shows how operational intensity can provide insight into architecture and application behavior; however, it is oriented toward guiding development and suggesting optimizations of the source code. In our case, we present a model for selecting an affinity configuration, intended to be used at runtime in an automatic tuning tool.
Performance Model proposal
The performance degradation of memory-bound applications considered in this work depends on the application's data access pattern and its concurrency at cache level. Therefore, characteristics such as workload and data partitioning, the degree of data reuse of the access pattern (based on temporal and spatial locality), data sharing between threads, and data locality in the memory hierarchy must be considered. Consequently, improving performance requires a deep knowledge of the application behavior and the system architecture.
Iterative applications can provide similar performance among iterations. In this case, it is possible to apply a strategy (Figure 1) that evaluates the behavior of the application for a reduced set of iterations with different configurations of the degree of parallelism and thread pinning. Our model uses these measurements to estimate the execution time for the total set of configurations in the system.
Defining the performance model
In order to apply the proposed model, NC executions with parallel region profiling are required, where NC is the number of cores in a single socket and the i-th execution runs on threads 0 to i − 1. This allows us to obtain the model's input parameters for time and hardware counters (LLC_MISSES). We consider that the ideal run time is mainly altered by memory contention at the shared cache level. This contention is measured by the LLC_MISSES hardware counter, which provides the number of inclusive miss events at the LLC for the system architecture, meaning that the data is not present in the socket and must be fetched from main memory. We collect the total cache misses (TCM) generated at the last level cache for each parallel region in order to analyze the concurrency overhead.
The parameters involved in our model are described in Table 1.
Model input parameters
We propose to measure performance degradation on an isolated socket. Therefore, the model considers two known elements: the increase of TCM on a single socket due to concurrency, and its overhead time (taking into account the parallelism at memory level). Concurrency behavior in a single socket at the last level cache is represented by the vector CF of concurrency factors, defined in expression 1.
Each $cf_i$ is the ratio, defined in expression 2, between the measured TCM for a 1-thread execution and the measured cumulative TCM for an execution with i threads in the socket. This vector can be generated for each parallel region.
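As a rough illustration of how the CF vector could be derived from the single-socket profile, the following Python sketch assumes that $cf_i$ is the TCM measured with i threads divided by the TCM measured with 1 thread. The direction of the ratio, the function name, and the sample counter values are assumptions made for illustration only; they are not measurements from this work.

```python
def concurrency_factors(measured_tcm):
    """Return [cf_1, ..., cf_NC] from cumulative LLC misses measured with 1..NC threads.

    Assumption (not confirmed by the text): cf_i = TCM_i / TCM_1, so cf_1 == 1.0
    and values above 1.0 indicate extra misses caused by sharing the LLC.
    """
    if not measured_tcm or measured_tcm[0] <= 0:
        raise ValueError("need a positive 1-thread TCM measurement")
    base = measured_tcm[0]
    return [tcm / base for tcm in measured_tcm]


# Hypothetical counter values for a 4-core socket (illustrative only).
sample_tcm = [1.0e6, 1.5e6, 2.5e6, 4.0e6]
cf = concurrency_factors(sample_tcm)
print(cf)
```

Under this reading, a vector that stays close to 1.0 would suggest little LLC contention for that parallel region, while a growing vector would indicate increasing contention as threads are added to the socket.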
On the other hand, to estimate the overhead time generated by memory accesses, we must consider that the memory interface is capable of achieving a degree of parallelism when resolving access requests. The full utilization of this parallelism depends on the application data access pattern. Therefore, to express the relation between the achieved memory parallelism on a single socket and the application behavior, we define the vector BF of β factors in expression 3. These values are also obtained from the values measured in a single-socket execution.
Each $\beta_i$ factor (defined in expression 4) represents, for the execution with i threads in a socket, the relation between the measured time (mmTime) and the overhead for the worst-case scenario, providing a ratio of memory parallelism. The worst case is a serialized data miss access with no memory parallelism, implying a latency overhead per data miss. We also define the ideal time ($idealTime_i$) as $T_1 / NT_i$, where $T_1$ is the execution time for 1 thread and $NT_i$ is the number of threads of the i-th execution.
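The ideal time defined above can be computed directly from the 1-thread measurement. The minimal Python sketch below also derives a per-run overhead as measured time minus ideal time, which is an assumption consistent with treating estimated time as ideal time plus overhead; it does not attempt to reproduce expression 4. Function names and timing values are hypothetical.

```python
def ideal_times(t1, num_runs):
    """idealTime_i = T_1 / NT_i, with NT_i = i threads for runs i = 1..num_runs."""
    return [t1 / nt for nt in range(1, num_runs + 1)]


def measured_overheads(measured_times):
    """Assumed overhead per run: measured time minus ideal time.

    measured_times[i - 1] is the time of the run with i threads on one socket.
    """
    ideal = ideal_times(measured_times[0], len(measured_times))
    return [mm - it for mm, it in zip(measured_times, ideal)]


# Hypothetical single-socket timings in seconds for 1..4 threads.
sample_times = [12.0, 7.0, 5.5, 5.0]
print(ideal_times(sample_times[0], 4))
print(measured_overheads(sample_times))
```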
Following this, to represent the set of possible configurations in a system with NS sockets, the thread configuration is represented in expression 5 as the affinity vector AFF, where each component $aff_s$ represents the number of threads in the s-th socket.
$$AFF = \{aff_1, \ldots, aff_{NS}\}, \quad s \in 1..NS \qquad (5)$$
The maximum number of threads in each socket is NC, so configurations range from 1 thread to NS × NC threads. This definition allows us to consider configurations independently of thread positioning on the socket, that is, by considering homogeneous threads, where a thread and its siblings in a socket are equivalent. Furthermore, configurations with the same number of threads per socket but with a different socket order are also considered equivalent (e.g. AFF={1,2} is equivalent to AFF={2,1}).
Finally, this definition yields $numConf = \binom{NC+NS}{NS} - 1$ possible configurations, where NC is the number of cores per socket and NS is the number of sockets in the system. The model provides an estimation for all numConf affinities (AFF) in the system and allows selecting the configuration with the minimum estimated execution time.
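The configuration count can be checked by enumeration. The following Python sketch lists every affinity vector as an unordered multiset of per-socket thread counts (each between 0 and NC), excludes the empty configuration, and compares the total with the binomial expression. The function name is hypothetical; the two system sizes are the ones used later in the experimental validation.

```python
from itertools import combinations_with_replacement
from math import comb


def affinity_configurations(num_sockets, cores_per_socket):
    """Enumerate AFF vectors up to socket reordering, excluding the all-zero vector."""
    counts = range(cores_per_socket, -1, -1)  # NC, NC-1, ..., 0
    return [
        aff
        for aff in combinations_with_replacement(counts, num_sockets)
        if sum(aff) > 0
    ]


for ns, nc in [(2, 6), (4, 10)]:
    configs = affinity_configurations(ns, nc)
    # Both columns should agree: enumerated count vs. C(NC + NS, NS) - 1.
    print(ns, nc, len(configs), comb(nc + ns, ns) - 1)
```

Each tuple is emitted in non-increasing order, so {2,1} and {1,2} are represented once, as (2, 1).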
Estimating TCM & Execution Time
In order to estimate the TCM generated in a socket for a given affinity configuration, we represent the estimated TCM by expression 6,
where s is the socket index and NT(AFF) denotes $\sum_{x=1}^{NS} aff_x$, i.e., the total number of threads for the AFF configuration.
Finally, the time estimation for the affinity configuration is given by the ideal execution time and the overhead time (TOvhd), as shown in expression 7,
where TOvhd(AFF), presented in expression 8, is the calculated overhead depending on the data access pattern. If the pattern is unknown, the TOvhd(AFF) value can be interpolated between the best and the worst case scenario. The worst case scenario is the serialized access pattern, which takes the summation (SUM) of all the socket overheads; the best case scenario is the fully parallel memory access between sockets (MAX), which takes the maximum overhead estimated over all sockets.
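The two boundaries can be sketched as follows, given a per-socket overhead for each socket of a configuration. The linear interpolation with a weight `alpha` is only one possible way to place an estimate between the boundaries and is an assumption of this sketch, as are the function names and sample values.

```python
def overhead_bounds(per_socket_overhead):
    """Return (best, worst): MAX for fully parallel access, SUM for serialized access."""
    return max(per_socket_overhead), sum(per_socket_overhead)


def interpolated_overhead(per_socket_overhead, alpha):
    """Assumed linear interpolation: alpha = 0 gives MAX, alpha = 1 gives SUM."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    best, worst = overhead_bounds(per_socket_overhead)
    return best + alpha * (worst - best)


# Hypothetical per-socket overheads in seconds for a two-socket configuration.
sample_overheads = [2.0, 1.0]
print(overhead_bounds(sample_overheads))
print(interpolated_overhead(sample_overheads, 0.5))
```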
Therefore, in order to describe the overhead time per socket, we define TOvhd(AFF, s) in expression 9, which represents the overhead generated by the TCM in a socket minus $idealTCM_{aff_s}$, corrected with the β value that corresponds to its concurrency degree ($aff_s$) measured in a single socket. The $idealTCM_{aff_s}$ is obtained from [...]
This model provides the execution time estimation for the AFF vector configuration using only the values of a single-socket execution, and can be applied to all the affinity configurations present in the system. Selecting the optimal configuration is not always trivial but, by applying the model, it is possible to provide an estimation for each configuration and select the one with the minimum execution time.
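The selection step can be sketched in Python as follows. The sketch assumes that the ideal time of a configuration is $T_1$ divided by its total number of threads and that an overhead value per configuration is already available; the overheads, the 1-thread time, and the function names below are hypothetical placeholders rather than outputs of the model.

```python
def estimate_time(t1, aff, overhead):
    """Estimated time = ideal time + overhead, with ideal time assumed as T_1 / NT(AFF)."""
    return t1 / sum(aff) + overhead


def best_configuration(t1, overhead_by_aff):
    """Return the affinity vector with the minimum estimated execution time."""
    return min(
        overhead_by_aff,
        key=lambda aff: estimate_time(t1, aff, overhead_by_aff[aff]),
    )


# Hypothetical overheads in seconds for a system with 2 sockets and 2 cores each.
sample_overhead_by_aff = {
    (1, 0): 0.0,
    (2, 0): 1.0,
    (1, 1): 0.5,
    (2, 1): 1.0,
    (2, 2): 2.5,
}
best = best_configuration(12.0, sample_overhead_by_aff)
print(best, estimate_time(12.0, best, sample_overhead_by_aff[best]))
```

In this made-up example the full configuration (2, 2) is not the fastest one, which is the kind of situation the selection is meant to detect.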
Experimental validation
In this section we present the experimental validation of the proposed performance model. We have used two different architectures (Table 2), T7500 and SuperMIG, and representative regions of interest of the memory-bound SP (scalar pentadiagonal solver) and MG (Multi-Grid) benchmarks from the NAS Parallel Benchmarks [1] (NPB3.3.1-OMP), using different workloads.
Firstly, we introduce application and system characterization. Next we present the validation of the model on the T7500 system with two sockets per node and 6 cores in a socket, and the validation of the model on the SuperMIG system with 4 sockets and 10 cores per socket, allowing us to evaluate the model for a greater number of configurations.
By using the definition of AFF provided in the previous section, the total number of possible configurations (numConf) is 27 for the T7500 system and 1000 for the SuperMIG system.
The SP application has 4 principal parallel regions: 3 parallel loops (in the x_solve, y_solve, and z_solve functions), each representing about 15% of the total execution time, and one parallel region (in the rhs function) with inner loops representing between 20% and 40% of the execution time, depending on the degree of parallelism. The MG application presents 2 parallel loop regions of interest, Reg 011 (mg.f 614-637) and Reg 013 (mg.f 543-566), representing 28% and 16%, respectively, of the total execution time.
To compare the measurements and the estimations, we have executed the benchmarks for different numbers of threads and representative affinities. We have used the ompP [4] profiler to obtain performance information at application and parallel region level. ompP is also integrated with PAPI [5] to obtain hardware counter information. We considered the full profiling information for the MG benchmark, and a reduced number of iterations for the SP benchmark, namely 100 iterations for class C and 10 iterations for class D.
The information given by PAPI is based on preset counters. We observe that the load (LD_INS), store (SR_INS), total (TOT_INS), and floating point (FP_INS) instructions are distributed evenly between threads. TCM for cache levels 1, 2 and 3 (L1_TCM, L2_TCM, and L3_TCM) has been evaluated to characterize the memory contention problem of the applications. Execution with the likwid-pin tool [6] allows pinning threads to cores in order to evaluate the affinity. The affinity labeled AFF0 assigns threads to cores of the same processor until it is full. Affinities AFFi define a Round-Robin distribution between sockets from a list of threads to be executed, where i represents the chunk size of threads from the list assigned to each socket in turn, until the socket is filled. For example, in a two-socket system with 6 cores per processor, the execution of 9 threads with AFF3 assigns the first 3 threads to socket 1, the next 3 threads to socket 2, and the last 3 threads to socket 1.
The numactl utility has been used to evaluate the behavior for different memory mappings by using two configurations: localalloc, to force allocation close to the master thread, and interleave=all, where memory is allocated evenly across the whole set of NUMA nodes.
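The affinity labels can be translated into per-socket thread counts with a short routine. The Python sketch below follows the description above: a chunk of i threads is assigned to each socket in turn, full sockets are skipped, and AFF0 is encoded here as chunk size 0, meaning "fill one socket before moving to the next". The function name and this encoding of AFF0 are illustrative assumptions.

```python
def affinity_vector(num_threads, num_sockets, cores_per_socket, chunk):
    """Return the number of threads per socket for affinity AFF<chunk>."""
    if chunk < 0:
        raise ValueError("chunk must be >= 0")
    if not 0 <= num_threads <= num_sockets * cores_per_socket:
        raise ValueError("thread count exceeds the number of cores")
    if chunk == 0:  # AFF0: fill a socket completely before using the next one
        chunk = cores_per_socket
    counts = [0] * num_sockets
    remaining = num_threads
    socket = 0
    while remaining > 0:
        free = cores_per_socket - counts[socket]
        take = min(chunk, free, remaining)  # 0 when the socket is already full
        counts[socket] += take
        remaining -= take
        socket = (socket + 1) % num_sockets
    return counts


# Example from the text: 9 threads, AFF3, two sockets with 6 cores each.
print(affinity_vector(9, 2, 6, 3))
```

For the example in the text, sockets 1 and 2 end up with 6 and 3 threads respectively, since the last chunk of 3 threads returns to socket 1.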
Applying the model for the SP application on the T7500 system.
In this section, we apply the model to a parallel region of interest to evaluate the NAS SP class C on the T7500 system, in order to compare the model estimation against the execution times for two different affinity distributions.
The information from the profiled execution on a single socket is used, considering the values from 1 thread to the total number of cores per socket ("# cores per socket" in Table 2).
The first step is to compute the CF vector and the BF vector using the TCM and times per parallel region. The input data is shown in Table 3.
Following this, the CF vector is used to estimate the TCM for a specific AFF configuration. In this example, if we consider AFF1, distributing from 1 thread up to the total number of cores in the T7500 system in a Round-Robin distribution, we obtain the different configurations listed in Table 4, in column NT(s1, s2). Applying expression 6 for each combination of number of threads in the sockets, we obtain estTCM(AFF, i) per socket and the cumulative estimation Cum.TCM, which is presented in column ECTCM. For this configuration, the relative error between the estimated TCM and the measured TCM is presented in column %RE.
Relative error is less than 20%, and we can observe that our estimation represents the behavior of the measured values.
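For reference, a relative error of this kind is commonly computed as the absolute difference between estimation and measurement, divided by the measurement. The Python sketch below uses that conventional definition, which is assumed here rather than taken from the text; the function name and the counter values are hypothetical.

```python
def relative_error_percent(estimated, measured):
    """Conventional relative error in percent: 100 * |estimated - measured| / |measured|."""
    if measured == 0:
        raise ValueError("measured value must be non-zero")
    return 100.0 * abs(estimated - measured) / abs(measured)


# Hypothetical estimated and measured cumulative TCM values.
print(relative_error_percent(9.0e6, 1.0e7))
```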
Using the estimated TCM, we apply expression 7 to obtain the final time estimation (estTime(AFF1)) for affinity 1. For this case, we evaluate two different estimations: one that assumes serialized memory access and one that assumes ideal parallel memory access. The first considers the overhead as the summation of the overhead times per socket. The second assumes full parallelism on memory accesses, implying that the overhead time is determined by the slowest socket, that is, by the maximum time estimation over the sockets.
Both estimations are shown for the two affinity distributions (0 and 1) in Figure 2. Figure 2 shows that the measured time lies between the two estimated boundaries and, in this case, is similar to Estimation Max., meaning that the memory accesses are parallelized between the sockets. Furthermore, Estimation Max. presents the same behavior and leads us to identify the best configuration, which in this case is AFF1 using 6 threads (equivalent to socket configuration {3,3}). The median error for the best estimation is 5%, and the average error is less than 8%.
Selecting a configuration for SP and MG benchmarks on SuperMIG
We present the application of the model for SP and MG, with different workloads, on the SuperMIG system. The experiments are configured to evaluate the two boundaries at memory level. We use the numactl tool to allocate memory near the master thread (localalloc), to achieve a serialized memory access at socket level, and interleaved allocation (interleave=all), to force data distribution between sockets and parallel memory accesses.
The model is applied considering the single-socket measurements, and the results are shown in Table 5.
Table 5 shows that, for the SP benchmark, local allocation leads to serialized memory access. This is because data needs to be accessed through the same socket, and this contention produces a serialized behavior. For the distributed allocation, the memory access pattern allows more parallelism, improving performance and minimizing the memory bottleneck. The model has provided a configuration with minimum execution time and an average error of less than 14%. MG has been run with forced local allocation; however, it uses a different data access pattern and a higher workload. We have observed that its memory access is neither fully parallelized nor serialized; therefore we used the closer boundary, Max. Estimation, which does not exactly represent the data access pattern and thus increases the error.
Exploration of the affinity configurations.
In this section we discuss the benefits of applying the model in a system with multiple sockets, and the speedup achieved by allowing the selection of a configuration with the model compared to the execution with all threads.
The main point is to rapidly detect memory bottlenecks in parallel regions and select a configuration that minimizes the contention overhead, and also to provide an estimation approach for the whole range of configurations without a full execution.
We present a model that provides an estimation for the whole range of configurations and can be applied with a minimum characterization on a single socket. Figure 3 shows the measured and the estimated execution times for a subset of 10 affinity configurations (following the definition in expression 5) on the SuperMIG system. We can observe that Figure 3(a) presents a memory contention problem when using a full thread execution. The minimum of the measured and estimated execution times is shown on a contour surface. The minimum execution time is achieved by using about 20 threads in the configuration that provides the least concurrency per socket (e.g. AFF1(20) = {5,5,5,5}, that is, using half of the threads per socket).
Figure 3(b) shows that MG does not present significant variation between affinities, and time is reduced using more threads.
Finally, we present in Table 6 the comparison between an unguided execution using all threads, and the configuration provided by the model. The speedup is calculated using the measured time for full execution and measured time for the selected configuration.
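The speedup described above is the ratio of two measured times. A minimal Python sketch, with a hypothetical function name and hypothetical timings, is:

```python
def speedup(full_execution_time, selected_configuration_time):
    """Speedup of the selected configuration over the run that uses all threads."""
    if selected_configuration_time <= 0:
        raise ValueError("time must be positive")
    return full_execution_time / selected_configuration_time


# Hypothetical measured times in seconds.
print(speedup(10.0, 4.0))
print(speedup(5.0, 5.0))
```

A value of 1 means that the selected configuration performs the same as the execution with all threads.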
Even though the ideal configuration is not detected in all cases, the selection has provided a configuration with a maximum speedup of 2.74, for SP class C, with affinity 1 and 20 threads. The minimum speedup is 1, meaning that the application neither shows memory contention nor benefits from reducing the number of threads or modifying the affinity.
Conclusions
We have presented a performance model to estimate the LLC misses and the execution time based on the execution of a small set of configurations. This model allows estimating any possible configuration of affinity and number of threads for the system. The performance model has been applied to the NAS SP and MG applications for classes C and D on two different architectures. The results show an average time error of less than 14%. Despite the error, the time estimation preserves the measured behavior, which leads us to select a configuration automatically and makes it possible to improve performance compared with the default configuration.
Our model can rapidly detect memory bottlenecks on each parallel region in an application, and it is possible to identify a configuration that minimizes the contention overhead.
We are analyzing the results in order to improve the estimation between the boundaries (Max and Sum) when the memory access pattern of an application is neither completely serialized nor completely parallel. Furthermore, when the boundaries are widely separated and the measured time lies in between, the error increases. To estimate with more accuracy, it is necessary to consider the overhead of accessing data between different sockets, which some native hardware counters can provide.
Finally, we are currently evaluating real applications on different architectures in order to extend the validation of the model.