Optimal code performance is (besides correctness and accuracy) the most important objective in compute intensive applications. In many of these applications, Graphic Processing Units (GPUs) are used because of their high amount of compute power. However, caused by their massively parallel architecture, the code has to be specifically adjusted to the underlying hardware to achieve optimal performance and therefore has to be reoptimized for each new generation. In reality, this is usually not the case as productive code is normally at least several years old and nobody has the time to continuously adjust existing code to new hardware. In recent years more and more approaches have emerged that automatically tune the performance of applications toward the underlying hardware. In this article, we present the MATOG auto-tuner and its concepts. It abstracts the array memory access in CUDA applications and automatically optimizes the code according to the used GPUs. MATOG only requires few profiling runs to analyze even complex applications, while achieving significant speedups over non-optimized code, independent of the used GPU generation and without the need to manually tune the code.
INTRODUCTION
With the demise of Moore's Law, one key direction in the development of processors has changed from shrinking the architectures to establishing specialized multi-or many-core processors. This allows one to further improve the compute power of today's computers using parallel computation without the physical limitations imposed by further shrinking the architectures. As the word
The work of Nicolas Weber is supported by the "Excellence Initiative" of the German Federal and State Governments and the Graduate School of Computational Engineering at Technische Universität Darmstadt. We want to thank Stefan Guthe for his help analyzing the optimized code. Extension of conference papers. The additional contributions of this article over the previously published work of Weberconfigurations. Further, we introduced an adaptive runtime component in Weber and Goesele (2016) that dynamically selects optimal configurations according to automatically gathered metadata. The contributions of this article are as follows:
-A unified and integrated description of the MATOG components published in the previous conference publications. -An improved application analysis algorithm that further simplifies the application analysis, enabling one to analyze even extremely complex applications within seconds. It limits the available search space while achieving equal results to an exhaustive search. -A new runtime decision-making system that takes not only the most recent but also previously collected metadata into account. -A nearest-neighbor search based decision model that implicitly normalizes metadata, achieving higher performance in some benchmarks compared to a Support Vector Machine (SVM). -An extensive evaluation on a broad range of six applications on eight GPUs, ranging from the Fermi up to the Pascal architecture, i.e., four architecture generations.
This article is structured as follows: Section 2 gives an overview of other auto-tuners and where MATOG is positioned in relation to these; Section 3 introduces the concept behind MATOG and its programming interface; Section 4 explains how we analyze CUDA applications and how to determine optimal configurations; Section 5 explains how our decision making is performed during runtime; Section 6 evaluates the overall system on a series of different CUDA applications; and in Section 7 we discuss our results and conclude with an overview on open issues and possible future work.
RELATED WORK
This section gives a brief overview of prominent auto-tuners, their optimizations, and used techniques. First, an overview of what can be optimized in GPU applications is given, followed by how these optimization approaches are integrated into the applications, how optimal configurations are found, how dynamic auto-tuners identify different data properties, and how this information is used during runtime to select optimal configurations. We mainly concentrate on GPU auto-tuning but also include generic auto-tuning approaches.
"What to Optimize?"
The variety of available optimizations in applications is nearly endless. It starts with simple and easily applicable optimizations such as compiler flags (Ansel et al. 2014) . As compute-intensive applications usually iterate over a lot of data, loop transformations (Hall et al. 2009 ) and optimized launch configurations (Bergstra et al. 2012; Liu et al. 2008 ) are very important. Lutz et al. (2015) optimize the usage of Application Programming Interface (API) calls to optimally overlap memcopy and execution. Magni et al. (2014) apply thread coarsening by fusing multiple work items. As GPUs are usually used due to their high compute power and memory bandwidths, an optimal data access is necessary. This can either be done by specifically targeting particular operations such as in 3D stencil codes (Lutz et al. 2013) , dense (Sørensen 2012) or sparse (Choi et al. 2010; Monakov et al. 2010 ) matrix vector multiplications, or by applying application-independent optimizations, e.g., tuning the cache utilization (Li et al. 2015) , data mapping and partitioning (Ben-Nun et al. 2015) , data placement (Li et al. 2015b) , or data layouts (Kofler et al. 2015; Peng et al. 2016; Edwards and Trott 2013) (while the latter only provides an optimization API and no auto-tuning). Sung et al. (2012) propose a framework that transforms access to arrays of struct type based on static rules and explicit data transformations prior to the kernel execution. Hsu et al. (2014) go one step further 28:4 N. Weber and M. Goesele and propose a hardware controller that performs the necessary layout transformations during the memcopy, eliminating the conversion overhead. More general are approaches such as Muralidharan et al. (2014) , Ansel et al. (2014) , and Nugteren and Codreanu (2015) that allow user-specified optimizations, e.g., alternative implementations and algorithms. MATOG mainly optimizes data layouts, placement, and utilization of caches, but also allows user-specified optimizations.
"How to Integrate Optimizations into the Application?"
There are numerous ways to integrate optimizations into an application. Most prominent is the usage of compilers. Some approaches use code annotations (Lee and Vetter 2014; Han and Abdelrahman 2009; Liu et al. 2008 ) to generate parallel code for given serial code, or use them to apply optimizations such as loop transformations. Li et al. (2015a) apply their optimizations directly to the PTX intermediate format. Auto-tuners that allow user-defined optimizations (as mentioned before), usually require using a specific API to instruct the auto-tuner what to optimize and annotations or macros inside the code, which enable/disable features or adjust parameters. For MATOG we decided to use code generation, which has to be manually integrated into the application and replaces array data types with a MATOG-specific implementation. This is less comfortable to use, as it requires some manual work, but makes the code easily portable between different operating systems and compilers. Our API mimics the CUDA Driver API and therefore requires only a few changes to existing CUDA applications.
"How to Find Optimal Configurations in Short Time?"
Finding optimal configurations or values for parameters is essential for auto-tuning. Mainly there are two methods: to analyze the code (static analysis) and to measure the execution time (profiling). Peng et al. (2016) recently proposed a metric to estimate if either Array of Structs (AoS), Structure of Arrays (SoA), or SoAoS, a hybrid format between AoS and SoA, yields in better performance. The advantage of such static analysis methods over profiling is that it can be performed quite fast, but ignores data-dependent effects that can occur, as it operates on heuristics which make certain assumptions of the data and workload of the application. Further, as the interior of GPUs is proprietary, it is difficult to establish suitable models for static analysis methods, especially if a new architecture with lots of internal changes is released. On the over hand, empirical profiling is quite time intensive, so that numerous different approaches have emerged to reduce the necessary time. Kofler et al. (2015) use a hybrid approach, as they utilize a generic GPU benchmark to capture GPU-specific properties and then project them onto the code to estimate the performance using a directed execution graph. Magni et al. (2014) train a neural network on a dataset consisting of static code features and best candidates for some reference benchmarks. With this trained model they are able to predict optimal thread coarsening factors based on the static code features. Other approaches use methods such as an exhaustive search (Muralidharan et al. 2014) , greedy search (Liu et al. 2008) , swarm search, simulated annealing (Nugteren and Codreanu 2015) , Nelder-Mead downhill simplex method (Chung and Hollingsworth 2004) , or a genetic algorithm (Ansel et al. 2014) to guide the search into the direction of the optimal configuration. In Weber et al. (2015) , we proposed a prediction-based empirical profiling method, that requires only a very small subset of configurations to estimate the execution time of non-profiled configurations. This is based on the observation that changes induced by one optimization have little influence on others. With this method, it is possible to easily estimate the optimal configuration with only a very limited number of executed configurations, making this method very fast while it achieves nearly the same results as an exhaustive search. Fig. 1 . Execution time for two kernel runs with different data. All results have been normalized between 0.0 (best) and 1.0 (worst), and are sorted ascending for the first call (black). Configurations which were optimal for the first call, became the worst in the second (blue) execution. The shown kernel is the first main kernel of our KD-Tree benchmark. The big jumps come from using local memory as a buffer for shared memory. This works well for the black iteration, but not for the blue one.
"How to Identify Different Data Properties?"
To dynamically select optimal configurations during runtime requires that the auto-tuner is able to identify different data properties. Approaches such as Muralidharan et al. (2014) and Chung and Hollingsworth (2004) use user-defined callback functions. This requires that the user knows of the different properties of his data and how to characterize these. In contrast, we use a fully automatic process that relies on metadata that is already in our system (i.e., the launch configuration of a kernel or the size of arrays). However, MATOG also supports monitoring of user-defined variables, which can contain additional information.
"How to Select Optimal Configurations During Runtime?"
With the ability to identify different data properties, it is possible to create decision models based on machine-learning techniques, e.g., regression trees (Bergstra et al. 2012 ) orSVMs (Muralidharan et al. 2014 ) that adjust the application accordingly. In Weber and Goesele (2016) we relied on SVMs, but encountered cases where these performed badly. We now realized that, as the value range for our metadata is unknown, the SVM prioritizes metadata dimensions whose values differ significantly from the training data, causing bad decisions. So we experimented with several other approaches and came up with a nearest-neighbor search that employs an implicit data normalization, yielding equal or better results compared to a SVM.
MATOG AUTO-TUNER
MATOG is an application domain independent array layout auto-tuner for CUDA applications. It has been introduced first in Weber and Goesele (2014) and has been constantly improved since then. MATOG is open source and can be downloaded from our project page. 1 The main concept of MATOG is that it abstracts the memory access from the application and dynamically chooses optimal array layouts during runtime. MATOG is meant to work for arbitrary NVIDIA GPUs. In order to achieve its goal, MATOG requires one to run and analyze the application with real data. During this empirical profiling, it measures the execution time of different data layouts, tracks metadata, and stores the results in a database (Section 4.2). These profiling results are then analyzed (Section 4.3) and used to build decision models that are utilized during runtime to select optimal configurations according to changing data properties (Section 4.4). As shown in Figure 1 , there Listing 1. Example code to show the define/range feature. The first code snippet shows how different implementations can be differentiated using a preprocessor definition. The second shows how a range of values can be evaluated.
can be different optimal configurations of a kernel, depending on the input data. We automatically identify different data properties and adjust the layouts accordingly.
MATOG optimizes array access for one-or multi-dimensional arrays of primitive, struct, and hierarchical types. Pointers can currently not be used as data type. It arranges these arrays in different struct layouts, i.e., AoS, SoA, or AoSoA (often referenced as tiled-AoS (Kofler et al. 2015) or ASTA (Sung et al. 2012) , with a tile size of 32, the size of a CUDA thread group). Further, multidimensional arrays can be stored in a transposed way. The array layouts can be separately optimized for global, shared, and local memory. For read-only arrays residing in the global memory, MATOG determines which of these arrays should use the default and which the non-coherent cache (also known as texture memory) to optimize the cache utilization. Arrays with constant size (known at compile time) and content can further be placed in constant memory. On Fermi and Kepler GPUs, the optimal ratio between L1 cache and shared memory size is automatically determined. Additionally, the programmer can define preprocessor optimizations, to use discrete values and value ranges in his code. This can be used to implement different algorithms for a task, or to evaluate a series of values as parameters of an algorithm as shown in Listing 1. Experienced users can further customize MATOG by defining their own indexing schemes, e.g., to implement a special triangular matrix, or to provide their own allocation/deallocation mechanism, in order to combine MATOG with other frameworks.
Programming Interface
The MATOG programming interface was designed to reduce programming overhead and to be as compatible to CUDA as possible. To support platform independence and easy maintainability, we decided to use code generation instead of an own source-to-source compiler, so no changes to the compile chain have to be applied. MATOG requires only small changes to the source code as it mimics the CUDA Driver API (NVIDIA 2016) interface. It is necessary to use the Driver API instead of the Runtime API as the latter does not allow one to dynamically load and exchange kernel implementations during runtime. To access data MATOG supports a multi-dimensional memory access (e.g., array [x] [y].subarray [z] .field) similar to the code skeletons used by GROPHECY (Meng et al. 2011) . To apply its optimizations, it intercepts all communication to CUDA. 
Programming Example
In order to use MATOG arrays, the programmer has to provide a JSON description (example shown in Listings 2 and 3) of the arrays, which is then used to generate the optimization code. This code can then be included into the application. The main contribution of MATOG is a dynamic selection of optimal parameters at runtime. However, MATOG automatically performs some trivial static optimizations, e.g., it ensures an optimized alignment inside the arrays to prevent alignment spacings (example shown in Figure 2) . Further, the user can specify for each kernel how data is used (only read, only written, or read and write), which enables certain optimizations. This is explained in Section 4.3 in more detail. The programmer can further specify whether multiple arrays always have the same size. This allows one to reduce the number of used registers, as normally every MATOG data structure maintains a separate copy for its sizes. Finally, compiler flags can be specified to enable GPU debugging or specific features (e.g., --use_fast_math). There are three types of MATOG data structures that can be used in the host code: one for arrays located in host memory (Array::Host), one representing dynamic shared memory (Array::Dyn), and one for device memory (Array::Device). Data inside host arrays can be directly accessed. Data transfers are done by the CUDA memcopy functions. For kernel modules, the source code has to be provided instead of a pre-compiled CUDA module. MATOG takes care of the actual compile process in a Just-In-Time (JIT) manner. Nothing else has to be changed for loading and executing GPU code, as MATOG array references can simply be put into the kernel argument list. Listing 4 shows a host code example. Changes are indicated compared to CUDA Driver API. Inside the kernel code, each memory type uses separate implementations. As multiple instances of global or dynamic shared memory arrays can have different layouts, a template is used to distinguish the different instances. Only for static shared memory types, the same layout is used for all instances of the same type. Listing 5 shows an example for kernel code. Figure 3 shows a schematic workflow of MATOG. In the development phase the user specifies the optimizable data structures, which are then generated by the code generator. The user then incorporates these into the application and compiles it using a standard host compiler (e.g., GCC or Visual Studio). Listing 6 shows an example for using a MATOG-based application and how to execute the optimization procedure. By default, MATOG runs in an unoptimized mode. To optimize it, it has to be switched into a profiling mode, which measures the execution time of the kernels. Every time a kernel is executed during optimization, MATOG runs multiple implementations of the same kernel to determine which layouts work best.
Listing 6. Example for running a MATOG application, including the optimization process. If steps 1-7 are omitted, the application will run unoptimized using default implementations for the data structures.
After the profiling, MATOG analyzes the results and builds decision models. These are used to determine optimal layouts during runtime.
APPLICATION ANALYSIS
To optimize an application using MATOG, it is necessary to determine which configurations achieve optimal performance. As MATOG is meant to work for arbitrary NVIDIA GPU architectures and applications from all kinds of application domains, we treat GPUs, kernels, and optimizations as black boxes. In particular, we assume that we do not know anything about their specific properties, except for which optimizations can be applied for a given GPU, a particular kernel, and the hints given by the programmer. This potential lack of information might complicate the analysis but guarantees that our concept can be applied to any application on any past, current, and future hardware, as long as the optimization to hardware relation is modeled and as long as the code can actually be compiled on the specific hardware. In the following section, we formalize our optimization problem. Then we explain our algorithm to find optimal solutions for a particular application on a given GPU. We use a three-step application analysis consisting of empirical profiling (Section 4.2) to determine which optimizations are optimal, offline analysis (Section 4.3) to determine the application-wide optimal solution, and decision models (Section 4.4) that select data-dependent optimal solutions during runtime.
Optimization Problem
Our optimization problem has multiple optimization dimensions: If an array uses struct types, we can optimize the struct layout (d L ∈ [AoS, SoA, AoSoA]). For multi-dimensional arrays, we can optimize the transposition (d T ), where the number of possible values is the factorial of the number of array dimensions. If data is only read from arrays residing in global memory, we can determine if we use the default or non-coherent cache hierarchy, or place the data in constant memory (d M ∈ [Default, Non-Coherent, Constant*] *if size is known at compile time). On Fermi and Kepler cards, we can further determine the size of the L1 cache (d L1 ∈ [Prefer SM, Prefer L1, Prefer EQ*] *only for Kepler) by trading with shared memory. Additionally, we can select different implementations using user-defined preprocessor instructions (d D ∈ [...]) and value ranges (d R ∈ [min, min + step, . . . , max]). As can easily be seen, the number of dimensions for a single kernel can be very high as a single array already can have up to three optimization dimensions. Further, we can see that the number of values for a dimension are very low. This causes a very high-dimensional optimization problem with a very limited extend but a very high number of configurations (C) for a given kernel (k ∈ K), as this is defined by |C k | = d ∈D |d |. This gets even more complex when multiple kernels are used as this results in |C | = k ∈K |C k | configurations.
Step 1: Application Profiling
As processor architectures can significantly change from one to another generation, purely analytical models can break, as new or changed features no longer perform as modeled. Therefore, we solely rely on empirical profiling without any code parsing or specific GPU model. This removes the necessity of constantly updating the used analytical GPU model with each new generation. However, the mentioned high number of configurations is a big problem for empirical profiling. These configurations not only have to be executed but must also be compiled, as the optimizations for the GPU code have to be explicitly compiled to achieve maximal performance. Further, to gather the execution time of the application it has to be restarted in each configuration, which leads to massive overhead especially if a huge amount of I/O is taking place. To tackle these problems, we used a specialized profiling technique in Weber et al. (2015) which relies on an in-application profiling (similar to the NVIDIA CUDA profiler) and a prediction algorithm that performs a specialized search space pruning. For time measurements we rely on the CUDA Profiling Tools Interface (CUPTI). To validate correctness, MATOG not only checks whether configurations can actually be executed (e.g., do not exceed constant memory limitations or try to use texture memory on nonread-only arrays) but also has a verification mode that can be activated during profiling to verify that all configurations produce the same results.
In-Application
Profiling. An in-application profiling has the advantage (compared to restarting an application) that it does not require one to repeat any application setup and finalization procedures, which can be very time-consuming depending on how much and from where data has to be read or written to. In addition, the in-application profiling reduces the number of configurations to test from k ∈K |C k | to k ∈K |C k | as we can calculate the execution time for all permutations of kernel configurations and do not need to explicitly measure them. One drawback of this method is, however, that it requires much more memory, as data has to be duplicated so it can be restored prior to (re-)executing the same kernel, in another configuration. Further, it has to be converted into other data layouts. In MATOG, we copy the data to the host system prior to each kernel execution and start parallel CPU threads to convert data if necessary. We compile all necessary kernel configurations in parallel to the GPU profiling, overlapping compilation and execution. This is realized using multiple threads so it does not interfere with the profiling of the kernel execution.
Prediction-Based Profiling.
In Weber et al. (2015) , we have introduced a prediction-based algorithm that requires only very few samples to estimate the performance of the entire optimization space for a single kernel. This is based on the observation that many optimizations do not influence others. Formally, this means that many dimensions are linear independent of others and can be optimized independently of these. This allows us to use a difference model to predict the execution time of a kernel in a specific configuration (p(k, c p )). For this we require a reference point that we call base configuration (c b ) (that can be arbitrarily selected). Further, several additional data points are required, called support configuration (c s,d ). For these, all values are equal to the base configuration, except for the value of one dimension, d. To predict the performance of a specific configuration, we then use the difference (Δ(t 1 , t 2 ) = t 1 − t 2 )) of the execution time between support and base configurations, sum them up, and add the time of the base configuration.
.
(
However, in reality this does not always work, as not all dimensions are independent of the other ones. This is caused by the fact that some optimizations, e.g., the L1 cache size (d L1 ), significantly When AoS is used, each iteration of the inner loop (although executed in parallel) requires one to read a separate non-connected data line. In contrast, for AoS T only three neighboring lines have to be read. SoA T is the best layout that only needs to read two consecutive lines.
influence the availability of resources and can thus lead to strongly varying hardware utilizations. Other dimension types (e.g., layouts or transpositions) are mostly independent and hardly change the amount of used hardware resources. Given this influence, we have to modify the prediction formula. First, we divide all dimensions into two sets. The first contains independent dimensions (D I ) and the second contains all that have an influence on others, which we further call shared dimension (D S ). For each value combination of the D S , we create a separate prediction domain. Only configurations that are located inside this domain can be used to predict configurations in this domain. Simply spoken, this means that each domain requires its own base configuration and consequently, also corresponding support configurations. For the profiling, this means that we have to execute all D S ⊗ D I combinations to gather all required measurements to estimate the performance of all other configurations (shown in Figure 4 ). The number of these configurations is significantly smaller than for an exhaustive search, as can be seen by
(2) Figure 4 shows an example on how the support configurations are selected for the prediction. Note that we empirically showed that his works on NVIDIA GPUs but cannot provide any proof of why this works, as the GPUs and the compilers are proprietary and their specification not publicly available. We can, however, provide the following argument: Let us assume a very simple kernel with a 5 × 5 AoS consisting of two fields, a memory access as shown in Figure 5 , a device with four hardware threads, and a memory controller that can fetch one data line with four adjoining items Fig. 6 . Left: The measured performance (black) of 5,184 configurations sorted from best to worst, our predicted performance (green) based on measured support (blue cross), and base (red cross) configurations. Right: A closer look at the section of the best configurations (indicated by the black box). It can be seen that the prediction can be noisy and can slightly deviate from the exact results.
at once. For our theoretical system, each data line load requires 10 clock cycles and one additional clock cycle for reading data lines that are not adjoined. Further, let us assume that the data can only be represented as AoS or SoA, and can be stored untransposed or transposed. This results in a total of four configurations. Figure 5 shows the memory access for all configurations in the first iteration of the inner loop. The major goal for us is to minimize the required clock cycles for all memory loads. As can be seen, SoA T is the best layout with only two lines to be read. To find the optimal layout, we have to sample AoS as base configuration. We also require two support configurations, which are AoS T and SoA. Their measured execution time is t (AoS) = 54, t (AoS T ) = 30, and t (SoA) = 51 clock cycles for the inner loop. When we apply our predictor (Equation (1)) we get the predicted execution time p(SoA T ) = 27, which is the best result. Although this value differs from the exact value t (SoA T ) = 20, it is still a useful prediction. In fact, the difference is caused by the choice of parameters for our artificial example. However, in a real system we would expect a deviation anyway, caused by noise of the measurements. Figure 6 shows a real example for the prediction of a kernel, where we require three base and 36 support configurations to estimate the performance of 5,145 non-profiled configurations, which is less than 1% of the total configuration count. As can be seen, the prediction is not perfect but it is good enough to select a configuration, which is pretty close to the optimal solution.
Step 2: Determine Optimal Configurations
Given the prediction algorithm, we can calculate the optimal configuration for each kernel execution. Now we have to determine an application-wide optimal solution. This is difficult, as arrays that are used in multiple kernels introduce dependency constraints that need to be resolved. In general, there are two ways to handle these. Either data can be converted between two kernel executions, or not. Without knowing how long a kernel will execute using unknown data, we have no means to predict at runtime whether the conversion will yield enough improvement to compensate the conversion, so we decided not to allow any conversions once a layout has been determined. However, we plan to investigate this further in the future.
Decisions.
As data is allocated at various points during the execution, we may have to apply some data layouts already prior to any kernel executions, e.g., host arrays are usually filled with data before they are copied to the device. We differentiate between three decision events that require action:
(1) whenever a host array is allocated, (2) whenever a device array is used in a kernel (if it has not inherited a layout through a prior memcopy), and (3) whenever a kernel is executed, non-device memory decisions (e.g., shared/local memory, defines, ranges, . . . ) can be determined without any constraints.
There is one special case: If a device array was marked as write-only, the kernel will overwrite the entire content of the array and therefore is not constrained by the existing layout. In this case, MATOG can assign a new layout resulting in a new decision (Type 2). As an allocation can occur multiple times during the execution (e.g., in a loop), every array gets assigned a unique id. This allows one also to use the size of previously allocated arrays as metadata during the decisionmaking process.
Array Dependencies.
To resolve the array dependencies, we previously used an Array Dependency Graph (ADG) (Weber and Goesele 2016 ). An ADG is a directed graph that contains nodes for each array allocation, kernel execution, and memcopy between host and device. This graph maps onto the lifecycle of arrays during the application execution. While the ADG representation fits exactly the flow of the application, it is difficult to parse. We propose here a simplified version, called Decision Dependency Graph (DDG). This is also a directed graph, but instead of mapping to the lifecycle of arrays, it maps onto the decisions made during the execution. There are only two types of nodes in this tree: global and kernel decisions. Global decisions occur whenever the layout of a global memory array needs to be determined (Types 1 and 2). These nodes are connected to kernel decisions (Type 3) by edges. Each edge is labeled by the global decision it originates from and connects all kernels in sequence it is used in. Figure 7 shows an artificial example of a DDG with all possible cases. Depending on the structure of the application, the graph does not need to be fully connected and can be a forest of multiple graphs. This also happens when the application has been profiled with multiple different datasets, where each of these profiling results is a disjunct graph. In the following, we describe our method on one graph, but this can be easily extended to a forest of graphs.
Exhaustive Search.
In Weber and Goesele (2016), we proposed an exhaustive analysis of the DDG. This method guarantees one to find the optimum of the application, but can be very timeconsuming for complex applications with a high number of reallocations. The exhaustive search varies all possible combinations of the DDG and calculates the total execution time for all kernel executions in it using exhaustive profiling data or our predicted execution times. If predicted data is used, the execution can be sped up by splitting all decisions into two categories: local and global. Local decisions influence only one kernel execution, e.g., shared/local memory layouts, usage of non-coherent cache, constant memory, ranges, or defines. Global decisions are all global memory struct layouts and transpositions. This separation allows one to precompute optimal values for the local decisions, for each kernel, as we can store one optimal local value for each possible global decision permutation.
Predictive Search.
The problem with an exhaustive search is that if an application allocates or reallocates many arrays (e.g., in every iteration of a loop), the number of global decision nodes grows rapidly. As the number of combinations that have to be evaluated thus increases exponentially, this method becomes quickly unfeasible. Unfortunately, this often occurs in real applications. We therefore extended our predictive profiling onto the search for an application-wide optimal solution. The assumption of the prediction algorithm implies that we can select optimal values without respecting any interactions between the optimization dimensions. In the first step, we iterate over all global decisions separately and select optimal values for these. We ignore the optimization domains in this step and only search inside one domain of the shared degrees. This is possible, as we have previously mentioned, layouts and transpositions (as employed by global decisions) hardly change the amount of used resources. In a second step, we iterate over all local decisions and search for the best values, but now we obey the optimization domains. We will show in our experiments that this method provides comparable results to an exhaustive search.
Step 3: Decision Models
At this point, we know the optimal solutions for all profiled application runs. Now we have to use this information to build decision models that can select optimal configurations during runtime. For these decisions, we require some kind of metadata to distinguish between different input data scenarios. Auto-tuners such as Chung and Hollingsworth (2004) and Muralidharan et al. (2014) use user-defined callback functions that calculate some arbitrary, user-determined metrics for this distinction. We instead use a fully automatic way. As analyzing and categorizing arbitrary data is a very difficult, often time-consuming task (depending on which analyses are performed), we chose to use the metadata that is already available in our system such as the sizes of arrays and the launch configurations of kernels. In contrast to our previous publications, our improved metadata gathering system not only tracks the most recent arrays in the system, but also previous data, using unique ids for each allocated array. We additionally allow the user to register variables, which are monitored during the execution. These can contain data that has been explicitly calculated by the users, or e.g., counts of items in a preallocated, not-resized array. This information is usually available in the application without any need to specifically calculate these. For our decision model creation, we gather all decision events from all DDGs and group them by their decision event. We create one model per decision event. For this we collect all metadata that was present in the profiling when the decision event occurred and store it in a matrix, in which each column represents one kind of metadata (e.g., size of a specific array or the launch configuration of a kernel) and each row represents one decision event. At this point it does not matter if it is an array allocation or kernel execution decision. This gathered metadata usually contains a high amount of redundant, linear-dependent or constant data. To simplify the decision model creation and to reduce the metadata gathering overhead during runtime, we preprocess this data by removing all redundant data. If multiple rows with equal metadata but different optimal configurations occur, we perform a majority voting and keep the configuration which was chosen most. (2016) we used a SVM as decision model. This proved to be suitable but may not be optimal (depending on the application). This is rooted in the kind of metadata that we use in MATOG, consisting of array sizes and launch configurations. These come with some difficulties. For example, as we do not know the limits of our gathered metadata, there is no way to normalize it, which can result in bad decisions if the values differ too much from the training data. Further, as array sizes and launch configurations are usually used as iteration counts for loops, they imply a certain linear scaling, which can be leveraged in a decision model. We therefore experimented with other kinds of decision methods and found that in some On the left, the data used for training and the resulting models are shown. On the right, the models are applied to the testing datasets and false decisions are highlighted with a red cross. Be advised that the meta dimensions are hand picked and that the actual model does not only decide whether to use local or shared memory, but also decides on layouts, transpositions, and texture memory usage.
Directional Model. In Weber and Goesele
of the failure cases that we had encountered, a specialized nearest-neighbor search model worked better; in the following called Directional Model (DM). The DM uses the cosine similarity distance and interprets the training ( t) and meta ( m) data as vectors. It then applies the normalized dot product on these vectors in order to compute the metric: λ = cos (γ ) = t · m/| t || m|. The resulting λ is in the range [−1; 1]. To evaluate the model, we iterate over all training samples, calculate λ, and select the data point where λ is maximal. The advantage of this method is that it performs an implicit normalization, resulting in better decisions even if the meta and training data values significantly differ. Figure 8 shows a simple example for a linear SVM compared to the DM, both trained and evaluated on data taken from our KD-Tree benchmark, run on a Tesla K20c. Every dot represents the metadata of a kernel execution and colors indicate if either using local (orange) or shared memory (blue) is the better choice. In the upper part, we show the results for a linear SVM, with the training (left) and testing (right) data. The support vectors and false decisions are highlighted by black circles and red crosses, respectively. The hyperplane of the SVM is shown as a black line and the colored background visualizes the classification of the SVM. In the lower part, we show the same diagrams for the DM. The decision vectors are shown as arrows, with the color of the corresponding decision. As can be seen, the number of false decisions is significantly lower for the DM than for the linear SVM.
MATOG RUNTIME SYSTEM
In this section, we briefly explain our runtime system. At the beginning of the application run, MATOG is automatically initialized together with CUDA. Whenever an array is allocated, we determine which decision event it belongs to and store its metadata in a centralized metadata storage. If an array of the same decision event is allocated, it gets a new instance id assigned. If this array is a host array, its decision model is evaluated immediately and the layout is set (Type 1). Whenever a memcopy occurs, MATOG propagates the layout from the source to the destination array. If a kernel is executed, all non-initialized device arrays evaluate their decision model and set their layouts (Type 2). Further, the kernel evaluates its own decision model (Type 3). The final executed configuration of the kernel is the combination of all layouts of the device arrays (which have been separately evaluated before) and the configuration from the kernel's decision model. If this configuration has been compiled before, it is loaded from the database, and then executed. If not, it will be compiled and then executed. In Weber and Goesele (2016), we compiled all available combinations that could occur after our application analysis process; however, depending on the complexity of the application this can be several million configurations, making this approach not usable. We discuss in our future work how this high number of configurations could be reduced to improve the compilation time, but also reduce the number of implementation switches during runtime.
EVALUATION
In this section, we evaluate the different analysis steps of MATOG, their efficiency, and the achieved performance of MATOG on six applications. All tests have been performed on dual Intel Xeon E5649, 48GB RAM, Ubuntu 16.04 and CUDA 8.0 (driver version 367.57). We evaluated the last four GPU architectures on low-end, high-end, and HPC GPUs: Fermi (GT 440, GTX 480, GTX 570), Kepler (GTX 680, GT 730, K20c), Maxwell (GTX 980), and Pascal (GTX 1080). The results are compared to an unmodified but optimized reference code and our previous version of MATOG (Weber and Goesele 2016) . Changes to the auto-tuner in our new version include the following:
-Support of hierarchical data structures (e.g., array [x] [y].subarray [z] .substruct.
field [w] ). -Support for Fermi GPUs.
-Support to store data in constant memory (if size is constant and known at compile time). -New "read once" hint, allowing one to use texture memory to read non-read-only data once. This is helpful, if data is once read at the beginning of the kernel and then overwritten but never read again in the same kernel call. -Improved metadata system (as described in Section 4.4).
Providing a fair and direct comparison to related approaches on our benchmarks is difficult. On one hand, there is, to our knowledge, no publicly available code for several other memory access auto-tuners (Sung et al. 2012; Kofler et al. 2015; Peng et al. 2016) . On the other hand, auto-tuners such as Nitro (Muralidharan et al. 2014) or OpenTuner (Ansel et al. 2014 ) do provide code but miss specific, automatically generated optimizations. Instead, they require all optimizations considered to be explicitly hand-coded by the user, a stark contrast to MATOG's automated approach. We can, however, make indirect but meaningful comparisons: Nitro is designed to handle only a small number of configurations and thus relies on exhaustive search. Since we provide both timing as well as performance evaluations for exhaustive search as part of our results below, these can be seen as representative for Nitro. OpenTuner uses a genetic algorithm. In Weber et al. (2015) , we showed that our predictive-based search uses a minimalistic set of configurations that suffices to find comparable results and is faster for the optimization problem that we face in MATOG, compared to a genetic algorithm. This performance argument is still valid for the current version of MATOG and thus provides a clear advantage. 
Benchmark Applications
We evaluate six different GPU applications, ranging from very simple algorithms with regular workload up to very irregular algorithms with varying workload. Table 1 shows the number of possible configurations per GPU and benchmark. Please refer to our project page 2 for details on our training and test datasets. Bitonic Sort (Batcher 1968 ) is a widely used parallel sorting algorithm. In our implementation, it sorts a 1D AoS with four integer fields (8, 4, 2, and 1 B), ensuring that the 8B value is sorted first and only if this value is equal, the 4B value is sorted and so on. This results in a sorted list, for all fields. To ensure conflicting rows, we limit all values to 0 to 1,023 (255 for the 1B field). The application consists of two kernels: One uses shared memory in loop iterations where it can be used efficiently, while the other directly operates on global memory. The reference implementation is not optimized and uses a naïvely AoS layout. We train on two datasets and evaluate on five others, all with varying element counts (64k to 4M).
Speckle Reducing Anisotropic Diffusion (SRAD) (Che et al. 2009 ) is a diffusion method for ultrasonic and radar imaging applications. The benchmark's computations are quite simple and use a straightforward implementation without much program logic or dynamic allocations. We train on three parameter sets with the same grid size and one iteration, and evaluate 12 other parameter sets with varying grid sizes and 100 iterations.
Hotspot (Che et al. 2009 ) calculates the processor temperature based on an architectural floor plan and simulated power measurements. The overall execution time of the application is quite low. The benchmark comes with three datasets, so we train on the medium-sized one and evaluate on the other two.
Coevolution via MI on CUDA (COMIC) (Waechter et al. 2012) calculates the coevolutionary mutual information for protein and DNA sequences. It consists of three kernels: The first one initializes a randomization seed that is used in a second kernel to permute the sequences. The third kernel performs the main operation by creating a 3D histogram of occurrences in the permuted sequences. This kernel is templated and uses different compression schemes depending on the input data. The original implementation was optimized for a GTX 480. We train on two datasets with one iteration and evaluate on 11 others using 100 iterations.
Renders Everything You Ever Saw (REYES) (Cook et al. 1987 ) is a technique used in movie productions. In contrast to classic 3D mesh rendering, it uses patches to model smooth surfaces. The patches are transformed into micro polygons and iteratively split into smaller polygons until they have subpixel size. The implementation uses four kernels: One of the kernels performs the splitting while the second compresses the micro polygons after each iteration. A third kernel uses the final polygons and renders the resulting image into a depth buffer which contains depth and color information for each pixel. Finally, a fourth kernel extracts the color information from this buffer into a 2D texture which then can be displayed. The benchmark was originally developed for a GTX 480. As this benchmark has a very high number of configurations, we do not show any exhaustive profiling results as it would require several weeks to profile the application. We train on one model, rendering two frames on two different resolutions and evaluate on eight models, 100 frames, and varying resolutions. The models are rotated after each frame to change the workload, as depending on the viewing angle, more or less polygons have to be split and rendered.
The KD-Tree benchmark constructs acceleration structures for 3D ray tracing and resembles the work of Popov et al. (2006) . The application consists of two main and six maintenance kernels, which have a low total execution time. The first main kernel discretizes a triangulated scene in multiple bins which are separated by equidistant planes in all three dimensions. It then calculates a building heuristic to determine the best split plane. The second main kernel performs the actual splitting of a subtree and stores all necessary data in two different data segments, one for each child tree. All kernels are built in a fashion that they can process multiple subtrees in parallel. More information on this benchmark can be found in Weber and Goesele (2016) . The difficulty of this benchmark is the changing data characteristics over time. At the beginning, only very few but very big subtrees are processed. While the number of subtrees significantly increases during the processing, the size of these is reduced and the number of leaf subtrees constantly increases. In the end many, very small subtrees are processed. This change has a significant impact on the performance and should benefit from MATOG's adaptiveness. This benchmark was optimized for a GTX 590. Due to the extreme high number of configurations, we do not show any exhaustive profiling results. Further, as each iteration memory is reallocated, our implementation of the exhaustive post-processing cannot be used, as the possible combinations exceed 2 64 . We train on one medium-sized 3D scene and evaluate on nine others. As some of these are very big (up to 12M triangles), not all GPUs are able to process these because of memory size limitations.
Execution Performance
In this section, we evaluate the achieved execution performance of MATOG against an unmodified reference code and our previous version (Weber and Goesele 2016) . All executions have been repeated five times (except for the application profiling runs), to reduce the influence of measurement noise. All evaluations have been performed on separate training and evaluation datasets.
GPU Execution Time.
First, we take a look at the achieved speedup for the kernel execution time. We have evaluated three different decision model types: SVM, DM, and a non-adaptive model, that always selects the single configuration that was chosen most, to verify if adaptive decisions are necessary to achieve better performance. Further, we evaluated three analysis strategies: exhaustive profiling/exhaustive analysis, predictive profiling/exhaustive search (method used in Weber and Goesele (2016)), and predictive profiling/predictive analysis (our new method). As our old auto-tuner does not support Fermi GPUs, no results are available for the GT 440 and GTX 480 and 570. The results are shown in Figure 9 . As mentioned before, Bitonic Sort, SRAD, and Hotspot are simple algorithms without much divergence and a very regular processing. SRAD and Hotspot are already optimally optimized in terms of memory access. MATOG is slightly slower than the reference code, as MATOG data structures employ a certain overhead compared to hardcoded memory access. One reason for this is pointer aliasing, as the way we have implemented the MATOG data structures does not allow one to use the restrict keyword to define the internal pointers as non-overlapping. Therefore, the compiler is unable to perform certain optimizations, which can slightly decrease the performance. In most cases, no real difference between the different analysis methods and decision models can be seen, implying that these are invariant to adaptive optimizations. For SRAD, the exhaustive profiling usually finds slightly better solutions than our predictive profiling, but has to profile over 2,371× more configurations to achieve this result. Our new auto-tuner achieves significantly better performance in SRAD and Hotspot compared to our previous version. We are unsure why this happens and assume that internal improvements of the data structures are responsible for this. On the GT 440 and Tesla K20c, MATOG is even able to achieve higher performance than the hand-optimized implementations. The irregularities for the 28:20 N. Weber and M. Goesele Fig. 10 . CPU, GPU, and total speedup compared between the reference code and our proposed method. For Bitonic and Hotspot, we can see that the overhead introduced into the MATOG system is not too high. Through the extremely low execution time of SRAD, the setup overhead of MATOG decreases the CPU performance. COMIC, however, benefits on the CPU and GPU side from the MATOG data structures. For REYES and KD-Tree, the optimized layouts for the GPU somewhat decrease the performance of the CPU, as they perform less optimally. However, for the three complex benchmarks and the Bitonic Sort we mostly see an improved overall application performance. SRAD's performance is decreased and Hotspot is neutral.
results of the GT 730 on Hotspot are difficult to explain. We think this is caused by the heat of the chip, as this GPU is passive cooled and therefore can only regulate its temperature by reducing the clock frequency. So far, all benchmarks have been simple algorithms, consisting of only one to two kernels with a very low execution time, that do not benefit from adaptiveness during runtime. The three remaining applications are much more complex, consisting of three to eight kernels. For COMIC we can see that the reference code is optimal for the GTX 480 and 570, which it had been developed for. On the newer cards, however, MATOG is able to leverage up to 34% more performance. MATOG chooses no texture memory on all GPUs prior the GT 730. The main speedup is achieved through transposition of the data stored in shared memory. Local memory is never used. This benchmark also does not benefit from adaptiveness. Although REYES has been optimized for the GTX 480, MATOG is able to slightly improve the performance on this GPU. The highest speedup of 37% is achieved on the Tesla K20c. As can be seen, our adaptive methods achieve up to 7% higher performance. The KD-Tree is by far the most complex and difficult to optimize applications we evaluate on. It has a lot of changing data properties, a huge number of reallocations, and an irregular work processing. The non-adaptive model achieves decent speedups, but the DM always achieves the best and is up to 32% faster than the non-adaptive solution. The SVM always performs less optimally than the DM, especially on the GT 730. The SVM is here unable to understand the underlying coherence of the metadata since the values differ too much from the training data, as it cannot be normalized. This causes the SVM to use local memory very often, whereas this is only optimal in the very first iterations with few big subtrees and performs very badly in all subsequent iterations. To summarize, we can say that MATOG achieved the highest speedups on the Tesla K20c. There is no real difference between the analysis methods (in terms of quality), except for the SRAD benchmark, where the exhaustive profiling often is able to find better configurations. Further, the DM has achieved the highest performance (compared to the other methods). In the very complex applications, our adaptive optimizations outperform the static optimizations by up to 27%.
Application Execution Time.
Next, we take a look at the application execution time, containing not only the GPU but also the CPU computation time. In Figure 10 , we show the speedup of CPU, GPU, and the total application speedup separated, in relation to the reference code. Except for COMIC and some cases of the Bitonic Sort, all benchmarks push the execution more toward Fig. 11 . Percentage of CPU time for the reference and our method. As can be seen, the three simple benchmarks have a significantly higher portion of CPU time, therefore solely optimizing the GPU does not necessarily benefit the application. For the other benchmarks, the CPU time is much lower. Further can be seen that newer/faster GPUs push the ratio more to the CPU. the CPU. This can be explained by the fact that the MATOG runtime system produces a certain overhead through an additional setup time, metadata tracking, decision making, exchanging of different GPU codes, and the MATOG host arrays, as they use a dynamic data layout which is determined by runtime that does not perform as fast as a static hard-coded memory access. Further, data layouts that perform optimally on GPUs do not necessarily perform well on CPUs. For the GPU, we have already seen the achieved speedups, which are mostly positive. By looking at the total execution speedup, we see that except for SRAD, all benchmarks are faster, or at least as fast as the reference implementation. Figure 11 shows that the simple applications (Bitonic Sort, SRAD, and Hotspot) spend most of their execution time in CPU code. The up to 80% GPU speedup for the Bitonic Sort does, therefore, yield only a small total speedup. Pennycock et al. (2016) proposed a method to measure performance portability for applications across multiple platforms. They calculate the harmonic mean over the execution time for one approach over a series of executed applications and normalize it with the harmonic mean of the best results for the applications. Figure 12 shows the results for CUDA and our proposed method, compared to all other methods we have analyzed. Except for SRAD and Hotspot, our method is superior to the pure CUDA implementation for the GPU and application efficiency. Overall CUDA achieves an average efficiency of 87.06% GPU and 94.79% application efficiency, while our proposed method achieves 98.65% and 97.03%, respectively. Assuming that an exhaustive search would yield 100.0% GPU efficiency, our method comes pretty close, whereas it requires significantly less time to achieve its results, as we show in the next section.
Analysis Time
Finally, we take a look at the time required for the analysis. Figure 13 shows the measured times for the profiling and analysis of the benchmarks on the GTX 1080. As stated before, REYES and It is easy to see that the predictive profiling is significantly faster than the exhaustive profiling and can profile all applications within a few minutes. For reference, the execution time required to run the application with the training datasets but without profiling is also provided. As can be seen, for SRAD the exhaustive search requires more than 12.000× more time for the profiling, compared to the application runtime, whereas our method only requires 4.5×. Right: Time required for the analysis of the profiling data. The analysis is quite fast for all methods (compared to the profiling); however, our proposed method (green) is always the fastest. Further, it can easily be seen that with increasing complexity the execution time rises. The exhaustive analysis requires more time using exhaustive data (blue) than with predictive data (orange) caused by much more data that needs to be loaded and processed.
KD-Tree require too much time for the exhaustive profiling and have therefore been excluded. Except for the Bitonic Sort, our predictive profiling is always significantly faster than the exhaustive search. The reason for this is that the Bitonic Sort only has very few possible configurations, so that the number of configurations hardly differ.
DISCUSSION
Summarizing our results, we can see that it is essential to adapt the code to the specific GPU architecture, as code written for older cards does not necessarily perform well on newer hardware, especially for complex applications with irregular or changing workload. Depending on the complexity of the application, static optimizations (Bitonic Sort, SRAD, Hotspot, COMIC) may suffice. Anyway, these optimal parameters have to be found in the first place. In other cases (REYES and KD-Tree), dynamic optimizations achieve a much higher performance. Further, we can see that optimal memory access is crucial in many applications and can significantly influence the performance of the code. However, solely optimizing the GPU code does not necessarily improve the overall performance as can be seen in the Bitonic Sort benchmark, where the 80% speedup of the GPU code is nearly entirely consumed by an increase of the CPU execution, as the optimal layouts for GPUs can be suboptimal for CPUs.
Analysis of Optimal Configurations
In Figure 14 , we show how often which layout, cache size, transposition, or memory has been used by our method. The results differ not only between the different benchmarks significantly, but also between the GPUs even within the same architecture. There is no clear tendency toward a specific struct layout, transposition, L1 cache size, or the usage of texture memory. An analysis of the resource usage of the code and the SASS assembly code showed, however, some cases that could also easily be optimized by a programmer.
Simple Memory Bound
Benchmark. The Bitonic benchmark mainly uses SoA since all threads access adjoining indices. In some cases, AoSoA is equally good because all tiles are stored Percentage of how often which layout, cache size, transposition, or memory was chosen to be optimal by our method on the testing datasets. As can be seen, the results significantly differ between the benchmarks but also between the GPUs, even within the same architecture.
in a SoA-manner. However, AoSoA requires more calculations to determine the correct memory address (soa.field [index] vs. aosoa [index/32] .field[index%32]). The kernels of the Bitonic benchmark are memory bound and far from the resource border where the occupancy could drop because of a higher register usage, so that additional computations are no limiting factor for the performance. SoA and AoSoA therefore perform similarly, and the variation in our results is probably mainly caused by measurement noise.
Complex Compute Bound Benchmark.
For the more complex benchmarks, e.g., the third kernel of the REYES benchmark, it is more difficult to explain all effects. In contrast to the Bitonic kernel, it is compute bound and much more complex, as it uses multiple multi-dimensional arrays with struct types in global and shared memory. Although it is compute bound, the worst configurations we found are 23% slower on a GTX 980 than the best. We have mainly two observations for this kernel. First, on Fermi GPUs an AoS layout works optimally for the data stored in shared memory, while on newer Maxwell or Pascal cards, SoA is consistently used. As the way in which shared memory is supposed to be used (according to the programming guide (NVIDIA 2016)) has not changed since Fermi, we did not expect this result. Second, on the Tesla K20 the micro polygons are stored as SoA in device memory, while they are stored as AoS on the GTX 980. The projection matrices (in constant memory) and micro polygons (in shared memory) are stored transposed on the Tesla K20, while on the GTX 980 they are untransposed. Analyzing the SASS assembler of both configurations, we realized that the code for the GTX 980 is longer and issues more instructions per instruction block than the code optimized for the Tesla K20. The results of the NVIDIA profiler confirm this, as the code executes 2.4% more integer instructions, but achieves also 1.1% more instructions per cycle. On the Tesla K20, it is behaving differently as the optimized code for the Tesla K20 issues 1.8% less integer instructions than the code optimized for the GTX 980, and achieves 0.4% higher instruction per cycle. It seems almost as if the different memory layouts trigger the compiler to perform very different code optimizations which have a strong implication on the achieved performance for compute bound kernels.
This example shows the need for auto-tuners such as MATOG, as solely concentrating on one optimization goal (e.g., optimal memory access) does not necessarily lead to maximal performance. MATOG is designed to optimize memory access, but through the usage of empirical profiling it implicitly optimizes resource usage and computations.
Unintuitive Optimizations.
Another surprising result is that the optimal size of the L1 cache size seems to often be unpredictable. Reducing the amount of shared memory can reduce the occupancy, which can in turn have positive or negative effects on the execution time (Volkov 2010) . Further, since the Kepler architecture, the L1 cache is supposed to only serve local memory, so an intuitive way would be to use more L1 cache in kernels which heavily rely on local memory and use only small amounts of shared memory. However, in some of these cases this approach seems to work, but in others different settings work better.
Overall, the usage of local memory instead of shared memory only proves to be beneficial in some rare cases. REYES is the only benchmark where certain array sizes are known at compile time. This benchmark greatly benefits from using constant memory, although the usage of texture, global, and constant memory varies significantly between the different architectures.
Furthermore, when to use texture memory is difficult to answer. Our results indicate that a certain balance between using texture and global memory has to be found to achieve maximal performance. What we can see is the obvious fact, that using texture memory only benefits arrays that are read more than once. However, not using texture memory in this case allows other data to be cached there instead. To use texture memory always, e.g., through blindly defining all read-only arrays as const __restrict__ type* on GPUs with CC ≥ 3.5, should be avoided.
Conclusion
In this article, we have provided a unified view of the MATOG components published in the previous conference publications (Weber et al. 2014 (Weber et al. , 2015 (Weber et al. , 2016 as well as an improved application analysis algorithm and runtime decision-making system. Our auto-tuner is capable of analyzing applications with billions of possible configurations within a few minutes while achieving comparable results to an exhaustive analysis. MATOG's ability to adjust itself during runtime to react to changing data properties allows one to achieve higher performance than a static hand-optimized implementation. We further showed the flexibility of MATOG to adjust itself to low-end (GT 440 and 730), high-end (GTX 570, 680, 980, and 1080) , and HPC GPUs (Tesla K20c).
However, our results also show that there is still room for improvements leading to several different directions that we want to pursue in future: First, MATOG does not take CPU times into consideration. As every GPU application is a heterogeneous device application, this does not guarantee overall optimal performance. In the future, we therefore plan to extend MATOG to different device types. Second, we want to explore more optimizations, as e.g., an explicit cache bypassing (Li et al. 2015a) , an automated selection of optimal sparse matrix formats (Muralidharan et al. 2014) , or an automatic separation of AoS fields to partially be stored in a AoS or SoA (Peng et al. 2016) . Third, we plan to integrate MATOG into an optimizing compiler (e.g., OpenARC (Lee and Vetter 2014) ) which then will integrate MATOG automatically into existing applications. Fourth, one of the current shortcomings is the very high number of nearly optimal configurations. This
