Abstract-Sparse grids have already been successfully used in various high-performance computing (HPC) applications, including data mining. In this article, we take a legacy classification kernel previously optimized for the AVX2 instruction set and investigate the benefits of using the newer AVX-512-based multiand many-core architectures. In particular, the Knights Landing (KNL) processor is used to study the possible performance gains of the code. Not all kernels benefit equally from such architectures, therefore choices in optimization steps and KNL cluster and memory modes need to be filtered through the lens of the code implementation at hand. With a less traditional approach of manual vectorization through instruction-level intrinsics, our kernel provides a differently faceted look into the optimization process. Observations stem from results obtained for node-and cluster-level classification simulations with up to 2ˆ28 multidimensional training data points, using the CooLMUC-3 cluster of the Leibniz Supercomputing Center (LRZ) in Garching, Germany.
I. INTRODUCTION
Today's processor architectures make for a double edge sword with regards to HPC: on the one hand more performance comes cheaply, being accessible by even the moderately optimized codes. On the other hand, however, to truly use the full capabilities of today's clusters, a deep understanding of both the underlying hardware and the computational kernels is still irreplaceable. Compilers are getting better at doing work for developers and are more lenient towards poor uses of available resources, but with core designs growing in complexity as well, the task of optimization has never been more demanding. This is particularly true for current multi-and many-core Intel architectures, which are already the basis of many highranking Top500 supercomputers 1 . Data mining is an evolving and quite broad field of study that can benefit significantly from the development of new HPC architectures. Problems in both supervised and unsupervised learning settings have found solutions through the use of multiple algorithms and numerical methods, some more suitable than others for large-scale simulations. One such family of numerical techniques is represented by sparse grids, which aid in the representation, interpolation, and integration of multidimensional functions [1] , [2] . These methods relieve 1 https://www.top500.org/lists/2018/06/ some of the burden brought by high-dimensional problems (the so-called curse of dimensionality), and thus, through the additional layers of parallelization, applications using sparse grids can scale to a larger extent than other approaches.
This article, based on research done within the Intel Parallel Computing Center (IPCC) at LRZ, sets out as a study case for the particular data mining application chosen, the optimization route taken (by employing manual vectorization techniques), and the hardware on which it is run. The focus of our efforts is an architecture that has introduced several noteworthy new features: the 2nd generation Xeon Phi many-core architecture, code-named Knights Landing (KNL). We argue that the lessons learned throughout this work can be transfered looking forward to newer architectures (starting with Skylake) and to computational kernels similar to the one considered here, which might also benefit from the kind of hands-on approach to optimization we showcase. The main contributions of this article are therefore:
• Investigating optimization techniques for performance gains with AVX-512 on our legacy kernel; • Studying the influence of hardware specific features (i.e. KNL operating modes) at the core-, node-, and clusterlevel; • Performing scaling tests on the chosen testing platform and measuring performance gains relative to the legacy code running on the AVX2-based Haswell processor. The paper is organized as follows. Section (II) gives a short introduction into sparse grids and the derivation of our kernel. We focus more on the code implementation aspects that are relevant for our optimization steps than on the full mathematical derivation. Next, in section (III), we introduce the hardware, specifically the KNL processor and its operating modes, before shortly describing the main kernel changes and optimization attempts in section (IV). Results from thread, core-, node-, and cluster-level simulations are covered in section (V). Finally, we conclude this case study in section (VI) by presenting lessons learned and an outlook towards future contributions. standard approaches. To see how this works, a finite element method perspective is presented. Thus, we search for grids that live in the multidimensional space spanned by a set of basis functions, usually constructed hierarchically from an initial multilevel one-dimensional finite basis. The standard approach is to use piece-wise linear functions derived from the hat function ϕ(x), which is non-zero in the domain [−1, 1]:
In the nodal basis setting, these standard linear basis functions are defined by a level l and index i as
For efficiency purposes, we switch to a hierarchical setting, skipping functions of higher level that can be represented by hierarchical parents of lower level. For the boundary of the domain we have to add separate functions (of level zero), as they do not stem from the recursive construction ( Fig. (1a) ).
With such a representation we can now create a full grid in a d-dimensional unit hypercube [0, 1] d simply as the tensor product of one-dimensional basis functions. A sparse grid is then nothing more than a (smart!) choice of a subset of subspaces (i.e. the span of all basis functions of a given level) such that the number of grid points is kept low, while the accuracy of representation is maintained high. An example of a full grid and a sparse grid are given in Fig. (2a) and Fig. (2b) , respectively. More details on the mathematical formulations and properties of sparse grids can be found in [1] .
For the grid generated with the linear basis functions described thus far the number of boundary grid points becomes very large in higher dimensions. For the vast majority of cases such a high fidelity representation on the boundary (much higher than inside the hypercube domain) is computationally wasteful. Because of this, a modified linear basis (which we shall call from now on modlinear) can be used instead [3] :
With this recursive construction we can generate points as close to the bounds of the domain as possible without actually being on the boundary, which keeps the number of grid points generated much lower. The hierarchical representation of the first 3 levels of this basis are shown in Fig. (1b) . The main downside of the modlinear basis is the 4-way if-statement that is needed to decide how to treat each grid point. This is reflected in the numerical results obtained.
Any function that we want to represent on a given sparse grid is approximated as a linear combination of all the basis functions:f N ( x)= N j=1 α j ϕ j ( x). As we want to come as close as possible to our target function, we can modify our grids in a process of spatial adaptivity, where a new grid point is added in a place where the difference between the sparse grid approximationf and the actual solution f is high. One can think of several criteria for selecting these new points, however in our case, we employ the standard approach of maximum surplus refinement. As the name suggests, we are looking for the surplus (i.e. the coefficient α j ) with the highest absolute value and we add all hierarchical children of the corresponding ϕ j basis function, as well as any missing parent points (in order to keep the hierarchical structure alive). An example of a refinement step is shown in Fig. (2c) .
B. The sparse grid regressor
We focus in this article on the multidimensional and usually large-scale problem of classification using sparse grids. This is nothing more than a particular case of regression (as it will become clear in the following), so the same approach can be used in both regression and classification. The method is quite elegant in its simplicity. We present it here shortly, while the full derivation can be found in [4] .
Our regression problem can be stated as follows: for given input pairs {(
representing the coordinates of multidimensional points in space and their respective real values, we are trying to approximate as close as possible the exact function that satisfies y i =f ( x i ), ∀i. This can be equivalenced to solving the following variational problem:
where λ∈R is a regularization parameter. We restrict ourselves to the discrete setting of (adaptive) sparse grids, meaning that we search for an approximationf N ( x)= N j=1 u j ϕ j ( x) in the discrete d-dimensional space spanned by the basis functions ϕ j . By imposing a null gradient for our variational problem, along with a few additional constraints and simplifications (e.g. smoothness conditions for the solution), we are left with solving a system of liner equations in order to get the unknown surpluses u j : A conjugate gradients (CG) method is employed to solve this equation, meaning that our computational kernel relies heavily on optimizing two operations: v=B T u (which we shall call mult), and w=B v (which we shall call multT).
As mentioned before, in this article we tackle the equivalent task of binary classification, which in our sparse grids setting is nothing more than a regression problem where each y i can only take values in the set {−1, 1}. As a remark, we focus only on the computational aspects of the implementation; the competitive levels of accuracy and overall performance of the method itself have already been previously documented [3] .
C. Legacy implementation
The code we start with is based of a much larger library called SG++ [5] . This toolbox is meant to provide various lowand high-level functionalities for applications using adaptive sparse grids such as function interpolation, quadrature, partial differential equations, data mining and machine learning, function optimization, and uncertainty quantification. While most of the library is written in a manner that is aware of low-level optimizations (e.g. vectorization), there are some kernels that have been specifically tuned to get the maximum out of modern hardware. The classification application that we investigate is one such code, which has already passed through an optimization procedure for 256-bits (and lower) wide architectures, as well as improvements for the now retired MIC architecture [6] . The optimizations focus on the two matrixvector operations, mult and multT, which are decomposed on the underlying sparse grids and performed iteratively using 3 loops: one over the dataset instances, one over the grid points, and one over the dimensions. In the standard scheme (i.e. OpenMP only), operation mult is parallelized over the dataset, while multT is parallelized over the grid in order to remove loop carried dependencies. Unfortunately this second operation needs to perform a costly horizontal addition for each grid point, however a tree-like approach to gather the partial results and a specialized AVX-512 instruction help to improve the performance. Additionally, the dataset loop is split into chunks whose sizes fit the underlying hardware (i.e. the number of existing vector registers for the considered architecture). Data padding (by repeating dataset entries, which does not affect the classification results) and memory alignment are also used in order to make sure that the workload is equally distributed per worker threads and that the vector instructions generated are optimal. To further provide performance on the systems targeted, loop unrolling, manual vectorization with intrinsics, and a static scheduling strategy for the OpenMP loop parallelization are implemented. All these aspects become extremely relevant when discussing potential and practical performance gains from newer AVX-512 architectures. We again have to point out that, by choosing not to rely on the compiler for our vectorization, we are consciously doing a trade-off between what the compiler can perform and our specific knowledge of the targeted hardware.
With regards to the metric used to compare the different simulations, the application incorporates a Gflop/s counter based on measured execution time and the nature of the operations that are performed in the kernel. This metric is both relevant and reliable in measuring the actual performance of our classification. For thread-and core-level simulations, this measure might be misleading, as it masks latencies or time spent idling in hyperthreading, so we use the runtime instead.
D. Communication schemes
We offer now some details regarding the three legacy MPI communication schemes that were used in our tests. The first implementation (named Alltoallv) uses a straight-forward, although less efficient, idea: each process performs work on its chunk of the dataset and then a call to MPI Allgatherv() synchronizes the full result vector. With this approach being used for both mult and multT, there is a clear delimitation between computation and communication, with two synchronization steps per CG iteration, making this scheme, although fully deterministic from a numerical perspective, not a good choice for scaling across a high number of processes.
The first improvement to this scheme comes with the Async approach which, as the name suggests, performs an overlap of the computation and communication steps by transmitting each processes' chunk of computed result vector to the other processes through point-to-point non-blocking communication. Even though the current scheme could in theory be optimized to using collective non-blocking operations, this did not happen in our tests for our application due to two main limiting factors: 1) we do not reduce the number of (two) synchronization points required per CG iteration; 2) the overhead of calling MPI Ibcast for each data chunk outweighs the internal optimizations of the point-topoint variant for the chunk sizes used. On the other hand, the performance of this MPI scheme suffers when moving to the cluster-level simulations as the messages passed become shorter and thus the overhead of the communication scheme increases.
The third scheme considered is the one that proved in the previous optimization results to counter the drawbacks of the other approaches. By considering both mult and multT operations as non-divisible, the whole kernel can be parallelized across the dataset, and then each partial contribution summed up by just one final MPI Allreduce. The only downside to this approach is that now the resulting values are no longer deterministic due to the uncertain order of summation which can introduce some rounding errors. However, in the larger scope of the classification problem, this effect is completely negligible with respect to the quality of the learned classifier. All these three schemes were compared at the node-level to confirm expected results, while at the cluster-level only the best implementation was used.
III. HARDWARE SPECIFICS
The study of the code performance was conducted on the CooLMUC-3 system 2 of the Leibniz Supercomputing Center (LRZ) in Garching, Germany, part of the larger Linux-Cluster installation. It is a stand-alone 148-node KNL cluster with an Intel Omnipath interconnect, installed with a highly efficient water cooling system and the possibility to extend the current cluster with other heterogeneous compute nodes in the future. Each node is a Xeon Phi 7210-F with 64 cores, 2 FMA-based VPUs per core, 4-way hyperthreading, a nominal frequency of 1.3GHz, 96GB of DDR4 memory and 16GB MCDRAM. The theoretical peak performance is thus 394 Tflop/s DP, with a LINPACK performance of 255 Tflop/s DP. Unfortunately this particular cluster operates only with Intel Turbo Boost technology enabled, so care has to be taken when interpreting scaling results. Comparing with the Xeon processors, KNL operates at much lower frequencies, and the binning from the Turbo Boost is given in the specifications provided by Intel: for single tile runs, we need to add 200MHz to the nominal frequency; for all-tiles runs, only an extra 100MHz instead; and for high AVX-512 use, the penalty is 200MHz. 3 The KNL architecture comes with some distinct characteristics that have a significant impact on the peak performance that can be achieved by HPC applications and that make this processor attractive for a case study [7] . On each node, single cores have their own L1 cache, each tile (pair of cores) share an L2 cache, and the L3 cache that can usually be found in CPUs is completely missing. The L2 caches are kept coherent via a 2D bidirectional mesh that connects the tiles vertically and horizontally. With such an architecture, KNL can claim a significant improvement to the first generation Xeon Phis, the Knights Corner (KNC), with especially better performance in latency-bound codes. However, this kind of memory hierarchy also adds a new level of complexity to the optimization procedure, leaving more work for developers in order to find the optimal configuration for their codes.
Three clustering modes exist that address the locality of data with respect to cache requests: 4 • In the all-to-all mode the memory addresses are distributed uniformly across the node. With the worst performance of the three, this mode is only for debugging problems with faulty memory chips, where the other options might fail. As such, we have excluded it completely.
• Quadrant mode divides the tiles of the KNL node into quarters, each with its own memory controller. For the operating system this split is not visible, however at the architecture level this mode allows for a better communication both inside and between quarters, with lower latencies for L2 cache misses. Especially for applications that treat the KNL as a symmetric multi-processor (SMP), this mode offers usually the best results for virtually no additional code development. A hemisphere mode exists as well, which splits into halves instead of quarters.
• For applications that are heavily reliant on the nonuniform memory access (NUMA) model, there exists a sub-NUMA clustering mode called SNC-4 (which also has a version for halves, SNC-2), providing relief mainly to memory-bound codes. As an additional note, for some Xeon Phi chips (i.e. with 68 cores), the tile division is unbalanced from the start, so the SNC-4 mode might not deliver the desired performance. The suitability of this mode to our application is addressed when discussing our results.
With a larger, slower DDR4 main memory coupled with a smaller, faster high-bandwidth memory (HBM), the KNL node offers another, separate set of operating modes, that can be used independently to the clustering modes:
5
• The cache mode, as the name suggests, allows the MC-DRAM to be treated as the otherwise missing L3 cache, shared across all tiles. This has the benefit that no actions from the developers are needed (i.e. no changes in code), the HBM being completely transparent to the software and managed by the operating system itself. However, access to the main memory becomes significantly more expensive, as all requests have to first check the HBM.
• In flat mode the MCDRAM and DDR4 share the physical address space and it is left to the developers to decide which entities reside where. When sub-NUMA clustering is at play, the task of pinning the addressable memory becomes of high importance and for some HPC applications even cumbersome. More precise actions can be taken by using tools such as memkind for in-code allocations, whereas one can use instead numactl to associate memory chunks to different NUMA domains and thus improve data access times.
• The hybrid memory mode, as the name suggests, tries to play on the strengths and weaknesses of the other two approaches by allowing a 50/50 (or some other platform available ratio, e.g. 70/30) between L3 MCDRAM and addressable HBM. As the specifications of this mode describe, this option is not suitable for most HPC applications, which are not able to have such a fine control on memory objects locality. Consequently, this mode was not used in our study.
Some tried and true approaches to choosing the best memory mode need to be mentioned, as they appear in several best practice guides 6 . For applications that fit entirely in the 16GB limit of the MCDRAM, the flat mode with full allocation in the HBM should work best, as the latencies are minimal and no code changes are needed. For codes that are not memory bandwidth bound, flat mode with allocation in DDR4 would be equivalent in performance. If the application memory requirements go beyond the 16GB limit, the bandwidth sensitive code sections need to be isolated and allocated in MCDRAM. If this cannot be done, one option is to use up as much of the HBM with a command like numactl -p, with the rest of allocations in DDR4, although this might not always be optimal. In all other cases, or when this partial allocation fails, cache mode should be used instead.
For completeness, we also present comparisons with results obtained from the legacy parts of the code, the AVX2 optimized kernels, by running core and node simulations on the Xeon E5-2697 v3 (Haswell) compute nodes of the CooLMUC-2 cluster at LRZ (also part of the Linux-Cluster installation). Each node consists of 28-core dual-socket CPUs running at a nominal frequency of 2.6GHz and communicating via an FDR14 Infiniband interconnect. All simulations in this study were performed using the Intel C++ compiler v17 (which has support for OpenMP 4.5) and Intel MPI v2017. The usual recommended compiler flags were used. 6 
IV. KNL OPTIMIZATIONS
We shortly describe in the following the main changes that were done to the kernel in our attempt to best utilize the KNL architecture. As a general remark, even when the starting point is a legacy code that already underwent some optimization steps, it is important to be systematic in the assessment of where improvements may lie. (Optimization guides aiming at both compiler-based vecorization and at additional techniques exist and are particularly useful for such purpose.
7 ) The first performance boosts as well as missed opportunities come directly from the adaptation of the intrinsics used in the kernel, from AVX2 and MIC to AVX-512. 8 The embedded broadcasting capabilities of KNL offer a quite nice and cheap speedup by default, although measuring its effects is almost impossible, especially on a code that does not rely heavily on such operations. However, KNL does not implement the AVX-512DQ extension needed for some of the operations in our kernels (such as the ones for applying the same mask from memory to multiple vector registers). These instructions only became available with the more recent Skylake processors.
At this point a short discussion on the specific instructions in our classification code is needed. Depending on the choice of basis functions for our sparse grid technique we obtain two different kernels, which reflect the way data points get evaluated; to increase computational performance, the basis functions, with their intrinsic hierarchical structure, are decomposed into their corresponding pairs of level and index values, transforming our data structure from an array-of-structures (AoS) to a more suitable structure-of-arrays (SoA) representation. For the regular linear basis functions, the evaluation of data points is performed uniformly across the whole grid. Unfortunately this basis requires a very high number of grid points on the boundary of the computational domain, thus we employ the modlinear basis as well. Previous results [6] have shown that it drastically reduces the amount of grid points needed while maintaining the high accuracy of the classification procedure. However, as nothing comes for free, the gains we obtain by more than halving the time to solution are offset by a significant hit in the achieved computational performance due to a 4-way if-statement that decides which kind of mathematical operations should be performed when evaluating at each grid point. A somewhat intricate masking procedure is needed to help relieve this problem to some extent, thus obtaining just above half of the Flop/s of the linear basis. In this article we show that both kernels still perform well on KNL, however in the large-scale results we focus only on the modlinear implementation, as it allows the classification of larger datasets in less time with virtually no loss in accuracy of the learned classifier, making it of more interest for future optimizations on newer architectures.
Our classification application cannot hope to fully utilize the vector processing units (VPUs) of modern processors that rely heavily on fused multiply-add (FMA) operations to drive their computational performance gains. With just one FMA operation in a kernel otherwise formed of dependent instructions, the linear version is by construction not able to give much more than 50% peak performance, while the modlinear version, similarly to the previous results with AVX2, MIC and on GPUs [6] , fairs even worse and is supposed to give no more than a third of the theoretical peak. Of course, as almost all real-life applications do not adhere to the ideal theoretical case, in practice we can only expect to get sufficiently close to these predicted values.
Additional tests have been performed to determine if specific changes to the existing scheme would bring any improvements on the new architecture. For brevity, we skip the graphics for these attempts, and focus on the lessons learned from two of the most promising optimization attempts:
• We investigated the static partitioning that is being used to split the workload between processes and tried instead a dynamic scheduling using pragma directives of OpenMP with varying chunk sizes. Even though the workload per process should be theoretically balanced, we did observe a slight improvement when using the dynamic scheduling for low to medium number of threads, however when scaling to the full node the performance started worsening due to the overhead of context switching. Thus, we stuck with the more efficient, static (manual) partitioning.
• The loop chunk size was investigated more closely.
As mentioned, we manually unroll the dataset loop to account for the number of SIMD registers available. Unfortunately, this reasoning is not backed up by our tests, as the performance decreases when we increase the chunk size by one or more additional dataset instance, meaning that those extra registers are not idle (fact confirmed also by the generated assembly code): all registers not directly assigned by our intrinsic functions are indeed used for intermediary results, most likely in the effort of the compiler to optimize the execution of the computational loops.
V. RESULTS

A. The dataset
The binary classification problem was performed on artificial 5-dimensional chessboard datasets with varying sizes (between 2 18 and 2 28 for the training datasets). The points are randomly chosen in the domain [0, 1] 5 with 3 5 possible regions. The choice of partitioning in 3 regions per dimension makes it a harder classification problem, as the regular sparse grids (by construction) naturally tend to prefer powers of two. For each data point, the target class it belongs to is given by
This type of dataset has already been used in the previous optimization steps, and the parameters as well as obtained accuracy levels are well known (and shown in Table (I) ). The dataset can offer enough parallelism per thread when moving to cluster-level runs, making it perfect for obtaining representative results. Also, by being one of the datasets used in the past, relevant comparisons with the results on different previous architectures can be made.
B. Simulation results
We start the evaluation of our KNL optimized code by looking at the performance in a single thread setting. Fig.  (3) shows the comparison between the linear and modlinear kernels on Haswell and KNL running classification problems with 2 18 , 2 19 , and 2 20 training points respectively, thus offering also a glimpse into the weak scaling characteristics of our kernels (which are known to be good from previous results).
When moving to the full core, we expect that hyperthreading improves the single thread results. This is because, even though a single KNL thread is claimed to be able to achieve peak performance for ideal applications, in our case we have enough high latency operations (compared to e.g. Haswell) in the loops of our kernel to require hiding.
9 Fig. (4) shows the 9 http://www.agner.org/optimize/instruction tables.pdf comparison between the linear and modlinear kernels on the same training datasets with hyperthreading enabled. As for all other simulations that we have run, the pinning of the threads is done by imposing compact affinity, with threads preferring partially populated cores before empty ones, thus making sure that the results obtained are not influenced by thread migration. For our application we argue that compact affinity works best, as chunks are split consecutively to consecutive threads, so keeping neighbor threads close improves data locality. Also, similar data access patterns in literature have shown little difference between compact and scatter affinities [8] , however this cannot be generalized. As it can be observed from the plots, both 2 and 4 threads per core help to improve significantly the core performance. Using 3 threads per core is not recommended. This is due to the fact that on KNL the microarchitecture splits some of the internal workload into quarters, therefore a 3 thread configuration would use up less aggregated resources than the 2-or 4-threaded cases. For completeness we also add the single core results on Haswell. The node-level simulations bring some more insight into the performance of the KNL die. We collect the results for all possible core counts to better view the characteristics of the architecture, while also comparing a pure OpenMP implementation against the three MPI communication schemes used also in the old AVX2 optimized code. In all four cases we run simulations with both 2 and 4 OpenMP threads pinned per core, while also comparing the slower DDR4 and faster HBM by choosing the flat memory mode with and without allocation, respectively (via numactl -p). The size of the dataset tested was chosen to be the same as for previous results, making the comparison across types and generations of processor architectures possible. Fig. (5) presents the results obtained. The pure OpenMP performed best at the node-level and thus is used in the cluster-level simulations. (Fig. (6a) )
The results show also what we expected in terms of performance of the three MPI communication schemes, with the Allreduce approach having a smaller overhead. Of course, the true potential of this scheme is more relevant at the clusterlevel, where communication between different compute nodes becomes vital for the scaling profile of the application. One result that cannot be fully explained is the tendency of all three MPI-based schemes to have a sudden fall in performance when between 37 and 51 cores are used, in the case of two-threaded simulations. This kind of behavior does not seem to impact the overall peak performance that the application can deliver on the full node, however a more detailed investigation in the future is definitely warranted.
Again, we test our code on the Haswell as well, using the full node. A short look at the performance aspects is needed. For Haswell, the optimized linear kernel is just around 50% of theoretical peak (adjusted for Turbo Boost), while the modlinear around 30%. In contrast, the KNL sees a small dip in percentages of the theoretical peak. Overall, these results are in line with both older Intel CPUs, like Sandy Bridge, as well as results on heavy FMA-based GPUs, where percentages went as low as 30% for the linear case. The KNL performs upwards of 1.7 times better than Haswell (for the linear case, and more than 1.6x for modlinear) from a peak performance speedup of 2.28x, computed at base frequencies (Fig. (6b) ).
One of the most interesting results from our simulations was the realization that the KNL can suffer from a major downside when compared to the previous generation of Xeon Phis: while the KNC was a coprocessor, where the host processor would take the performance hit of running the operating system (OS), the KNL chip has to perform all the OS tasks itself. Our results show that this can indeed be an issue with regards to the number of cores available for HPC applications run on KNL, with the 63+1 configuration delivering the best performance (63 compute cores + 1 OS manager). For our application kernel, which is not memory bound by design, this result just proves additionally that the decision to use the quadrant clustering mode for our simulations is the right one. Therefore we argue that this inherent imbalance of the 63+1 workload distribution and the already quite good memory locality and access pattern of the data does not warrant the investigation of the SNC-4 clustering mode. Additionally, mainly due to this imbalance, a hybrid OpenMP -MPI scheme did not deliver better results than either the pure OMP or MPI implementations, as can be seen in the results in Fig. (5c) .
Lastly, regarding memory modes, as the dataset fits in the HBM, we see a degradation of performance in cache mode at the core-level, and, with no memory bandwidth issues for our code, virtually no difference between flat modes with and without localized allocation. At the node-level, for the OpenMP implementation we see again no changes in performance behavior, while for the MPI schemes a small reduction in performance on DDR4 was noticed.
The results from the node-level simulations prompted us to move forward with our tests only with the most efficient configuration (Fig. (6a) ). Thus, for the runs at the clusterlevel, we have set our KNL nodes in quadrant mode, pinning two threads per core for each of the 63 cores used in a node, and having one MPI task assigned per node, for a 1x(63x2) configuration. The Allreduce scheme was chosen for the communication between the nodes. The previous results on KNC, Haswell and other architectures at the cluster-level have shown the high quality of the Allreduce scheme for up to 512 compute nodes [6] . As such, even though our results are still in line with the scaling performances obtained previously, a more interesting aspect was the focus of our study: how does our kernel behave when dealing with memory requirements of the same magnitude as the HBM size. Therefore we have investigated training datasets of 2 26 , 2 27 and 2 28 instances. Please note that even though all results measure just aspects of the training of our classifier, the memory footprints have to account for all data used by the code, including the testing datasets of 2 22 , 2 23 , and 2 24 instances, respectively. (Powers of two and proportional ratios of training/testing dataset sizes were chosen for a more relevant comparison.) Our biggest dataset alone lies outside the MCDRAM at 20GB, with the second biggest cutting this number in half, making for a total footprint for that case a few gigabytes short of the 16GB threshold. Our smallest test, while still containing 5GB of training data, can fit more than comfortably in MCDRAM and thus was our control case.
The results at the HBM size limit are shown in Fig. (7) . At the scale of the plots no significant differences can be seen for the largest dataset for a low number of nodes, however, as the effects are cumulative, starting with 8 nodes we can clearly observe that cache mode offers better performance than flat MCDRAM. The allocation to DDR4 in flat mode also seems to perform well for problem sizes above 16GB. Overall, the best performance was obtained when running the KNL in quadrant flat mode with datasets that fill as much as possible, and are allocated in, the HBM. This way extra effects and latencies are kept to a minimum.
VI. CONCLUSIONS AND OUTLOOK
Our kernel, implementing manual vectorization and specific optimization techniques, performed very well on the targeted machine. The higher latencies of the AVX-512 instructions (compared to AVX2) needed the use of 2 hyperthreads per core. In a code with a good memory locality like ours, the quadrant mode of the KNL gave optimum results, backed also by the realization that our HPC kernel needed only 63 cores to maximize the performance of the chip. With a pure OpenMP scheme we have obtained speedups of 1.7x upwards at the node-level compared to the previous Haswell implementation, Fig. 7 . Scaling results of our optimized modlinear kernel on the CooLMUC-3 KNL cluster for up to 16 nodes. Sizes of the datasets make single node runs impossible. (For the largest dataset results on 2 nodes also do not fit in the 48 hours job limit of the cluster.) All three memory modes were tested, as the datasets used are of the same magnitude with the HBM size (16GB).
making for an improved runtime of our classification and cutting the training time of 2 18 instances of our chessboard dataset in 5-dimensional space from over 8.5 minutes to just shy over 5.5 minutes on a single compute node. With the current MPI communication scheme, our application showed to scale on our relatively modest cluster-level runs with input datasets of the same magnitude as the MCDRAM size, confirming the expected behavior of the KNL memory modes.
Overall, the obtained results prove that both our linear and modlinear kernels are ready to use the performance available on multi-and many-core architectures implementing the AVX-512 instruction set, and that manual vectorization, when done well, can unlock the full performance of the hardware. Future work will focus on testing and improving the scalability of our communication scheme on clusters larger than the one used so far, allowing for efficient processing of problems close to the DDR4 memory limit. Additionally, streaming scenarios might provide more scaling opportunities with less strain from inputoutput operations. With the lessons learned from this study we feel confident in moving forward to testing our data mining application on the Skylake family of processors, where we expect to see similar, close to optimal performance.
