Abstract: Intel's recent manycore processor Knights Landing (KNL) promises high performance for scientific applications. Careful tuning for the complex chip architecture is required to exploit the chip's hardware resources efficiently. This paper describes performance improvement techniques and demonstrates their effectiveness for scientific applications. Experiments were conducted with some of the National Aeronautics and Space Administration (NASA) advanced supercomputing (NAS) parallel benchmarks to analyse the effectiveness of: 1) advanced vector extensions (AVX-512) vectorisation support; 2) manycore threading support; and 3) the utilisation of thread affinities for different KNL modes.
Introduction
High performance computing (HPC) constantly evolves through novel computing architectures, scalable algorithms (Müller et al., 2018), and energy efficient solutions (O'Brien et al., 2017; Kaushik and Vidyarthi, 2018; Xiong et al., 2017; Digalwar et al., 2017) for solving scientific problems. For decades, these developments have kept HPC researchers engaged in solving emerging challenges and fine-tuning the available solutions at the various levels at which scientific applications are implemented.
In the past, supercomputers had only CPU cores and did not employ hardware accelerators. In order to improve the speed of certain algorithms, mostly data parallel algorithms, supercomputers adopted heterogeneous architectures, i.e., the general purpose CPUs were assisted by hardware accelerators. In fact, such heterogeneous architectures have reached top ranks in the Top500 list of supercomputing machines (Top500, 2018). Accordingly, performance improvement (PI) options for accelerator-based applications evolved (Jararweh et al., 2017).
In recent years, Intel's Knights Landing (KNL) architecture (Sodani, 2015) has attracted several HPC application developers (Adams et al., 2017), especially data scientists working on machine learning, deep learning, or 3D visualisation, owing to the possibility of providing over peta-scale performance using KNL CPU cores. KNL achieves accelerator performance as a bootable processor.
Additionally, KNL has a scalable system framework which can be reconfigured to obtain varying compute units, integrated memory, and fabrics. However, applications must be tuned diligently in order to reach the maximum performance. For scientific applications, this includes enabling vectorisation support, optimising memory access patterns, pinning the required number of threads/processors (based on data locality), and so forth.
The contributions of the paper are listed as follows:
1 It explores single core tuning techniques.
2 It investigates the tuning techniques for utilising the many cores of the KNL.
3 It demonstrates the performance efficiency of the KNL architecture that results from choosing appropriate KNL memory modes.
Experiments were conducted by executing the hybrid and OpenMP versions of the NAS parallel benchmarks (NPB) on the KNL partition of Taurus, Technical University Dresden, Germany. NPB are widely utilised by researchers for analysing the performance of applications and architectures (Rosales et al., 2016; Hassan et al., 2016; Narayana et al., 2017). The remaining sections of the paper are organised as follows: Section 2 discusses previous works; Section 3 describes the KNL architecture; Section 4 describes the features that support the performance efficiency of applications on KNL; Section 5 illustrates the findings on vectorisation, manycore features, and data locality options for the NPB; and Section 6 presents a few outlooks and conclusions.
Related work
Performance analysis of HPC applications (Mishra and Mishra, 2017; Li et al., 2017; Beserra et al., 2017; Gajić et al., 2017; Gong et al., 2018; Benedict et al., 2017) and tool-assisted performance monitoring of applications have remained a mandatory step in HPC and cloud application development for decades. Tools such as Paradyn (Miller et al., 1995; Roth and Miller, 2006), the Periscope tool (Benedict and Gerndt, 2012), the Scalasca toolkit (Geimer et al., 2006), tuning and analysis utilities (TAU) (Shende and Malony, 2005), Vampir (Knufer et al., 2008), mpiP (Jerey and Chambreau, 2018) and so forth have emerged for analysing OpenMP or message passing interface (MPI)-based scientific applications on various architectures.
As newer architectures and applications evolve, there have been efforts to tune the performance of applications with minimal user effort. These tuning efforts were based on performance metrics such as execution time, number of threads, energy consumption, utilisation factor, and so forth; the tuning processes were intended to be either manual or automatic. A few researchers have studied ways to tune linear algebra kernels in order to improve their energy efficiency (Jakobs et al., 2018), and others have built an energy tuning framework that automatically exploits the dynamism of applications (Schuchart et al., 2017). Wasi-ur-Rahman et al. (2017) have tuned MapReduce-based applications with a few tuning parameters.
In the recent past, Intel's many integrated core (MIC) architectures have attracted industrial and academic researchers to tune applications for high performance. Intel has motivated researchers by releasing a catalog of applications for MIC machines (Intel, 2016). Accordingly, a few researchers (Prace Info., 2017) have analysed the potential of utilising MIC for their applications. For instance, FeastFlow was analysed by Venetis et al. (2016), and scalability and data parallelism patterns were studied for stencil applications (Cebrián et al., 2017).
Memory access patterns and the hierarchical aspects of KNL's memories were discussed by a few researchers in the past. For instance, Perarnau et al. (2016) have discussed the PI achieved for applications when migrating data between the different memories of the KNL. Doerfler et al. (2016) have provided a theoretical analysis of memory bandwidth and memory hierarchies using visual roofline performance models.
The scalability of applications on KNL or Knights Corner (KNC) was studied by a few researchers. For instance, Dykes et al. (2016) have studied the scalability of KNL machines while experimenting with the Splotch visualisation algorithm. Their comparative study of MIC and graphical processing units (GPUs) has shown that hybrid programming of applications could improve the performance of Intel's KNC. Zhang et al. (2017) have studied the efficiency of executing deep learning models on multi-node KNL machines. Their early results showed a scaling efficiency of around 80 percent for the Caffe application. Haidar et al. (2016) have studied the scalability of algorithms such as lower upper (LU), QR, and Cholesky factorisations. They proposed a programming model to utilise manycore machines such as KNL efficiently.
Recently, with the emergence of Intel's KNL architecture, HPC researchers have endeavoured to fine-tune their applications or to analyse the performance of applications in order to ease parallelism on a KNL machine. For instance, Barnes et al. (2016) have analysed National Energy Research Scientific Computing Center (NERSC) workloads on KNL; Afanasyev and Voevodin (2017) and Liu et al. (2017) have shown that graph algorithms execute better on KNL than on GPUs, and in these works the authors have investigated the cases where the manycore and scalability features of applications could be improved on KNL; Hirokawa et al. (2018) have studied the PI options while porting and executing their Ab-initio real-time electron dynamics (ARTED) code on KNL machines. This paper studies the Intel KNL architecture and the PI techniques for HPC applications on KNL.
Intel KNL architecture -an overview
This section describes the overview of Intel's KNL architecture and its single node configuration settings in detail.
KNL architecture
Fundamentally, a KNL processor is an extension of the Silvermont Atom-based microarchitecture and belongs to the MIC series of servers/workstations. The processor units, memory units, IO/memory controllers, and interconnection units of a KNL machine are described below.
Processor units
The KNL processor is a bootable manycore architecture with over 8 billion transistors. The smallest replicated structure of the KNL processor is called a tile. In KNL, each socket has 36 tiles; each tile comprises two cores, and each core embodies two hardware threads and two hyperthreads, i.e., four threads per core (see Figure 1). In addition, the processor has two additional tiles for recovery or industrial maintenance purposes. The processor does not depend on serial computer expansion bus standards such as peripheral component interconnect express (PCIe), peripheral component interconnect extended (PCI-X), and so forth; it operates at 1.3 to 1.5 GHz. This architecture therefore enables scientific applications to utilise over 256 threads. A pictorial representation of the architecture is shown in Figure 1.
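The thread count quoted above follows directly from the tile layout. The short sketch below redoes that arithmetic. The function name and its defaults (36 active tiles, two cores per tile, and four threads per core, i.e., the two hardware threads plus two hyperthreads described in the text) are illustrative choices and not part of any KNL tooling.

```python
def knl_hardware_threads(tiles=36, cores_per_tile=2, threads_per_core=4):
    """Return (cores, hardware threads) for one KNL socket.

    Defaults mirror the figures given in the text; they are
    assumptions of this sketch, not values queried from hardware.
    """
    cores = tiles * cores_per_tile
    return cores, cores * threads_per_core


cores, threads = knl_hardware_threads()
print(cores, threads)  # 72 288
```

With all 36 tiles active this gives 72 cores and 288 hardware threads, which is consistent with the "over 256 threads" stated in the text.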
KNL memory units
KNL includes several levels of memory hierarchy:
1 High bandwidth L1 and L2 caches, i.e., the L1 cache resides inside a core and the L2 cache is shared between the cores of a tile.
2 Two double data rate (DDR4) memory units.
3 Four units of a 16 GB multi-channel dynamic random access memory (MCDRAM). The MCDRAM could be configured in different ways such that it could feed processors with high bandwidth data of over 450 GB/s (see Section 3.2).
Interconnections
A KNL processor is designed such that a tile is interconnected with the memory/IO controllers (MC/IO), the distributed tag directories (DTD) of tiles, and the other tiles of the node using a mesh ring topology. This interconnection is laid out on the IC (die). Hence, KNL is also referred to as an on-die mesh interconnect architecture. This interconnect could be configured to achieve high performance interconnect bandwidths of over 700 GB/s.
KNL single node configurations
KNL offers a number of configurations and a variety of performance options, which cater to the needs of scientific application developers. The single node configurations of KNL are classified in this paper into memory modes and cluster modes (see Figure 2).
Memory mode configurations
The memory units of KNL could be configured in three different modes namely, flat mode, cache mode, and hybrid mode. These memory modes are initiated in the basic input/output system (BIOS) settings of a machine. Rebooting KNL for changing the memory modes is, therefore, necessary as the appropriate circuits of the KNL circuit board have to be powered on. The characteristics of these memory modes of KNL are discussed as below:
1 Flat mode: If KNL is configured as a flat memory mode machine, it operates the available blocks of MCDRAM units as an addressable memory unit of a machine.
2 Cache mode: In this mode of operation, MCDRAM is utilised as cache memory while applications execute. Hence, applications that require high bandwidth processing may suit this mode well. However, in worst case scenarios, especially when cache misses occur, applications might not perform well.
3 Hybrid mode: Relying exclusively on the flat mode or the cache mode might not be a suitable solution for many clustered applications. In hybrid mode, a portion of the memory units is configured as flat and the remaining portion as cache. For instance, if an MPI application needs to be executed on a cluster, then the hybrid memory mode of KNL could be a performance efficient choice.
Only a few applications will be successful in this mode of operation because the performance of the application might suffer if the data is not available in the nearest MCDRAM block of KNL. 
Cluster mode configurations
In general, application data, including the instructions required for processing an application, are accessed in caches. If they are not found in the cache, the CPU controller requests the memory addresses of an application from the next level of memory in a machine. Each tile of KNL maintains a distributed tag directory (DTD) to achieve cache coherency. Thus, if the DTD of a tile (often referred to as a local cache) registers a cache miss, it reports the miss to all available DTDs. Subsequently, the missing memory address will be found in one of the memory units. Although KNL maintains cache coherency, it offers a few clustering options so that the memory addresses of an application are resolved with limited latency. The advantages and the characteristics of these clustering modes are described as follows:
1 All-to-all cluster: In this clustering mode of KNL, a tile queries all DTDs and searches for the required memory addresses in any available tile during cache misses. In this approach, the application might experience high latencies while finding the appropriate cache line. Hence, all-to-all cluster mode is not often set as the default mode for executing applications.
Advantages of KNL
Scientific application developers and users can benefit from this architecture in various ways, as listed below:
1 The most important advantage of KNL is that it can work as a standalone bootable machine. In contrast, its predecessor, KNC, requires a host processor to offload instructions to accelerators for imparting parallelism. Hence, KNC had performance concerns while transferring data from the host processor to co-processors;
2 Due to the availability of over 256 working threads in a socket, existing scientific applications could be scaled well. In addition, several parallel algorithms or energy efficient scalable applications could be developed based on this architecture; and,
3 An application need not be recompiled for MIC standards. Hence, any scientific application that is developed for a generic computing architecture could be run on KNL.
PI techniques
This section discusses the PI techniques for scientific applications executed on KNL architectures. KNL is designed such that the performance of an application could be tweaked at various stages of its execution, including the boot time of a machine. In this section, a few PI options of KNL are categorised and discussed (see Figure 3).
Boot time oriented PI approach
As KNL offers a few pre-defined BIOS configurations, such as selecting the clustering and memory modes of KNL at boot time, the machine could be configured according to the characteristics of the applications. For instance, a few scalability-aware parallel applications require sufficient cache in order to avoid cache contention, and a few other applications that have irregular data access patterns require the flat memory mode.
Compile time oriented PI approach -vectorisation
Vectorising selected code portions of an application is an important aspect of tuning for KNL. In general, the performance of some applications could be improved considerably by enabling vectorisation flags such as -xMIC-AVX512 or -xCORE-AVX2. The acronym AVX stands for advanced vector extensions. For instance, the performance of the gather and scatter instructions of scientific applications (which have irregular data access patterns) could be improved when the KNL-specific (hardware pre-fetching) compile time flags (-xMIC-AVX512-PF) are enabled. Similarly, aligning data at compile time avoids unnecessary cache misses on KNL. In short, choosing appropriate compiler optimisation switches improves the performance of applications.
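To make the benefit of wider vector registers concrete, the sketch below counts how many elements fit in one vector register and how many vector iterations a loop of a given length would need. It is an idealised model only: it assumes a perfectly vectorisable loop and ignores alignment, memory bandwidth, and the cost of handling a remainder. The function names are hypothetical.

```python
import math


def simd_lanes(register_bits, element_bytes):
    """Number of elements that fit in one vector register."""
    return register_bits // (8 * element_bytes)


def vector_iterations(n, register_bits, element_bytes):
    """Idealised number of vector iterations for a loop of n elements.

    A partially filled final vector counts as one full iteration.
    """
    return math.ceil(n / simd_lanes(register_bits, element_bytes))


# Double precision (8 bytes): 256-bit AVX2 versus 512-bit AVX-512 registers.
print(simd_lanes(256, 8), simd_lanes(512, 8))
print(vector_iterations(1000, 256, 8), vector_iterations(1000, 512, 8))
```

Under these assumptions, doubling the register width halves the iteration count of the loop; real speedups are usually smaller because memory access tends to become the limiting factor.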
Run time oriented PI approach
In addition to choosing a few compiler optimisation flags, the performance of KNL-based applications could be improved by tweaking applications at runtime. This includes mechanisms such as imparting scalability and energy-aware solutions: selecting the right number of threads and processes at runtime could improve the performance of an application. The users of KNL could set a few environment variables for enhancing the performance of applications. For instance, the Linux numactl command could be utilised for binding applications to a specific NUMA node of KNL, and the KMP_AFFINITY variable could be set for controlling the data locality of applications.
Another approach is to utilise online performance analysis tools for identifying and manually rectifying the performance concerns of applications (Benedict et al., 2015; Benedict and Gerndt, 2012; Geimer et al., 2012). There exist several online performance analysis tools such as Periscope, Scalasca, TAU, and so forth. In addition, the energy-specific analysis of applications could be accomplished using a few tools such as powertop, score-E, energy analyser (Benedict et al., 2015) and so forth (Hahnel et al., 2015).
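A run-time configuration of this kind might be assembled as in the sketch below, which only builds the environment variables and the command line and does not execute anything. The helper name, the executable name ./benchmark, and the choice of NUMA node 1 are hypothetical placeholders; which node number corresponds to MCDRAM depends on the configured memory and cluster modes and should be checked on the machine (for example with numactl --hardware).

```python
def build_launch(executable, threads, affinity="scatter", membind=None):
    """Return (env, cmd) for an OpenMP run with a given thread affinity.

    env holds OMP_NUM_THREADS and KMP_AFFINITY; cmd is prefixed with
    numactl --membind=<node> only when a NUMA node is requested.
    """
    if affinity not in ("compact", "scatter", "balanced"):
        raise ValueError(f"unknown affinity policy: {affinity}")
    env = {
        "OMP_NUM_THREADS": str(threads),
        "KMP_AFFINITY": f"{affinity},verbose",
    }
    cmd = []
    if membind is not None:
        cmd += ["numactl", f"--membind={membind}"]
    cmd.append(executable)
    return env, cmd


# Hypothetical example: 36 threads, scatter affinity, memory bound to node 1.
env, cmd = build_launch("./benchmark", 36, "scatter", membind=1)
print(env)
print(" ".join(cmd))
```

The returned pair could be passed to a process launcher or written into a batch script; keeping the construction separate from execution makes it easy to inspect the settings before a job is submitted.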
Apart from normal monitoring and analysis, autotuning tools could be applied for automatically tuning a few portions of applications with respect to the underlying machines. To note, KNL oriented autotuning tools are either not available or they are not in the production stage.
Program development time PI approach
Utilising an appropriate programming language for writing scientific applications could enhance the performance of applications. Since KNL has over 256 working threads (including different clustering modes), OpenMP or buffer-based MPI programming models or MPI+OpenMP programming models could be utilised for writing applications.
Scientific applications could be modified to include SIMD intrinsics in higher level code in order to support AVX512 instructions. These intrinsics are, in general, written into programs to enable compilers to vectorise the code. The impact of utilising such intrinsics could be analysed by checking the assembler code of the program. Notably, a few programmers work at the assembly level to improve the performance of applications, including vectorisation.
Post runtime oriented PI approach
A few researchers analyse the performance of scientific applications after executing them. In fact, there exist tools that analyse the performance or vectorisation efficiency of applications. For instance, the Intel Advisor tool supports survey analysis and trip count analysis of applications. To do so, the application has to be compiled with the MIC-based compiler flag (-xMIC-AVX512) prior to its execution. After the application has been executed, a trace report is generated which could be further analysed for performance problems.
The survey report analysis is utilised to understand the bottlenecks of applications; it pinpoints any inefficient portions of the vectorised code; it suggests further portions of code to vectorise; and so forth. In trip count analysis, the tool reports the number of iterations that loops execute. Scientific applications could also be traced, and the performance insights of these applications could be verified by a few performance analysis tools with robust visualisation of the parallel runtime.
Experimental results
The previous section classified a number of PI techniques into boot time, compile time, and run time techniques. This section presents the results obtained for a number of NPB and the experimental setup.
Experimental setup
All experiments, discussed in this paper, were executed at the KNL nodes available at the Zentrum für Informationsdienste und Hochleistungsrechnen (ZIH) -Technical University Dresden. For conducting experiments, the KNL nodes were remotely accessed and the applications were executed on the KNL node by submitting the jobs using the slurm batch system of the Taurus machine. In order to pursue experiments with varying KNL configurations (which should be modified at the BIOS mode of the machine), the nodes were rebooted with specific configurations and jobs were scheduled to these nodes based on Slurm reservations.
In the experiments, the NAS benchmark version 3.3.1 MultiZone (MZ) suite (Van der Wijngaart and Jin, 2003) was utilised. It is widely used by HPC researchers in various contexts in order to understand the parallel performance of machines or algorithms (Sundriya and Sosonkina, 2018). In addition, the NAS benchmarks were designed such that the number of processors, problem size, memory usage, and so forth can be defined prior to their execution. Throughout the experiments, the class C problem size (16 x 16 zones) of NAS was compiled using the Intel compiler with the Intel MPI library.
In brief, NAS version 3.3.1-MZ includes three hybrid benchmarks: the block tri-diagonal (BT) solver, the scalar penta-diagonal (SP) solver, and the lower-upper (LU) Gauss-Seidel solver. These benchmarks are written in Fortran 77 with MPI+OpenMP. These three pseudo-applications solve partial differential equations (PDE) resulting from the Navier-Stokes equations. During the execution, the rectangular problem domain is discretised into tiles of 3D meshes. Although these applications attempt to solve the same PDE, the utilisation of networking zones varies among BT, SP and LU. For instance, the BT application has an increasing number of networking zones with increasing problem size, i.e., BT class A has x = 4, y = 4, z = 1 zones, BT class C has x = 16, y = 16, z = 1 zones, and BT class D has x = 32, y = 32, z = 1 zones. In contrast, SP and LU have a fixed number of networking zones with increasing problem sizes, i.e., classes A, B, C, D and E of SP and LU have x = 4, y = 4 and z = 1 networking zones.
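The zone counts quoted above are simply the product of the zone grid dimensions. The sketch below tabulates them for the BT classes using only the grid sizes given in the text, and shows how many zones each MPI rank would hold under an even split. The even split is an assumption made for illustration; the benchmark's own load balancing may assign zones differently.

```python
# Zone grids (x, y, z) for the BT classes as stated in the text.
BT_MZ_ZONE_GRID = {"A": (4, 4, 1), "C": (16, 16, 1), "D": (32, 32, 1)}


def zone_count(grid):
    """Total number of zones for an (x, y, z) zone grid."""
    x, y, z = grid
    return x * y * z


def zones_per_rank(grid, ranks):
    """Average zones per MPI rank, assuming an even distribution."""
    return zone_count(grid) / ranks


for cls, grid in BT_MZ_ZONE_GRID.items():
    print(cls, zone_count(grid), zones_per_rank(grid, 4))
```

For class C and four MPI ranks this gives 256 zones in total, or 64 zones per rank on average.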
Boot time oriented PI options
The discussion in this subsection was driven in two parts:
1 At first, the behaviour of the applications in the SNC-cache mode of KNL is illustrated. In addition, the impact of the thread affinities compact, scatter, and balanced on the performance of the applications is analysed.
2 Later, three different combinations of clustering and memory modes of KNL (SNC-cache, quadrant-cache, and SNC-flat) were compared for these applications.
SNC-cache mode
Here, the NAS hybrid applications (MPI+OpenMP) were executed on the KNL machine using the SNC with cache mode. The NAS applications were run with four MPI ranks and 36 threads per rank, totalling 144 threads. In the experiments, the following environment variable settings were utilised to select the affinities:
export KMP_AFFINITY=scatter,verbose
export KMP_AFFINITY=compact,verbose
export KMP_AFFINITY=balanced,verbose
In the SNC-cache mode of KNL, OpenMP threads were assigned to the KNL tiles using these three thread affinity policies. Figure 4 shows that, in SNC-cache mode, the scatter and balanced affinity policies result in better execution times than the compact policy.
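The difference between the three affinity policies can be pictured with a small placement model. The sketch below is a simplified illustration and not the Intel OpenMP runtime's actual algorithm: it assumes a flat list of identical cores with a fixed number of hardware threads each, and returns the core index assigned to each OpenMP thread.

```python
def place_threads(policy, n_threads, n_cores, threads_per_core=4):
    """Return a list whose i-th entry is the core index of OpenMP thread i.

    compact  : fill all hardware threads of a core before using the next
    scatter  : deal threads round-robin across cores
    balanced : spread threads evenly, keeping neighbouring ids on one core
    """
    if n_threads > n_cores * threads_per_core:
        raise ValueError("more threads than hardware thread contexts")
    if policy == "compact":
        return [t // threads_per_core for t in range(n_threads)]
    if policy == "scatter":
        return [t % n_cores for t in range(n_threads)]
    if policy == "balanced":
        base, extra = divmod(n_threads, n_cores)
        placement = []
        for core in range(n_cores):
            # The first `extra` cores take one additional thread.
            placement.extend([core] * (base + (1 if core < extra else 0)))
        return placement
    raise ValueError(f"unknown policy: {policy}")


for policy in ("compact", "scatter", "balanced"):
    print(policy, place_threads(policy, n_threads=8, n_cores=4))
```

In this toy case with eight threads on four cores, compact uses only two of the four cores, while scatter and balanced use all of them. This may be one contributing factor when compact placement performs worse, although the model says nothing about caches or memory traffic.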
Comparison of SNC-cache, quadrant-cache and SNC-flat
With the same number of threads and MPI ranks, the applications were executed in order to understand the impact of three KNL modes, namely the SNC-cache, quadrant-cache and SNC-flat modes. For instance, SNC-cache means that the KNL node was configured with the SNC-4 cluster mode and the cache memory mode. The number 4 in SNC-4 denotes the number of blocks of tiles in a KNL. The modes of KNL were explained in the previous section, and the other mode names are to be read in the same way. In these experiments, applications were fixed to the scatter thread affinity option. Table 1 shows the execution times of the applications in seconds when executed in the three modes of KNL. It can be noticed from this table that the SNC-flat mode of KNL had the worst performance when compared to the other modes. This was due to the fact that the NAS applications had to refer to the MCDRAM memory units as addressable memory in the case of SNC-flat mode. In contrast, the first two columns of the table, referring to the SNC-cache and quadrant-cache modes, utilised the high bandwidth capability of KNL (as MCDRAM acted as cache).
Impact of vectorisation support -vec2 vs. vec512 (compile time oriented)
In general, vectorisation is effective for data intensive applications that have loops without dependencies. Here, vectorisation was applied to the NAS applications using compiler flags. These flags (-xCORE-AVX2 and -xMIC-AVX512) enabled the generation of AVX2 and AVX512 instructions, respectively. Figure 5 clearly shows that the applications vectorised for AVX512 (KNL-specific) outperformed those vectorised for AVX2 in all cases. Notably, LU and BT had a large PI when compiled for the AVX512 instructions.
Effects on number of threads (runtime oriented)
It was also interesting to understand the efficiency of KNL in terms of the number of threads. To illustrate this scenario, the OpenMP version of the NAS parallel benchmark BT class A was run in this experiment. In addition, experiments were conducted with different thread affinities while increasing the number of threads from 1 to 256. Table 2 shows the execution time of NAS-BT. As seen in the table, the execution time improved with an increasing number of threads. However, the performance started to deteriorate from 128 threads onwards. These findings were observed for all three affinity cases of the performance study of NAS-BT.
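Scaling results of this kind are commonly summarised as speedup and parallel efficiency relative to the single-thread run. The sketch below computes both and picks the thread count with the lowest execution time. The timings in it are made-up placeholder values chosen only to exercise the functions; they are not the measurements of Table 2.

```python
def speedup(t_serial, t_parallel):
    """Speedup of a parallel run over the single-thread run."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, n_threads):
    """Parallel efficiency: speedup divided by the number of threads."""
    return speedup(t_serial, t_parallel) / n_threads


def best_thread_count(timings):
    """Thread count with the lowest execution time in a {threads: seconds} map."""
    return min(timings, key=timings.get)


# Hypothetical timings in seconds (NOT taken from the paper).
timings = {1: 100.0, 16: 8.0, 64: 2.5, 128: 3.0}
t1 = timings[1]
for n, t in timings.items():
    print(n, round(speedup(t1, t), 2), round(efficiency(t1, t, n), 3))
print("best:", best_thread_count(timings))
```

A falling efficiency with a still rising speedup indicates diminishing returns, whereas a falling speedup, as reported from 128 threads onwards, means that adding threads actively hurts the run.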
Conclusions
Many-core architectures are promising architectures for scientific application developers belonging to the different scientific communities, including the machine learning-based AI community. Although a few simulation-based research efforts have been carried out in the past, the performance analysis study of scientific applications in the real test environment is limited. This paper explored and categorised the PI options of scientific applications on KNL machines. Experiments were conducted for NAS hybrid pseudo-applications at the KNL machine of ZIH-TU-Dresden and the performance results from the experiments were discussed.
