Abstract
The trade-off between the cost-efficiency of powerful computational accelerators and the increasing energy needed to perform numerical tasks can be tackled by implementing algorithms on the Intel Many Integrated Core (MIC) architecture. Achieving the best performance requires appropriate optimization and parallelization approaches throughout the whole design process of the algorithms. Monte Carlo and quasi-Monte Carlo methods are well suited to systems with a large number of computational cores. In this paper we present the advances in our studies on the performance of algorithms for solving multidimensional integrals on the Intel MIC architecture and compare it with the performance of Monte Carlo methods. The fast implementations are due to the high degree of parallelism, achieved with the Intel MIC architecture, in the operations on the many coordinates of the sequences. These implementations are easy to integrate and demonstrate high performance in terms of timing and computational speed.
Keywords: Monte Carlo methods, Intel MIC architecture, performance analysis
Introduction
Intel's Many Integrated Core (MIC) architecture is used in the Intel Xeon Phi line of processors, which serve as co-processor cards in the first generation but can be used as fully functional main processors in subsequent generations. In our current high-performance computing system they are used as co-processors in servers equipped with standard Intel Xeon CPUs. Even with this limited functionality, the Intel Xeon Phis compete with GPU cards for the role of accelerators in heterogeneous systems, where they can significantly improve computational efficiency and power consumption compared with standard CPUs. Details on how to program for the Intel Xeon Phi coprocessors can be found in [1]. Some of their main characteristics are: 1) vector units that process several integer or floating-point numbers at once; 2) many cores running at low frequencies; and 3) availability of hyperthreading. In practice, if the vector instructions on such accelerators are not used, they perform like slow regular CPUs. The Intel compilers available in the Parallel Studio XE 2016 package provide direct access to the vector instructions via compiler intrinsics, which makes these instructions easier for program developers to use. Another possibility is the direct coding of assembly instructions inside C code, which is more complex but still feasible for Xeon Phi accelerators. The different ways of using Intel MIC accelerators and their inherent composite structure lead to multiple parallelization approaches that can be applied to such systems. When one tries to implement them in practice, various choices have to be made at each step, and the particular target system guides these choices.
For our tests we used the Avitohol High Performance System, built with Xeon Phi 7120P accelerators and hosted at our institute. Avitohol consists of 150 SL250S servers, each equipped with two Xeon E5-2650 v2 CPUs at 2.60 GHz and two Xeon Phi 7120P accelerator cards. The total RAM accessible to the regular CPUs and to the accelerator cards is 9600 GB and 4800 GB, respectively. The operating system on the servers is Red Hat Enterprise Linux, while Intel's own special version of Linux (part of the MPSS package) is installed on the accelerators. Currently the exact versions on the servers and of the MPSS are 6.7 and 3.6-1, respectively. The system ranked 332nd in the Top 500 list when it entered operation, with a theoretical peak performance of about 413 TFLOP/s, of which 90% is contributed by the accelerators. Effective use of the accelerators is therefore the only way to fully leverage the power of such systems. However, many software packages do not have versions optimised for accelerators.
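The arithmetic behind these system figures can be sketched as follows. Only the server count, the two devices of each kind per server, the CPU clock and the RAM totals come from the text. The core counts, the accelerator clock and the flops-per-cycle values are assumptions based on public specification sheets for the Xeon E5-2650 v2 and Xeon Phi 7120P, so the result is a rough consistency check rather than the official calculation.

```python
# Rough consistency check of the quoted system figures.
# From the text: 150 servers, two CPUs and two accelerator cards per server,
# CPU clock 2.6 GHz, RAM totals of 9600 GB (CPUs) and 4800 GB (accelerators).
# ASSUMPTIONS (public spec sheets, not stated in the text): core counts,
# accelerator clock, and double-precision flops per cycle per core.

SERVERS = 150
CPUS_PER_SERVER = 2
PHIS_PER_SERVER = 2

CPU_CORES, CPU_GHZ, CPU_FLOPS_PER_CYCLE = 8, 2.6, 8      # assumed: 256-bit AVX add + mul
PHI_CORES, PHI_GHZ, PHI_FLOPS_PER_CYCLE = 61, 1.238, 16  # assumed: 512-bit fused multiply-add


def peak_tflops(units, cores, ghz, flops_per_cycle):
    """Theoretical double-precision peak in TFLOP/s for `units` identical devices."""
    return units * cores * ghz * flops_per_cycle / 1000.0


cpu_peak = peak_tflops(SERVERS * CPUS_PER_SERVER, CPU_CORES, CPU_GHZ, CPU_FLOPS_PER_CYCLE)
phi_peak = peak_tflops(SERVERS * PHIS_PER_SERVER, PHI_CORES, PHI_GHZ, PHI_FLOPS_PER_CYCLE)
total_peak = cpu_peak + phi_peak
accelerator_share = phi_peak / total_peak

# Memory per device, derived only from the totals given in the text.
ram_per_server_gb = 9600 / SERVERS
ram_per_card_gb = 4800 / (SERVERS * PHIS_PER_SERVER)

if __name__ == "__main__":
    print(f"CPU peak:          {cpu_peak:.1f} TFLOP/s")
    print(f"Accelerator peak:  {phi_peak:.1f} TFLOP/s")
    print(f"Total peak:        {total_peak:.1f} TFLOP/s")
    print(f"Accelerator share: {accelerator_share:.1%}")
    print(f"RAM per server:    {ram_per_server_gb:.0f} GB, per card: {ram_per_card_gb:.0f} GB")
```

Under these assumptions the total comes out close to the quoted peak, with the accelerators contributing a little under nine tenths of it.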
Monte Carlo simulations are an important type of computation routinely run on HPC systems like Avitohol. They are especially important for modelling real-world phenomena for which deterministic methods have not yet been developed. Typically, Monte Carlo methods are amenable to efficient parallelisation. Computer implementations of Monte Carlo methods usually employ pseudorandom generators when sampling random variables, and there is a large amount of research on the quality and performance of such generators. When using them on HPC systems it is important to be able to use parallel independent streams of pseudorandom numbers; otherwise results from simulations on different processors may coincide or be correlated, effectively removing the advantage of parallelism or introducing bias into the results. When developing a computer implementation of a Monte Carlo algorithm, one usually relies on an established library of pseudorandom number generators, such as the specialised SPRNG package ([2]) or the general Intel Math Kernel Library (MKL).
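The need for separate streams can be illustrated with a minimal sketch. It uses Python's standard `random` module (a Mersenne Twister generator) rather than SPRNG or MKL, and the function names and the test integrand are hypothetical choices made for illustration. Workers that share a seed return identical estimates, so averaging them gains nothing. Distinct seeds give different estimates, but, unlike the dedicated libraries mentioned above, they do not by themselves guarantee statistical independence of the streams.

```python
import random
from statistics import mean


def mc_estimate(seed, n_samples):
    """Estimate the integral of x*y over the unit square (exact value 1/4).

    Each call owns a private generator, mimicking one worker in a parallel run.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += rng.random() * rng.random()
    return total / n_samples


def combined_estimate(seeds, n_samples):
    """Average the per-worker estimates, as a parallel run would do."""
    return mean(mc_estimate(seed, n_samples) for seed in seeds)


if __name__ == "__main__":
    n = 10_000
    # Four workers that accidentally share one seed: four copies of one result.
    shared = [mc_estimate(1, n) for _ in range(4)]
    # Four workers with distinct seeds: four different estimates to average.
    distinct = [mc_estimate(seed, n) for seed in (1, 2, 3, 4)]
    print("shared seed:   ", shared)
    print("distinct seeds:", distinct)
    print("combined:      ", combined_estimate((1, 2, 3, 4), n))
```

In the shared-seed case the combined estimate has the same error as a single worker, which is the loss of parallel advantage described above.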
In this work we used Intel's implementation of the set of 6024 Mersenne Twister pseudorandom number generators MT2203 from Intel MKL, since the parameters of the generators are meant to provide mutual independence of the corresponding sequences. We consider this to be a reasonable choice when implementing Monte Carlo algorithms.

An alternative way of implementing stochastic simulations is to use scrambled quasi-random sequences. In some cases these sequences may be used as a drop-in replacement for pseudorandom numbers in Monte Carlo algorithms, with the aim of achieving faster convergence. Because they are fully or partially deterministic, they provide better theoretical error bounds for suitable sets of functions. In many algorithms they also behave well in practical use, achieving convergence rates that are substantially better than the usual $O(N^{-1/2})$ rate of Monte Carlo methods. One established measure of the quality of quasi-random sequences is the star-discrepancy $D_N^{\star}$, which is related to the integration error via the classic Koksma-Hlawka inequality (see e.g. [6]). Using this measure, one can single out the class of low-discrepancy sequences, which are those whose discrepancy can be proved to converge to zero at a rate of $O(N^{-1} \log^{s} N)$, where $s$ is the dimension. This rate is presumed to be the optimal one ([11]). Other measures of irregularity are also used in order to cover other function classes or use cases, for example when the functions to be integrated are periodic. The practical behaviour of low-discrepancy sequences in an algorithm is highly dependent on the nature of the function to be integrated. In practice the observed rates of convergence are frequently better than what one would expect from the theoretical
