Abstract-In this article we describe the implementation of the finite element numerical integration algorithm for the Xeon Phi coprocessor. The coprocessor is an extension of the idea of a many-core specialized unit for calculations and its performance is intended to be competitive with the current families of GPUs. Its main advantages are the built-in set of 512-bit vector registers and the ease of transferring existing codes from standard x86 architectures. However, the differences between standard x86 architectures and the Xeon Phi do not guarantee performance portability. We choose an alternative approach and, instead of porting standard multithreaded code, we adapt to the Xeon Phi the OpenCL algorithms for finite element numerical integration that we developed previously. The algorithm is tested for standard FEM approximations of selected problems. The obtained timing results allow us to compare the performance of the OpenCL kernels executed on the Xeon Phi and on contemporary GPUs.
I. MOTIVATION
IN RECENT years there has been a noticeable increase in the popularity of programming with the use of graphics cards. Their computing power allows for significant acceleration of calculations for properly implemented programs. However, there is a price to be paid, in the form of a complex programming model with a complicated memory organization [1], [2]. The huge performance of GPUs can be seriously limited by data transfers between different memory levels. Therefore, an important step is to design an algorithm that takes into account the characteristics of the memory access mechanisms of a particular architecture.
The development of multi-core architectures has resulted in many interesting ideas for the further evolution of hardware for scientific and technical calculations. GPUs are an example of massively multi-core microprocessors with a large number of relatively simple cores equipped with a small amount of memory. Another development trend in microprocessor architecture is to increase the number and width of vector execution units within a single processor, clearly visible in recent general purpose cores [3]. Yet another idea was to combine the architecture of a general purpose processor with SIMD units encountered in graphics cards. The first example which achieved fairly considerable popularity is the CellBE architecture used in Sony PlayStation 3 consoles and successfully adapted to scientific purposes as the PowerXCell 8i processor [4]. In our previous studies, we focused on the development of an algorithm for finite element numerical integration on the aforementioned processors, which resulted in highly efficient implementations for higher order elements of the Discontinuous Galerkin approximation [5]. At the same time, the authors developed a version for graphics cards, which was then successfully redesigned to use a standard approximation of the finite element method [6]. The resulting version of the algorithm has been tested on different types of graphics cards and the results of these tests will soon be presented in [7]. Both of the mentioned architectures can be considered predecessors of the new Intel Xeon Phi architecture, which combines a large number of cores with wide vector units in each core. In contrast to standard GPUs, the coprocessor cores are less numerous and their complexity lies in between standard, general purpose cores and simple GPU cores. As in GPUs, the Xeon Phi uses a similar memory organization and therefore all codes developed for graphics cards should be easily adapted to the coprocessor architecture, but, as in all architectures of this type, data movement between different levels of memory may become an issue of primary importance [8]. With the introduction of Intel Xeon Phi numerical coprocessors, there is a need to test the previously developed algorithms on the new architecture and verify whether the widely advertised adaptability of existing codes also applies to the transition from GPU to coprocessor.

(This work was developed within the project DEC-2011/01/B/ST6/00674 financed by the Polish National Science Centre.)
II. NUMERICAL INTEGRATION ALGORITHM
The numerical integration algorithm is one of the most important parts of finite element method codes. FEM assumes the division of the whole computational domain into elements, for which the integrals corresponding to pairs of element basis functions are calculated and the results are collected in local element stiffness matrices. Local load vectors are also obtained through integration of the corresponding terms. The final form of the formula for calculating an example entry of the element stiffness matrix depends on the form of the weak statement for the considered problem and can look like (1).
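Although the exact notation of (1) may differ, an entry of this kind, using the coefficients and basis functions introduced below and the convention that the derivative index 0 denotes the function value itself, can be sketched as

$$ A^e_{ij} = \int_{\Omega_e} \sum_{i_D=0}^{3} \sum_{j_D=0}^{3} C^{i_D j_D}\, \frac{\partial \phi_i^S}{\partial x^{i_D}}\, \frac{\partial \phi_j^S}{\partial x^{j_D}}\, d\Omega $$

where the symbol $A^e_{ij}$ for the stiffness matrix entry is introduced here only for illustration.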
In the formula above, C^{iD jD} are coefficients that depend on the problem, with iD, jD = 0, 1, 2, 3, and φ_i^S, φ_j^S are global basis functions. In order to calculate the integrals we need to perform a change of variables, which means that the integration is done for a particular type of reference element. The transformation from the reference element to the real element is denoted by x(ξ). For the reference element we use shape functions instead of global basis functions and apply one of the forms of quadrature; in our case we used one of the most popular, the Gaussian quadrature. This quadrature transforms the integral into a sum over integration points within the reference domain. The number of integration points depends on the required accuracy of the calculations and on the type of the reference element. For NQ integration points with coordinates ξ_Q and weights w_Q we can transform the integral (1) into the sum (2).
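Using the same conventions, the corresponding quadrature sum over the integration points can be sketched (again only as an illustration of the likely form) as

$$ A^e_{ij} \approx \sum_{Q=1}^{N_Q} \sum_{i_D=0}^{3} \sum_{j_D=0}^{3} C^{i_D j_D}(\xi_Q)\, \frac{\partial \hat{\phi}_i^S}{\partial x^{i_D}}\bigg|_{\xi_Q} \frac{\partial \hat{\phi}_j^S}{\partial x^{j_D}}\bigg|_{\xi_Q} \det J^T_e(\xi_Q)\, w_Q $$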
In (2), φ̂_i^S and φ̂_j^S are the shape functions and J_e^T is the Jacobian matrix of the transformation x(ξ).
The performance of the numerical integration algorithm depends greatly on the problem being solved (the weak formulation) and on the approximation method employed. With the use of standard linear approximation, the time needed to create an element stiffness matrix is relatively small. From the computational point of view, the numerical integration algorithm consists of multiple independent calculations, one for each element. For this reason, the computational cost increases with the growing number of elements. The calculated integrals correspond to the different terms in the weak formulation of the problem, for which a matrix of coefficients for integration has to be defined. Therefore, for various problems we obtain different combinations of integration components for the partial integrals of the test functions.
The problem dependent contribution mainly consists of the set of coefficients for numerical integration. Besides the standard iD and jD indices that correspond to the different spatial derivatives of the test and trial functions, there can also be a second pair of indices, iE and jE. These indices are introduced because, for vector problems, the same approximation can be used for different unknowns in the solved system of partial differential equations (PDEs). Hence, for each combination of iD and jD there may be a small matrix of coefficients with dimension NE equal to the number of equations in the solved system of PDEs. Moreover, in the most general case there may be different values of the coefficients at each integration point. Hence, for the generic numerical integration algorithm, the array of coefficients should be considered in the form (3).
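A minimal sketch of one possible layout of such an array is given below; the ordering of the indices and the constants NQ and NE are assumptions made only for illustration (NE = 1 for the scalar problems considered later).

```c
/* Hypothetical layout of the problem dependent coefficient array:
   one small NE x NE matrix for every pair of derivative indices
   iD, jD = 0..3 (index 0 denoting the function value itself),
   possibly with different values at every integration point. */
#define NQ 6   /* number of Gauss points for the reference prism */
#define NE 1   /* number of equations in the solved PDE system   */

double coeff[NQ][4][4][NE][NE];
/* coeff[Q][iD][jD][iE][jE] - coefficient multiplying the iD-th derivative of
   the test function and the jD-th derivative of the trial function, for the
   (iE, jE) pair of unknowns, at integration point Q */
```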
The corresponding right hand side (load) vector is calculated using formula (4).
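A load vector entry of this kind, with an assumed symbol $D^{i_D}$ for the right hand side coefficients (in the general case four of them, one per derivative index), can be sketched as

$$ b^e_{i} \approx \sum_{Q=1}^{N_Q} \sum_{i_D=0}^{3} D^{i_D}(\xi_Q)\, \frac{\partial \hat{\phi}_i^S}{\partial x^{i_D}}\bigg|_{\xi_Q} \det J^T_e(\xi_Q)\, w_Q $$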
As the conclusion of the definition of the numerical integration problem, we provide the algorithm for computing stiffness matrices and load vectors for a set of elements of the same type and order of approximation; a sketch of the algorithm is given below. As we can see from the algorithm, we can either read or compute most of the necessary components for numerical integration. This leads us to the conclusion that we can steer the amount of data sent from memory and the amount of computation, depending on the available hardware resources and the problem solved.
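A C-like pseudocode sketch of this algorithm (algorithm 1) is shown below; the helper routines and the split between values read from memory and values computed on the fly are only illustrative.

```c
/* for every element in a set of elements of the same type and approximation order */
for (int e = 0; e < n_elements; e++) {

    /* read (or compute) element data: geometry and problem coefficients */
    read_geometry(e, geo);
    read_or_compute_coefficients(e, coeff);

    zero(A_e);                       /* element stiffness matrix */
    zero(b_e);                       /* element load vector      */

    for (int Q = 0; Q < NQ; Q++) {   /* loop over integration points */

        /* Jacobian of x(xi), its inverse and determinant, computed from the
           geometry and the reference shape function derivatives */
        compute_jacobian(geo, dphi_ref[Q], J, J_inv, &detJ);

        /* derivatives of shape functions with respect to real coordinates
           (entry 0 stores the shape function value itself) */
        transform_derivatives(dphi_ref[Q], J_inv, dphi);

        for (int i = 0; i < n_shape; i++) {
            for (int iD = 0; iD <= 3; iD++)
                b_e[i] += d_coeff(Q, iD) * dphi[i][iD] * detJ * w[Q];

            for (int j = 0; j < n_shape; j++)
                for (int iD = 0; iD <= 3; iD++)
                    for (int jD = 0; jD <= 3; jD++)
                        A_e[i][j] += c_coeff(Q, iD, jD)
                                   * dphi[i][iD] * dphi[j][jD] * detJ * w[Q];
        }
    }

    /* store the computed stiffness matrix and load vector */
    write_element_matrices(e, A_e, b_e);
}
```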
In our case, we focused on the problem of convection-diffusion with NE = 1 in two cases: one with the simple Laplace equation, where the coefficient matrix C is sparse and coefficients appear only on the main diagonal, i.e. for iD = jD (3 coefficients for the stiffness matrix, one for the right hand side (RHS) vector), and a second with an enhanced convection-diffusion problem with the full sixteen coefficients for the stiffness matrix and four for the RHS vector. Furthermore, for the Laplace task all stiffness matrix coefficients were the same for all Gaussian integration points, while the RHS coefficients differed between points. In the second, convection-diffusion task, all coefficients were constant for all Gauss points. For our reference elements we use prisms with 6 degrees of freedom. Our assumptions are illustrated by the data in Table I. For the optimization of data transfer we need to decide which coefficients should be computed on the host system side and which on the accelerator side. This depends on the available resources and the type of the solved problem. The amount of data to send or store for one element can be observed in Table II.
The data per element (number of values) are as follows:
- Gauss points data: 24
- shape functions at one integration point: 24
- shape functions, total: 144
- geometric data: 18
- Jacobian terms at one integration point: 10
- Jacobian terms, total: 60
- coefficients at one integration point (Laplace): 4
- coefficients, total (Laplace): 9
- coefficients at one integration point (convection-diffusion): 20
- coefficients, total (convection-diffusion): 20
For the GPU implementation the most important part is a proper organization of data transfers and the utilization of limited resources. In order to port the code to the Xeon Phi coprocessor we need to reorganize the code, based on the experience gained when implementing the numerical integration for the PowerXCell 8i architecture.
III. INTEL XEON PHI
With the development of multi-core architectures and the simultaneous trend of using graphics cards for calculations, an idea came up to combine several different architectures in a single hardware unit whose individual elements would be responsible for processing different types of code fragments. The first device of this type, the PowerXCell 8i processor mentioned earlier, was unveiled by IBM and was equipped with a general purpose core based on the IBM Power architecture (the Power Processing Element) and several specialized SIMD cores (Synergistic Processing Elements). Its hybrid design allowed for sending to the SPEs those pieces of code to which the SIMD paradigm can be applied in order to speed up calculations [9]. A cut-down version of this processor has been successfully applied for commercial purposes in Sony PlayStation 3 consoles, and its scientific version was part of the Roadrunner computer, which in 2008 exceeded the petaflops performance barrier [10]. The PowerXCell 8i processor was a very big step in the development of hardware architecture and, despite the discontinuation of its production, it has become a base used by other manufacturers for the development of hardware for high performance computing. At the same time Intel was working on its line of graphics cards codenamed Larrabee, trying to eliminate the main disadvantage of programming the CellBE or GPU architectures, which is the complicated programming model. The main features of this architecture were the use of very wide (512-bit) vector units, texture units taken from GPUs, a coherent memory hierarchy and compatibility with the x86 architecture [11]. On the basis of this project the Intel Many Integrated Core (MIC) architecture was developed, which was successfully applied in Intel Xeon Phi coprocessors [12]. These coprocessors are sold as PCI Express cards (Fig. 1) and are equipped with their own operating system based on Linux and, depending on the version, 57 to 61 cores with hardware multithreading support (4 threads per core).
For testing purposes, we used the Xeon Phi 5110P coprocessor equipped with 8 GB of RAM and 60 cores with a clock speed of 1 GHz. However, in order to function properly, a single coprocessor core and 2 GB of memory are reserved for the internal operating system, which leaves 236 available threads and 6 GB of memory for performing the calculations [14]. The design of the MIC architecture cores is based mainly on the Pentium architecture, but it is enhanced with 512-bit vector units. The x86 compatible architecture theoretically allows for easy transfer of existing code to the coprocessor with a significant increase in performance. Fig. 2 shows the internal structure of a single coprocessor core.
Every core is connected to an ultra fast internal interconnect and, thanks to coherent cache memory, data are exchanged between cores almost immediately (Fig. 3).
IV. OPENCL PROGRAMMING MODEL
OpenCL is a software development platform that supports many kinds of available hardware, from standard CPUs, through hybrid architectures, to GPUs [16]. In recent years this platform has gained popularity due to its portability and similarity to the previously used CUDA programming model developed by Nvidia [17]. OpenCL code is compiled and run for a given platform, which represents the environment for code execution. Each platform is equipped with sets of devices of three types: CPU, GPU or Accelerator. For one host system there can be many platforms installed, varying in the vendor and the supported devices. The host system runs standard code and manages the execution of OpenCL code on the devices. OpenCL code is called a kernel and is written in a slightly modified C language, with special extensions to manage different types of devices. Each device in the platform is composed of compute units, which are further divided into processing elements. Individual threads run on processing elements, with capabilities depending on the architecture of the device. In OpenCL nomenclature all threads are called work items and they are grouped into work groups. This allows for direct hardware mapping for different architectures. The OpenCL programming model is shown in Fig. 4.

Threads within a single work group execute concurrently and can be synchronized using fast system calls. Moreover, they can share some of their data in fast shared memory, called local memory in OpenCL nomenclature. Different work groups are scheduled independently and have their own resources.

The OpenCL execution model specifies a set of events that have to occur in order to run a kernel. The first phase includes the initialization of the OpenCL platform and data structures and checking the available devices. Then the kernel code has to be prepared (read or compiled) for running on the devices. Because arguments are passed to kernels, the space for them has to be allocated on the device before executing a kernel. The host is also responsible for preparing and allocating the space needed for variables and arrays in the different types of memory that are explicitly available to programmers. All memory transactions are performed by sending a request to the OpenCL management layer. The requests are then realized asynchronously with respect to the host code. The same strategy is used to request kernel execution or the transfer of data back from device memory to host memory [18].

In OpenCL the programmer has several types of memory regions explicitly available. Each memory object can be created in the OpenCL memory model with different mappings to hardware resources. Individual variables defined inside a kernel belong to private memory. Each thread has its own copy of such a variable, and they can be stored in scalar or vector registers. Other memory regions can be assigned through specific qualifiers. Typical memory regions are divided into three types: global, constant and local. Global memory stores variables that are visible to all threads executing the kernel. Constant memory is also available to all threads but is accessible only for reading. Variables stored in fast local memory are shared by the threads of a single work group. To support the portability of the created code, OpenCL contains procedures that allow adapting it to different platforms and devices [19]. The code can query the environment to get information about the available resources.
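To make the execution model described above more concrete, a minimal sketch of the host side flow is given below; the placeholder kernel, buffer size and lack of error checking are simplifications for illustration, not the actual ModFEM host code.

```c
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

int main(void)
{
    cl_int err;

    /* 1. select a platform and a device (here: an accelerator, e.g. Xeon Phi) */
    cl_platform_id platform;
    cl_device_id   device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, NULL);

    /* 2. create a context and a command queue for the device */
    cl_context       context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue queue   = clCreateCommandQueue(context, device, 0, &err);

    /* 3. build the kernel from source and create the kernel object */
    const char *source = "__kernel void integrate(__global float *out) "
                         "{ out[get_global_id(0)] = 0.0f; }"; /* placeholder kernel */
    cl_program program = clCreateProgramWithSource(context, 1, &source, NULL, &err);
    clBuildProgram(program, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel   = clCreateKernel(program, "integrate", &err);

    /* 4. allocate a device buffer and set the kernel argument */
    size_t n = 1024;
    cl_mem out_buf = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                                    n * sizeof(float), NULL, &err);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &out_buf);

    /* 5. enqueue the kernel asynchronously and read the results back */
    size_t global_size = n, local_size = 16;  /* 16 is the size suggested for Xeon Phi */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, &local_size,
                           0, NULL, NULL);
    float *result = malloc(n * sizeof(float));
    clEnqueueReadBuffer(queue, out_buf, CL_TRUE, 0, n * sizeof(float),
                        result, 0, NULL, NULL);
    clFinish(queue);

    printf("first result: %f\n", result[0]);
    free(result);
    return 0;
}
```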
For our case we compare the available resources of all three types of devices: CPU, GPU and Accelerator. The results are presented in Table III. As we can notice, the CPU and the Xeon Phi card share the same amounts of local and constant memory, which indicates the common origins of these architectures. The Tesla K20m card used for testing our GPU implementations of numerical integration has a bigger local memory size but fewer compute units. Hence, one could conclude that it should be slower than the other devices, but the OpenCL hardware layer does not provide information on the deeper division of compute units into processing elements. OpenCL treats each core of the CPU and of the Intel MIC architecture as a single compute unit, but it ignores the CUDA or stream cores of GPUs. Our reference Tesla K20m card is equipped with 13 compute units of the Kepler architecture [20], which means that there is a massive number of 192 processing elements per compute unit, giving in total 2496 cores [21]. Despite that, all three architectures are treated as direct competitors in the domain of high performance computing. This happens because each of these architectures has its own unique characteristics that allow for a direct comparison between them (e.g. architecture of cores, clock speed, vector registers etc.). This also means that, in order to fully exploit the possibilities of the hardware, all existing algorithms should be adapted separately to each of the architectures. The task of numerical integration thus becomes non-trivial and therefore very interesting from the performance point of view.
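The comparison above relies on information that every OpenCL implementation has to report; a small sketch of such a query (device selection simplified to the first device of the first platform) is shown below.

```c
/* Query the basic resources reported by OpenCL: compute units,
   local and constant memory sizes, and global memory size. */
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id   device;
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, NULL);

    cl_uint  compute_units;
    cl_ulong local_mem, const_mem, global_mem;
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                    sizeof(compute_units), &compute_units, NULL);
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                    sizeof(local_mem), &local_mem, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE,
                    sizeof(const_mem), &const_mem, NULL);
    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                    sizeof(global_mem), &global_mem, NULL);

    printf("compute units:   %u\n",   compute_units);
    printf("local memory:    %llu B\n", (unsigned long long)local_mem);
    printf("constant buffer: %llu B\n", (unsigned long long)const_mem);
    printf("global memory:   %llu B\n", (unsigned long long)global_mem);
    return 0;
}
```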
V. NUMERICAL INTEGRATION ON INTEL XEON PHI
For our tests we use the ModFEM code - a computational framework developed for solving various scientific and engineering problems by the adaptive finite element method [22]. Due to its modular structure it allows testing different levels of the FEM. Therefore, we can easily separate the numerical integration algorithm and create parallel versions for different architectures.
For the numerical integration algorithm we have several levels of parallelism available. The chosen way of parallelization depends on the size of the data and the number of calculations in the solved problem. Therefore, we can choose which loop from algorithm 1 should be divided. At the first level we can parallelize the outermost loop over elements. Then, we can divide the loop over integration points and, subsequently, the two inner loops over shape functions. In our previous works [5], [6] we tested several strategies for higher order finite elements. Because of the quite big sizes of the computed matrices in the problems described there, we tested the division of the inner loops over shape functions, and also of the loop over Gaussian integration points. In the current case we decided to test the standard approximation module. Therefore, our stiffness matrices and load vectors are quite small, as can be seen in Tables I and II. Hence, the most natural way of parallelization, dividing the loop over elements, is selected.

As mentioned above, we decided to test two cases for our implementation - the small Laplace and the big convection-diffusion problem. Moreover, we tested these problems with the use of double precision and single precision variables, to check the differences between DP and SP hardware units. For our tests on graphics cards we tried two versions of the kernels - one with the stiffness matrix and load vector stored in registers and the second with the matrices stored in shared memory. In this article we refer to them using the acronyms REG_ONE_EL for the register version and SHM_ONE_EL for the shared memory version. Both versions have their own advantages and disadvantages. The first one allows for using very fast registers and saves local (shared) memory for other data, but with the limited number of registers available it can easily cause register spilling and therefore lose the efficiency of the algorithm. The second version saves the fast registers, but it uses the slower shared memory. Our ONE_EL versions assume that a whole element is computed by one work-group, although one work-group can (and should) of course compute more than one element. At the first stage, the host code has to compute all necessary sizes of data and thus all needed divisions of the loops. For our reference platform we use a system equipped with an NVIDIA Tesla K20m GPU, whose parameters are presented in Table III.

The main difference that we must take into account during the transformation of the GPU algorithm into the Xeon Phi implementation is the size of the warp/wavefront. This size (equal to 32 for NVIDIA or 64 for AMD) indicates the minimal work-group size that should be used on a given device. Due to the hardware division of every compute unit of the Tesla GPU, we must also provide a proper (high enough) compute unit occupancy. According to [23], Intel Xeon Phi fully utilizes its vector registers when the work-group size is set to 16. This allows for the most effective automatic vectorization, which can fully use the advantages of the very wide vector registers to store variables and perform vector computations on the hardware units. Another difference lies in the use of shared memory, because all OpenCL memory levels are mapped onto Xeon Phi global memory. Hence, the use of shared and constant memory should be minimized and all possible data should be declared locally to allow proper vectorization.
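Before the kernels are launched, the host code therefore has to compute the division of elements between work-groups and threads. A minimal sketch of this computation is shown below; the example numbers correspond to the Tesla case discussed in the next paragraph, and the choice of the number of work-groups is device dependent.

```c
/* Host side work distribution sketch: pad the element count so that every
   thread processes the same number of elements. */
#include <stdio.h>

int main(void)
{
    const long n_elements      = 782336;  /* prismatic elements in the mesh       */
    const long work_group_size = 64;      /* e.g. 16 for Xeon Phi, 64 for Tesla   */
    const long n_work_groups   = 832;     /* chosen to occupy the compute units   */

    const long n_threads        = n_work_groups * work_group_size;
    /* ceiling division: elements assigned to a single thread */
    const long elems_per_thread = (n_elements + n_threads - 1) / n_threads;
    const long elems_per_wg     = elems_per_thread * work_group_size;
    const long padded_elements  = elems_per_wg * n_work_groups;

    printf("threads: %ld, elements per thread: %ld, per work-group: %ld, "
           "padded elements: %ld\n",
           n_threads, elems_per_thread, elems_per_wg, padded_elements);
    /* for the values above this prints: threads: 53248, elements per thread: 15,
       per work-group: 960, padded elements: 798720 */
    return 0;
}
```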
Of course, in the case of such a complicated algorithm there is no possibility to fit all the data in registers, so we must find a proper way of preparing and storing the data. For these reasons, in contrast to the GPU SHM and REG versions that assume only stiffness matrix allocation, we have considered more complex options for Xeon Phi. For our tests we use a computational domain with 782336 prismatic elements. Because of the minimal work-group size that should be used for a given architecture, this means that we have to compute data for 785408 elements on Xeon Phi and 798720 on the GPU, which in the latter case is 16384 elements more than our computational domain size. While this amount seems to be very large, it is in fact only 2% more calculations and it is absolutely necessary for proper mapping to the hardware. Due to the fact that one work-group has to compute 64 elements at once, we must divide the number of elements per compute unit by this size, so we obtain 832 work-groups, each working on 960 elements. For our Xeon Phi accelerator we accordingly have 236 work-groups with 208 elements to compute. Therefore, for the GPU we have a total number of 53248 threads, while for Xeon Phi there are only 3776 threads. All precomputed values needed for the calculations are shown in Table IV. After calculating all the necessary divisions, the space needed for the calculations is computed and the data preparation phase begins. At this stage all needed buffers on the kernel side are prepared and the necessary data are computed. For our algorithm we need the following data:
- execution parameters - all values computed earlier on the host side that may be necessary for our computations, e.g. the number of elements per kernel and per work-group. These data can be stored in constant memory because we do not need to change them. In the Xeon Phi case, where constant memory is a part of global memory, we can assume direct reads from global memory.
- Gauss points data - all necessary data of the Gaussian integration points - their coordinates and associated weights - can also be stored in constant memory, or read from global memory in the Xeon Phi case.
- Values of the shape functions and their derivatives on the reference element - needed for all Jacobian calculations and handled in the same way as the previous data.
- Geometric data (coordinates) for all elements - stored in the global memory of the device. Here we can consider several different options - we can copy the data to local memory for each element separately (the main method for the REG version), copy them in a coalesced way for all elements in a work-group (the main method for the SHM version) or use them directly from global memory (Xeon Phi).
- Problem dependent coefficients - sent to global memory for all elements. Here we can repeat the methods used for the geometric data above, but for Xeon Phi we also decided to copy them directly to registers to speed up the calculations.

After preparing the data above we can start our computations. Firstly, if we are using shared and constant memory, we must read all necessary execution parameters, Gauss data and values of the reference shape functions. At this stage, for the SHM version, we have to declare local arrays for the stiffness matrix and load vector. After this preparation we enter the outer loop over the elements processed by a thread. According to Table IV, for Xeon Phi it is 208 elements per work-group of size 16, which indicates that each thread has to compute 13 elements, while for Tesla it is 960 elements per work-group of size 64, which results in 15 elements per thread. Inside this loop we read all geometric and coefficient data for one element. As mentioned above, for the SHM version we can organize these data for so-called coalesced access, which theoretically enables a higher performance of data transfer by allowing all threads within one work-group to read their data simultaneously. For Xeon Phi we can use global memory directly. Because of the use of local memory on the GPU, after reading these data we need to establish a synchronization point with the use of a barrier, which can slow down the flow of calculations a little, in contrast to Xeon Phi and its direct global memory access. The next step includes defining (for the REG and PHI versions) and zeroing the local stiffness matrix and load vector. Afterwards, we enter the loop over Gauss points, where we have to compute the Jacobian transformation matrix and its inverse on the basis of the previously obtained Gauss and geometric data. After these calculations we enter the innermost loops over the shape functions. After computing the values of the shape functions and their derivatives for the real element, based on their values for the reference element and the previously computed Jacobian matrix, we can compute the final entries of the stiffness matrix and load vector according to algorithm 1. For the SHM version we need to compute the right offset for storing the computed matrix in local memory. After the computations for all Gauss points we can send the data to the device global memory. After all computations, the stored data are read back to the host system memory, where they can be checked and used for further FEM computations. The amount of data sent to and received from the device global memory is shown in Table V.

The time of kernel execution (Fig. 6) shows more differences depending on the tested problem and the algorithm used. The version with the use of shared memory turns out to be non-optimal in our cases, but Xeon Phi is much faster than the Tesla card. What is more interesting, we can see that the REG version, with the stiffness matrix stored in registers, is slightly faster than the PHI version for the small Laplace problem. All this advantage is lost when we use the more complicated convection-diffusion problem. This may indicate that Xeon Phi needs a quite big amount of data to fully utilize its vector registers and take advantage of them. Unfortunately, all the gained performance is lost when copying the output data back from the accelerator to the host memory (Fig. 7). As we can see, on Xeon Phi the organization of the global memory has no impact on the obtained results, in contrast to the Tesla card.
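A strongly simplified sketch of such a kernel, closest to the REG/PHI variants, is shown below; the argument list, the data layout and the assumed helper compute_jacobian_terms() are illustrative and do not reproduce the actual ModFEM kernels.

```c
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

/* One-element-per-thread integration kernel sketch for the convection-diffusion
   case: 6 shape functions, 6 Gauss points, 20 constant coefficients per element
   (16 for the stiffness matrix and 4 for the load vector). */
__kernel void integrate_one_el(
    const int elems_per_thread,
    __global const double *gauss_data,  /* coordinates and weights           */
    __global const double *shape_ref,   /* reference shape function data     */
    __global const double *geo_data,    /* 18 coordinates per element        */
    __global const double *coeff,       /* 20 coefficients per element       */
    __global double *stiff_mat,         /* output: 36 entries per element    */
    __global double *load_vec)          /* output: 6 entries per element     */
{
    const int tid = get_global_id(0);

    for (int ie = 0; ie < elems_per_thread; ie++) {
        const int elem = tid * elems_per_thread + ie;

        double A[6][6] = {{0.0}};
        double b[6]    = {0.0};
        double geo[18];
        for (int k = 0; k < 18; k++) geo[k] = geo_data[elem * 18 + k];
        __global const double *c = &coeff[elem * 20];

        for (int Q = 0; Q < 6; Q++) {
            double dphi[6][4];   /* value and 3 real derivatives per function */
            /* Jacobian of x(xi), its inverse, det(J)*w_Q and the transformed
               derivatives - an assumed helper, omitted for brevity */
            double detJ_wQ = compute_jacobian_terms(gauss_data, shape_ref,
                                                    geo, Q, dphi);

            for (int i = 0; i < 6; i++) {
                for (int iD = 0; iD < 4; iD++)
                    b[i] += c[16 + iD] * dphi[i][iD] * detJ_wQ;
                for (int j = 0; j < 6; j++)
                    for (int iD = 0; iD < 4; iD++)
                        for (int jD = 0; jD < 4; jD++)
                            A[i][j] += c[iD * 4 + jD]
                                     * dphi[i][iD] * dphi[j][jD] * detJ_wQ;
            }
        }

        /* write the element stiffness matrix and load vector to global memory */
        for (int i = 0; i < 6; i++) {
            load_vec[elem * 6 + i] = b[i];
            for (int j = 0; j < 6; j++)
                stiff_mat[elem * 36 + i * 6 + j] = A[i][j];
        }
    }
}
```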
Table VIII shows the obtained results in Gflops. Based on that, we can see that our algorithm reaches almost 15% of the theoretical peak for both double and single precision, according to [24]. This leads us to the conclusion that there is a certain margin of performance that can be used for further optimization. As we have shown in this article, Intel Xeon Phi can be an efficient and easy to use hardware platform for finite element calculations. However, it requires quite big changes to the existing GPU codes. In our further work we will try to manually vectorize the calculations and change the data retrieval algorithm to make it more efficient. The first method will allow us to compare with the automatic vectorization performed by the compiler and to check whether it fully utilizes the very wide 512-bit vector registers. The second method should allow us to catch up with the Tesla GPU in the speed of data transfer and will enable a full comparison of the competing architectures.
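As an illustration of the direction of this planned manual vectorization, OpenCL provides vector types that map directly onto the 512-bit registers of the coprocessor; a trivial sketch of their use (unrelated to the actual integration kernels) is shown below.

```c
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

/* Explicit use of the double8 type (8 x 64 bit = 512 bit) instead of relying
   on the implicit vectorization performed by the compiler. */
__kernel void scale_add_vectorized(const double alpha,
                                   __global const double8 *x,
                                   __global double8 *y,
                                   const int n8)   /* number of double8 chunks */
{
    for (int i = get_global_id(0); i < n8; i += get_global_size(0))
        y[i] = alpha * x[i] + y[i];    /* each operation works on 8 doubles */
}
```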
