Code structure overview
Over the last few years the complexity of architectures and algorithms for lattice gauge theory simulations has got higher and higher at an unprecedented fast pace. The legacy FORTRAN code used at KEK since the acquisition of the new generation of machines (IBM Blue Gene/Q and Hitachi SR16K) could not deal anymore with such level of ramification. Maintainability and support for an heterogeneous set of platforms were hard to ensure. Before the arrival of the new generation of machines the JLQCD group decided to rewrite from scratch a code for lattice gauge theory simulations using the already available codes in the market as a source of inspiration for best practices.
The essential design targets are:
• a complete abstraction of physical concepts from the actual implementation;
• a user-level that does not need a single line of programming to change the code behavior;
• maintainability of many code paths and sub-cases at the developer-level;
• support for multiple platforms and optimization on some specific architectures.
The object orientation paradigm fits nicely such a complexity and, although creating initial performance issues, the C++ language was a natural choice for its flexibility. We could achieve the desired functionality using the design patterns programming paradigm [1] influenced by the Chroma code-set [2] , but keeping code complexity at a manageable level.
The structure is layered. The communication level (could be single core, MPI, BGNET) is the lowest one. Several basic concepts are defined, like a Field (just a container) and Formats (that define the content of a Field), up to the definitions of Dirac operators, Actions, Solvers, Trajectories, and so on, as abstract concepts. Some libraries come with an architecture dependent optimized implementation, easily extendable to future designs.
The code is now mature, it is actually used intensively for configuration generation and analysis in our current N f = 2 + 1 Domain-Wall fermion runs, and it has been tested on various platforms including UNIX PC (single/multi-core), Mac, Blue Gene/Q, Hitachi SR16K.
The performance optimization efforts were concentrated so far on the IBM Blue Gene/Q architecture, the biggest installation at KEK in terms of computational power. Special libraries were developed by IBM Japan to speed-up communications, the Dirac operator kernels and the linear algebra [3] , see section 2 for more details. On top of that we implemented a full support for the highly optimized kernels of BFM [4] to further increase performance, see section 3. An improvement on the performances on the Hitachi SR16K (IBM Power7 architecture) is currently in progress.
The list of features of the current (October 2013) implementation includes:
• • Smearing (APE, Stout analytic [7] ), HMC runs with smeared fermionic actions
• Random Number Generators (Mersenne Twister [8] , Dynamic Creation Mersenne Twister [8] )
• Full I/O support for plain ASCII and plain binary, just reading support for ILDG, NERSC, MILC, JLQCD-legacy type of configurations
• User-level XML control of program behavior IroIro++ is currently under active development and the list of supported features is changing month by month. Several things are still to be completed before a full public release. The code still lacks an organized and robust test suite that is necessary to keep track efficiently of regression bugs. This is a point to be addressed in the near future. Unfortunately documentation is still limited to the Doxygen files and some basic information on compilation and running simple jobs 1 .
In the following sections of this report we discuss the performance optimization related to the BG/Q architecture, first reviewing the libraries by IBM Japan and then the BFM integration. In the final section we present some performance figures comparing several setups.
Blue Gene/Q and IBM libraries
The Blue Gene/Q is the third in line of the IBM Blue Gene series. It was designed for high scalability and excellent performance per watt which makes it currently one of the best choices for HPC. Every board contains 32 chips running at 1.6 GHz and communicating through a metal plane. The nodes communicate through each other through a 5d network of optical cables. Every chip has 16+1+1 PowerPC A2 processor cores and applications can use 16 of them. Each processor supports 4-way Simultaneous Multi-Threading (SMT) for up to 64 concurring threads, and has also a 4-way SIMD FPU that implements fused multiply-add operations (FMA), thus giving a theoretical peak performance of 204.8 GFlops. Table 1 summarizes the technical details and also the hierarchical memory structure, emphasizing that a network bandwidth close to the DDR memory bandwidth guarantees the scalability up to sizes of 120 racks (more than 2 million cores, at LLNL). 
Communication and threading: the BGNET and the BGQThreads libraries
IBM Japan developed a series of libraries (BGNET, BGWilson, BGQThreads) to exploit the computational capabilities of BG/Q. The BGNET library is a low latency, light-weight API with smaller overhead for small data packets than MPI. It is programmed using the System Programming Interface (SPI) APIs by IBM and it uses the remote DMA (rDMA) capabilities of communication between compute nodes on BG/Q. It supports asynchronous send and receive, small overhead barriers, global sum and reduce, broadcasts and more. IroIro++ fully supports this communication layer, as well as plain MPI, which is slower. The BGQThreads library is an SMP threading library that has lower latencies for barriers than the standard openMP functions.
Kernels
The kernels written by J. Doi [3] use an array of structures memory pattern for the field storage which has several advantages for cache memory access although it is not portable. BGWilson is a library for the Wilson Dirac matrix-vector multiplication (in chiral and Dirac representations) that is written in C using intrinsics for SIMD data loading and processing, optimization of the pipeline by assembly code, threading to hide latencies, concurrent data transfer and calculations and even-odd preconditioning. The library is extended to include full CG solvers for the Dirac operator. All the 5D Domain Wall kernels and solvers are optimized using functions from this library, minimizing the creations of temporary objects that typically affects performance on C++ codes. The measured performances for this library are about 65 GFlops per node for BiCG Wilson solver (see figure 1 for a weak scaling test) and 25 GFlops per node for 5D Domain Wall fermion solver (in double precision). Since using a specialized solver breaks the rule of having distinct objects for different concepts (the Dirac operator and the solvers) we had to create special solvers classes that are used when the improved libraries are called. This translates at user-level in a simple change in the name of the fermionic action.
Besides the optimization of the Dirac kernels all the linear algebra for matrix-matrix, matrixvector, vector-vector operations are threaded and use SIMD functionalities. They can be called from within an openMP parallel section dividing the workload in chunks to be processed by each single thread. In the current version of the code all the computational steps of the (R)HMC code use the hybrid MPI-openMP threading model, using all the level of parallelism that BG/Q provides (SIMD, in core-threading, inter-core communications). In relation to performance we examine the balance between the number of MPI processes and the openMP threads in section 4. 
BFM: Bagel Fermion Matrix library integration
Integrating the Bagel Fermion sparse-Matrix (BFM) library by P. Boyle [4] led to another boost in the code efficiency. The Bagel package is a domain specific compiler targeted to several architectures that have been optimized for all the BlueGene series microprocessors (L, P, Q). It generates assembler code for the core routines of a QCD calculation starting from a table of mappings from abstract primitive operations to the actual architecture dependent instructions. Then the code performs efficient prefetching of data streams and rescheduling of the issued instructions. The low level assembly code is then used by the BFM package that provides a useful interface for complex code-sets like IroIro++ or Chroma. BFM contains utilities for memory management to allocate, import/export into the BFM memory storage pattern the fermion and gauge fields. Thus before using any function in BFM we wrote an interface layer that translates the conventions in the memory storage and fermions representation between IroIro++ and BFM (the current overhead for this step is about 1%). As the BGWilson library, BFM contains already optimized CG solvers for Wilson and 5D Domain-Wall fermions. For communications, it relies on external libraries, in this case the BGNET. The scheme of dependencies of the IroIro++ code on the BFM library is depicted in figure 2 . 
Performances and conclusions
In this section we report the current status regarding performances of IroIro++ in generating configurations using the (R)HMC algorithm by changing the balance between the number of threads and the number of MPI processes on BG/Q. We show in table 2 and figure 3 the results of a set of tests: two flavors of Domain Wall fermions (Scaled-Shamir kernel) keeping the global lattice size constant to 16 3 × 32 and running on the smaller partition, 32 nodes. The product of MPI processes times the number of threads is kept constant to 64. In the last five columns we report the breakdown of the timings relative to the single sections of the HMC algorithm. The Gflops/node figure includes everything, code initialization included, so it is underestimated for such small number of trajectories. The parameters have not been chosen to get the maximum sustained performance that is around 30 GFlops/s per node. Table 2 : Current running time (sec) and performance (GFlops per node) in a run using BFM on a 32 node partition, no fiber optics communications, using node maps. Global lattice size 16 3 × 32, the mass am = 0.009, the fifth dimension L s = 12. The 4th column show the time spent in CG relative to the whole HMC trajectory. UpdateU stands for the exponentiation part timing in seconds, CalcH for the calculation of the action at the beginning and at the end of the trajectory, Init for the random numbers and gauge initialization and P-# are the force calculation timings for the fermionic and the gauge part respectively. Summing-up all these contribution gives the total running time, that is for two trajectories with τ = 1 and 8 steps of molecular dynamics. Gauge is integrated in a 4 times finer interval.
In table 3 a more realistic RHMC run on a 2048 nodes partitions is presented. The lattice is now 32 3 ×64, simulating 2+1 flavors (m u.d = 0.007, m s = 0.04) with a Hasenbusch preconditioning step (am = 0.04). The force terms are now labeling the preconditioner, the strange quark, the light quarks, and the gauge part respectively. The performance here suffer from the communications and synchronization among many cores and the incomplete threading of some parts of the code: the best overall is ∼15 GFlops/node (that drops to 13 GFlops/node if I/O is taken into account), while the CG alone is over 20 GFlops/node. The sweetest point for CG is always when the number of threads is maximum, so we can avoid some of the expensive MPI communications and barriers. Most of the code is correctly threaded, showing almost no dependence, or a mild one, on the number of threads. The missing part is the force calculation that has a severe impact on the final performance, especially when the rational approximation of the force terms is used for the strange quark (P-1 in table 3 ). This improvement is currently under development at the time of writing.
In conclusion, we presented our efforts in writing an efficient but easily extendable code for lattice simulations on big scale supercomputers. The code needs few more optimization to reach the optimal performance for every case (the target being the BFM CG efficiency). The IroIro++ code is now almost ready for being published online, as soon as sufficient documentation is written to support new users.
