We present a finite difference numerical algorithm for solving two dimensional spatially homogeneous Boltzmann transport equation which describes electron transport in a semiconductor superlattice subject to crossed time dependent electric and constant magnetic fields. The algorithm is implemented both in C language targeted to CPU and in CUDA C language targeted to commodity NVidia GPU. We compare performances and merits of one implementation versus another and discuss various software optimization techniques.
Introduction
Numerical solutions of Boltzmann Transport Equation (BTE) are of utmost importance in the modern physics, especially in the research areas of fluid dynamics and semi-classical description of quantummechanical systems. BTE is often solved using Monte-Carlo method [1] . The other approach is Lattice Boltzmann Method, it is more recent and very promising [2] . Due to its numerical stability and explicit nature, Lattice Boltzmann Method lends itself quite well to the parallel implementations on Graphical Processing Units (GPU) [3, 4, 5, 6] . Finite Difference Method (FDM) is the simplest approach to the solution of BTE. However, to attain desirable numerical stability it often requires fully implicit formulation. Recently, a number of new advanced FDMs were developed. In [7] a variant of FDM is combined with Monte-Carlo method. Fully functional solver for PMOSFET devices, which among other things can utilize FDM for solving BTE, was presented in [8] . Numerical method for spatially non-homogeneous 1D BTE in application to semiconductor superlattices was recently considered in [9] .
In this work, we present a FDM method for solving two-dimensional BTE that describes a semiconductor superlattice (SL) subject to a time dependant electric field along the superlattice axis and a constant perpendicular magnetic field. SL and the configuration of applied fields are sketched in Fig.1 . This problem was earlier considered only in the limiting case of zero temperature [10] , while our numerical scheme and software that implements it, are used to analyse electron dynamics in SL at arbitrary temperatures. Superlattices are artificial periodic structures with periods not found in natural solids [12] . This relatively large period of SL results in a number of its unique physical features, which are interesting not only from the viewpoint of fundamental properties of solids, but also as tools in a realization of promising applications, including the generation and detection of terahertz radiation. Good overview of SL theory and basic experiments can be found in [11] .
Our numerical method combines Crank-Nicolson [13] and leap-frog algorithms. Leap-from algorithm is a variant of symplectic integrators, which are known to preserve area in the phase space and are unconditionally stable [14, 15] . We develop several implementations of our algorithm. One implementation uses C programming language and is targeted for CPU. Other implementations are written in CUDA and are targeted at NVidia GPU. Compute Unified Device Architecture, also known is CUDA, is parallel computing B E(t) x y z Figure 1 : Sketch of a superlattice under the action of time dependent electric E(t) and constant magnetic B fields. The electric field is directed along the superlattice axis (x-axis) and the perpendicular magnetic field is aligned along z-axis. Electron motion is considered in (x, y)-plane.
platform and C/C++ language extension for NVidia video cards. Different CUDA implementations of our numerical method primarily highlights various differences in memory access patterns, which are the most common bottlenecks for software running on video cards. We verified correctness of the method and its implementation by comparing our results of simulations with the available analytical results in the limiting case of zero temperature.
Physical model
Boltzmann equation governs a time evolution of the electron probability density function f (t, k, r). For our system it has the following form
where k is the crystal momentum, v(k) is the electron velocity and ε(k) is the energy dispersion relation for the lowest SL miniband [11] . We make several simplifications. Firstly, we assume that f is spatially homogeneous and put ∂f /∂r = 0. Secondly, the collision integral (∂f /∂t) st is taken in the most simplest form (f 0 − f )/τ , where f 0 is the equilibrium distribution function and τ is the constant relaxation time. We also limited our consideration to electron transport in a single miniband, which we describe within a tight binding approximation
where ∆ is the width of the miniband and m is the effective electron mass along SL layers. To make this system of equations (1) (2) and (3) dimensionless we make following substitutions.
And in view of geometry of our system ( Fig. 1 ) we obtain the following form of Boltzmann equation in dimensionless variables ∂f ∂t
The probability density function f (t, φ x , φ y ) can formally extend indefinitely along the y-axis, but practically this extention is always effectively limited by a relaxation to the equilibrium distribution f 0 (φ x , φ y ). In SL f (t, φ x , φ y ) is periodic along the x-axis with the period 2π. Therefore, φ x can be considered only within the first Brillouin zone defined from −π to π. The periodicity allows us to represent both f and f 0 in the form of Fourier series
where the Fourier coefficients a
n , a n and b n are functions of φ y and the last two are also functions of time. In the Fourier representation, Boltzmann equation (5) takes the form of an infinite set of the odinary differential equations
(10)
The equilibrium probability density function is assumed to be a temperature-dependent Boltzmann distribution, which with all normalization constants takes the form
Where I 0 (µ) is modified Bessel function of zeroes order. We can use this specific form of f 0 to find Fourier coefficients a
where I n (µ) are the modified Bessel functions of order n. Equations (8) and (9) do not preclude time dependency of both the electric E and magnetic B fields. However, keeping in line with the existing research in this field [10] , here we consider a constant B and the electric field that is a sum of a constant E dc and monochromatic ac E ω cos(ωt) components. Thus the total electric field acting on electrons in SL is E = E dc + E ω cos(ωt). We were most interested in property of absorption of ac component of electric field. When negative it indicates ac field amplification, paving the way to a lasing medium. We define absorption as a following expression time averaged over period 2π/ω.
where v dr is electron drift velocity along the x-axis.
∂ε ∂p x f (p x , p y )dp x dp y (17)
which in view of Fourier expansion of f (7) takes form
Independently of numerical method, norm of electron density function f has to always be equals to one, which in our coordinates takes form
Eq. (21) is later used as one of the tests of accuracy of our numerical method.
Numerical algorithm
Naive application of method of finite difference to (8) and (9) leads to either unstable and/or computationally intensive numerical system. To combat this problem we are using several methods at once. First, we discretize a n and b n along time and φ y axes.
and n is "harmonic number". This forms infinite two-dimensional grid. To be computable we limit it to n ∈ [0, . . . , N ] and m ∈ [0, . . . , M ], with following boundary conditions.
Both upper limits N and M have to be adjusted manually depending on strength of electric and magnetic fields and inverse temperature µ, which smears distribution function f in the phase space. Along the y-axis φ y is discretized with step ∆φ and it becomes function of lattice number m. We write two forms of equations (8) and (9) . One using forward differences and one using partial backward differences, i.e. on the right side of equal sign we are going to write partial derivatives at time t while everything else at time t + 1 and will follow standard procedure of Crank-Nicolson scheme by adding these two, forward and backward differences equations. First two equations (25) and (26) below are written in forward differencing scheme and last two (27) and (28) in backward differencing scheme
where
And application of Crank-Nicolson scheme [13] leads to
Equation (31, 32) can be formally written in the form
Where T is an operator that allows us to step from time step t to t + 1, separated by time interval ∆t. Using this operation as is leads to only conditionally stable numerical system, because A t n,m and B t n,m are taken at time t, i.e. partially this is still simple forward difference scheme. To combat this problem we introduce two staggered grids {z 0 , z 1 , ...} and {z 1/2 , z 3/2 , ...}, which we call whole and fractional one respectively. We then −π π φ x φ y Figure 2 : Example of result of numerical simulation showing transient response of electron probability density function (PDF) in the phase space (φx, φy) in response to applied electric and magnetic fields. Here electric field is applied alone the φx axis and magnetic field perpendicular to the plane of the plot. Initially PDF is concentrated around center of the plot according to Boltzmann distribution function (12) .
use leap frog method where to calculate z t+1 using (39) we use A t+1/2 and B t+1/2 computed on fractional grid. Similar operation is performed for a step from t + 1/2 to t + 3/2. Thus one step from t to t + 1 becomes two steps. This algorithm has to be started first by computing values of z 1/2 using (39) with time step ∆t/2. In Fig.2 you can see example of computing of probability density function using our algorithm. This figure shows heat-map plot of transient response of the system to applied electric and magnetic fields according to geometry as shown in Fig. 1 .
CUDA and GPU computing overview
Modern GPU differ from CPU in that they have thousands ALUs 1 at the expense of control hardware and large implicit caches of CPUs. In NVidia video cards these ALUs are known as "CUDA cores". They are grouped into rows and rows into larger units with control hardware and explicit caches. These units are known as Streaming Multiprocessors (SMX). In turn a single card often contains dozen of SMX units. Abundance of ALUs makes for a need of dedicated memory and wide memory bus directly on GPU. For example in GTX680 memory bandwidth is approximately 192GB/sec., while Intel Core i7 CPU with sandy bridge architecture provides only 37GB/sec of aggregate bandwidth. In general all of the ALUs in a video card can be executed in parallel, although in actuality their execution in SMXs is scheduled in groups of 32 threads known as warps. This very large degree of parallelism commonly leads to saturation of memory Figure 3 : Simplified logical scheme of host computer and GPU that highlights differences between two. Notable is abundance of ALUs in modern day GPU, which reaches into thousands and fast wide memory bus comparing to slow memory bus on the host computer.
GPU SMX
bus between on-board GPU memory and SMXs, which means that while programming for GPU significant speed enhancements can be made by optimising memory access patterns and using explicitly available caches [4, 16, 17] . Misaligned and uncoalesced memory access is much slower than indicated by maximum available bandwidth. Such access patterns are common problem points in CUDA programs. Thus in general CUDA software should try to minimize writing and reading to and from memory. It is common to refer to main computer as host and installed GPU as device. Simplified logical layout of host and GPU can be seen in Fig.  3 . Generally speaking GPU can be thought of as an explicitly programmable co-processor. White paper describing latest Kepler architecture of NVidia GPU can be found in [18] . CUDA is general purpose computing environment and extension to C and C++ languages developed by NVidia. It extends physical abstractions of GPU briefly described above and presents coherent API 2 for developing general purpose software [19] . Basic introduction to CUDA programming can be found in [20] and much more comprehensive one in [21] . Software written for a GPU always consist of two parts. One part that runs on the host computer, aka host code, and another part that runs on the device, aka kernel code. It is very common in one program to have several kernels executing in sequence or in parallel, later one is possible with CUDA streams. Kernels are implicitly loaded onto the device by CUDA runtime. They can access data structures stored on both, on-board device memory and much slower, but usually much larger host memory. To be placed on the on-board device memory, data structures have to be created first on the host computer and then explicitly loaded onto the device (GPU). Execution of a kernel happens in parallel up to the capacity of the device to do so. Each parallel flow of execution is known as a thread. Threads are organized in hierarchy of grid of blocks of threads. Threads within a block can share information through very fast shared memory. Significant amount of even faster register memory also available for each thread. Number of blocks and number of threads per block have upper limits. For GTX680 GPU grids containing blocks can be three-dimensional with maximum number of blocks 65535 × 65535 × 65535 and number of threads per blocks is limited to 1024 giving total number of threads an astonishing value of 2 58 . We can think of all of them as executing in parallel although in reality parallelism is ultimately limited by total number of available ALUs. At the simplest level CUDA programming could be understood as converting inner content of a loop into kernel code and replacing it with invocation of a kernel on the device. Inside of a kernel, index variable provided by a loop is replaced by a set of implicit variables indicating block number and thread number within a block. Together with dimensions of a block and grid they can be used to compute an equivalent of loop index. This is illustrated in the code snipped below, where only one of implicit variables threadIdx is shown. It defines position of thread within a block. 
. end
This kernel is later called with parameters indicating number of threads per block and number of blocks.
C and CUDA implementations
Using above mentioned algorithm two software packages were written. The C version that targets CPU and C/CUDA version for running on NVidia GPU. CUDA version was tested on consumer grade video card GTX680. Both implementations share the same memory layout. For C implementation memory layout is not important because computation is limited by speed of CPU. For CUDA version computation is limited by I/O speed between memory in a GPU and total number of available ALUs. Thus layout of arrays storing a (0) n,m , a n,m and b n,m and memory access patterns makes for the biggest difference in performance. We used a raw-major layout shown in Fig. 4 . To avoid divergent data flow at the boundaries we shift m index to the right and introduce zero cells along the perimeter or each array. These zero cells, highlighted in gray in Fig. 4 , ensure that boundary conditions (23) (24) are satisfied without use of if statements. Each thread computes next values of a and b for all of harmonic numbers for a given m value. In total we define 9 two-dimensional arrays. a where T(...) is implementation of operator (39). These code is repeated once more to compute a and b on fractional grid. There are several ways this can be transformed into a CUDA code. Two kernels are formed. One to move forward in time on the whole grid and another one for the fractional grid. Within each of the kernels several variants are possible. We identify each kernel as K x , where x is implementation number. The simplest one (K 1 ) is where one thread is allocated for each point of the grid. In K 2 (not shown below) we load a and b into shared array, which is first level of explicit cache in NVidia GPU. This way we can reduce memory access since nearby threads do access the same data structures. However, they share very little data and while benefits are noticeable they are not dramatic. Better and faster code is possible. In K 3 kernel shown above, each thread is responsible for computing a and b for all n. In this implementation we do not use shared memory buffer at all. Here instruction level parallelism within loop provides very big speed improvement comparing to kernels K 1 and K 2 . We can unroll loop to gain a bit more speed. In kernel K 4 loops are unrolled twice and in K 5 four times. We can also notice by looking at Fig. 4 that steps n and n+2 share a and b values at n+1. We take advantage of this in kernel K 5 , where we split each loop into two, each steeping over n with step 2, i.e. one loop with n=[0,2,4,...] and another one with n= [1,3,5,...] . In each loop we store a and b values at n+1 in registers and reuse them. This provides additional speed boost without any unrolling. Due to register pressure loop unrolling in K 6 does not provide any more speed gain and may even result in program becoming slower.
Results
We compare time needed to perform complete time evolution of f between all of the above mentioned implementations, CPU and 6 CUDA implementations. Also OpenMP 3 version of CPU implementation was tested. All code used float data type for storing a n,m and b n,m . We found that using double data type did not affect precision nor correspondence of results with known solutions. On the other hand GTX680, being consumer grade GPU, lucks in capability of performing calculations on double and it performance degrades noticeably when switching from float to double. Strait CPU implementation was single threaded and was tested on "Intel Core i7-3770" running at 3.4GHz. CUDA implementations were tested on NVidia GTX680 with 2GB or RAM. Results of each test case are presented in Table 1 . Test cases involved running each program 10 times and averaging resulted total run time. Actual run times are not important for they Impl: Fig. 4 . In K 4 main loops are unrolled twice and K 5 four times. In K 6 each loop over m is split in two, each steeping over 2 elements with lattice values reused in registers.
depend on passed parameters. Run time speed was compared against CPU implementation baseline and is presented as X times speed up. All CUDA implementations were tuned by varying block and grid sizes. It is common to measure lattice algorithms performance in Million Lattice Updates Per Second (MLUPS). That parameter is also shown. Note that in calculation of MLUPS we count update on movement only from time t to t + 1, i.e. on the whole grid only. Otherwise, if we include updates on the fractional grid, values of MLUPS would have to be doubled. Attained peak performance is 1094 MLUPS. Faster memory bandwidth and greater number of ALUs in later GPUs such as GTX-Titan (memory bandwidth 288GB/sec; 2688 CUDA cores) should give significantly higher peak value of MLUPS. For comparison, GTX680 used for this work has memory bandwidth of 192 GB/sec and 1536 CUDA cores (ALUs). On the example of our Boltzmann solver code you can see that even a consumer grade video card provides significant speed boost to computational tasks amenable to parallelisation. And if we take in the account low cost of such video cards, it is now possible to perform computations on the scale which just few years ago would require access to expensive supercomputers.
Conclusion
In this work we formulated numerical algorithm for solving Boltzmann transport equation applicable to semiconductor superlattice. Several different implementations of the algorithm were presented. One written in C for CPU and several for NVidia GPU using CUDA. We show that even in the most "naive" conversion of C to CUDA 16 fold speed performance is attained. Trying different optimization techniques discussed in this work CUDA code attains 118 fold speed improvement over the single threaded C code.
Acknowledgement
Author expressed gratitude to Kirill Alexseev and Jukka Isohätälä for very useful discussions while working on this paper.
