
    C programs for solving the time-dependent Gross-Pitaevskii equation in a fully anisotropic trap

    We present C programming language versions of earlier published Fortran programs (Muruganandam and Adhikari, Comput. Phys. Commun. 180 (2009) 1888) for calculating both stationary and non-stationary solutions of the time-dependent Gross-Pitaevskii (GP) equation. The GP equation describes the properties of dilute Bose-Einstein condensates at ultra-cold temperatures. The C versions of the programs use the same algorithms as the Fortran ones, involving real- and imaginary-time propagation based on a split-step Crank-Nicolson method. In a one-space-variable form of the GP equation, we consider the one-dimensional, two-dimensional circularly-symmetric, and three-dimensional spherically-symmetric harmonic-oscillator traps. In the two-space-variable form, we consider the GP equation in two-dimensional anisotropic and three-dimensional axially-symmetric traps. The fully-anisotropic three-dimensional GP equation is also considered. In addition to these twelve programs, for six algorithms that involve two and three space variables, we have also developed threaded (OpenMP parallelized) programs, which allow numerical simulations to use all available CPU cores on a computer. All 18 programs are optimized and accompanied by makefiles for several popular C compilers. We present typical results for scalability of threaded codes and demonstrate almost linear speedup obtained with the new programs, allowing a decrease in execution times by an order of magnitude on modern multi-core computers.
    Comment: 8 pages, 1 figure; 18 C programs included (to download, click other and download the source
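As an illustration of the imaginary-time split-step Crank-Nicolson idea named in this abstract, here is a minimal one-dimensional sketch in Python. It is not the published C or Fortran code. It assumes the dimensionless form H = -(1/2) d²/dx² + (1/2) x² + g|ψ|² (the published programs may use a different scaling), zero boundary values, and hypothetical names (`solve_tridiagonal`, `imaginary_time_ground_state`, `energy`) and grid parameters chosen only for this example.

```python
import math


def solve_tridiagonal(off, diag, rhs):
    """Thomas algorithm for a tridiagonal system with constant diagonal
    `diag` and constant off-diagonals `off`."""
    n = len(rhs)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = off / diag
    d[0] = rhs[0] / diag
    for i in range(1, n):
        denom = diag - off * c[i - 1]
        c[i] = off / denom
        d[i] = (rhs[i] - off * d[i - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def normalize(psi, h):
    """Rescale psi so that h * sum(psi^2) == 1."""
    norm = math.sqrt(h * sum(p * p for p in psi))
    return [p / norm for p in psi]


def imaginary_time_ground_state(g=0.0, n=129, x_max=8.0, dt=0.005, steps=1500):
    """Relax a trial function toward the ground state (assumed 1D scaling)."""
    h = 2.0 * x_max / (n - 1)
    xs = [-x_max + j * h for j in range(n)]
    psi = normalize([math.exp(-x * x) for x in xs], h)  # deliberately too narrow
    a = dt / (4.0 * h * h)
    for _ in range(steps):
        # Sub-step 1: trap potential and nonlinear term, applied exactly.
        psi = [p * math.exp(-dt * (0.5 * x * x + g * p * p))
               for p, x in zip(psi, xs)]
        # Sub-step 2: kinetic term by Crank-Nicolson,
        # (1 + dt/2 K) psi_new = (1 - dt/2 K) psi_old, with K = -(1/2) d2/dx2.
        rhs = []
        for j in range(n):
            left = psi[j - 1] if j > 0 else 0.0
            right = psi[j + 1] if j < n - 1 else 0.0
            rhs.append(psi[j] + a * (left - 2.0 * psi[j] + right))
        psi = solve_tridiagonal(-a, 1.0 + 2.0 * a, rhs)
        # Imaginary-time propagation does not conserve the norm; restore it.
        psi = normalize(psi, h)
    return xs, psi, h


def energy(xs, psi, h, g):
    """Energy per particle: kinetic + trap + (g/2) * density^2."""
    kinetic = sum((psi[j + 1] - psi[j]) ** 2
                  for j in range(len(psi) - 1)) / (2.0 * h)
    potential = h * sum((0.5 * x * x + 0.5 * g * p * p) * p * p
                        for x, p in zip(xs, psi))
    return kinetic + potential
```

With g = 0 the result can be checked against the exact harmonic-oscillator ground-state energy, which is 0.5 in the units assumed here; a repulsive g > 0 should raise the energy.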

    Simple and Effective Type Check Removal through Lazy Basic Block Versioning

    Dynamically typed programming languages such as JavaScript and Python defer type checking to run time. In order to maximize performance, dynamic language VM implementations must attempt to eliminate redundant dynamic type checks. However, type inference analyses are often costly and involve tradeoffs between compilation time and resulting precision. This has led to the creation of increasingly complex multi-tiered VM architectures. This paper introduces lazy basic block versioning, a simple JIT compilation technique which effectively removes redundant type checks from critical code paths. This novel approach lazily generates type-specialized versions of basic blocks on the fly while propagating context-dependent type information. This does not require the use of costly program analyses, is not restricted by the precision limitations of traditional type analyses, and avoids the implementation complexity of speculative optimization techniques. We have implemented intraprocedural lazy basic block versioning in a JavaScript JIT compiler. This approach is compared with a classical flow-based type analysis. Lazy basic block versioning performs as well as or better than this analysis on all benchmarks. On average, 71% of type tests are eliminated, yielding speedups of up to 50%. We also show that our implementation generates more efficient machine code than TraceMonkey, a tracing JIT compiler for JavaScript, on several benchmarks. The combination of implementation simplicity, low algorithmic complexity, and good run time performance makes basic block versioning attractive for baseline JIT compilers.
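The core idea in this abstract, generating a block version only when execution first reaches it with a given set of known types, can be sketched with a toy interpreter. This is a simplified model under stated assumptions, not the paper's JIT compiler: the instruction set (`check`, `inc`), the `Versioner` class, and the `run` function are all hypothetical, and a failed type test simply raises instead of branching to another version.

```python
class Versioner:
    """Caches one specialized version of a block per incoming type context."""

    def __init__(self, blocks):
        self.blocks = blocks    # name -> (instructions, terminator)
        self.versions = {}      # (name, context) -> compiled version
        self.tests_emitted = 0
        self.tests_elided = 0

    def get_version(self, name, context):
        # Lazy: a version is generated only when it is first requested.
        key = (name, context)
        if key not in self.versions:
            self.versions[key] = self._compile(name, context)
        return self.versions[key]

    def _compile(self, name, context):
        instructions, terminator = self.blocks[name]
        known = dict(context)   # variable -> type name known on entry
        code = []
        for op, var, arg in instructions:
            if op == "check":
                if known.get(var) == arg:
                    self.tests_elided += 1   # type already proven: drop test
                    continue
                self.tests_emitted += 1
                known[var] = arg             # holds after the test succeeds
            # Assumption of this toy: "inc" keeps an int an int.
            code.append((op, var, arg))
        return code, terminator, frozenset(known.items())


def run(versioner, entry, env):
    """Execute from `entry`; return how many type tests ran at run time."""
    executed_checks = 0
    name, context = entry, frozenset()
    while name is not None:
        code, terminator, out_context = versioner.get_version(name, context)
        for op, var, arg in code:
            if op == "check":
                executed_checks += 1
                if type(env[var]).__name__ != arg:
                    raise TypeError(f"{var} is not {arg}")
            elif op == "inc":
                env[var] += 1
        left, right, target = terminator     # loop back while left < right
        name = target if env[left] < env[right] else None
        context = out_context                # propagate the learned types
    return executed_checks


# A one-block counting loop that tests both operands on every iteration.
blocks = {
    "body": (
        [("check", "i", "int"), ("check", "n", "int"), ("inc", "i", None)],
        ("i", "n", "body"),
    )
}
versioner = Versioner(blocks)
env = {"i": 0, "n": 5}
checks_run = run(versioner, "body", env)
```

In this sketch the loop body ends up with two versions: one entered with no type information, which keeps both tests, and one entered after the first iteration, in which both tests are elided because the incoming context already records `i` and `n` as integers.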

    Importance of Explicit Vectorization for CPU and GPU Software Performance

    Much of the current focus in high-performance computing is on multi-threading, multi-computing, and graphics processing unit (GPU) computing. However, vectorization and non-parallel optimization techniques, which can often be employed additionally, are less frequently discussed. In this paper, we present an analysis of several optimizations done on both central processing unit (CPU) and GPU implementations of a particular computationally intensive Metropolis Monte Carlo algorithm. Explicit vectorization on the CPU and the equivalent, explicit memory coalescing, on the GPU are found to be critical to achieving good performance of this algorithm in both environments. The fully-optimized CPU version achieves a 9x to 12x speedup over the original CPU version, in addition to speedup from multi-threading. The fully-optimized CPU version is also 2x faster than the fully-optimized GPU version.
    Comment: 17 pages, 17 figures