C programs for solving the time-dependent Gross-Pitaevskii equation in a fully anisotropic trap
We present C programming language versions of earlier published Fortran
programs (Muruganandam and Adhikari, Comput. Phys. Commun. 180 (2009) 1888) for
calculating both stationary and non-stationary solutions of the time-dependent
Gross-Pitaevskii (GP) equation. The GP equation describes the properties of
dilute Bose-Einstein condensates at ultra-cold temperatures. The C versions of
the programs use the same algorithms as the Fortran ones, involving real- and
imaginary-time propagation based on a split-step Crank-Nicolson method. In a
one-space-variable form of the GP equation, we consider the one-dimensional,
two-dimensional, circularly-symmetric, and the three-dimensional
spherically-symmetric harmonic-oscillator traps. In the two-space-variable
form, we consider the GP equation in two-dimensional anisotropic and
three-dimensional axially-symmetric traps. The fully-anisotropic
three-dimensional GP equation is also considered. In addition to these twelve
programs, for six algorithms that involve two and three space variables, we
have also developed threaded (OpenMP parallelized) programs, which allow
numerical simulations to use all available CPU cores on a computer. All 18
programs are optimized and accompanied by makefiles for several popular C
compilers. We present typical results for scalability of threaded codes and
demonstrate almost linear speedup obtained with the new programs, allowing a
decrease in execution times by an order of magnitude on modern multi-core
computers.
Comment: 8 pages, 1 figure; 18 C programs included
Simple and Effective Type Check Removal through Lazy Basic Block Versioning
Dynamically typed programming languages such as JavaScript and Python defer
type checking to run time. In order to maximize performance, dynamic language
VM implementations must attempt to eliminate redundant dynamic type checks.
However, type inference analyses are often costly and involve tradeoffs between
compilation time and resulting precision. This has led to the creation of
increasingly complex multi-tiered VM architectures.
This paper introduces lazy basic block versioning, a simple JIT compilation
technique which effectively removes redundant type checks from critical code
paths. This novel approach lazily generates type-specialized versions of basic
blocks on-the-fly while propagating context-dependent type information. This
does not require the use of costly program analyses, is not restricted by the
precision limitations of traditional type analyses and avoids the
implementation complexity of speculative optimization techniques.
We have implemented intraprocedural lazy basic block versioning in a
JavaScript JIT compiler. This approach is compared with a classical flow-based
type analysis. Lazy basic block versioning performs as well or better on all
benchmarks. On average, 71% of type tests are eliminated, yielding speedups of
up to 50%. We also show that our implementation generates more efficient
machine code than TraceMonkey, a tracing JIT compiler for JavaScript, on
several benchmarks. The combination of implementation simplicity, low
algorithmic complexity and good run time performance makes basic block
versioning attractive for baseline JIT compilers.
Importance of Explicit Vectorization for CPU and GPU Software Performance
Much of the current focus in high-performance computing is on
multi-threading, multi-computing, and graphics processing unit (GPU) computing.
However, vectorization and non-parallel optimization techniques, which can
often be employed additionally, are less frequently discussed. In this paper,
we present an analysis of several optimizations done on both central processing
unit (CPU) and GPU implementations of a particular computationally intensive
Metropolis Monte Carlo algorithm. Explicit vectorization on the CPU and the
equivalent, explicit memory coalescing, on the GPU are found to be critical to
achieving good performance of this algorithm in both environments. The
fully-optimized CPU version achieves a 9x to 12x speedup over the original CPU
version, in addition to speedup from multi-threading. This is 2x faster than
the fully-optimized GPU version.
Comment: 17 pages, 17 figures