2,734 research outputs found
Computational methods and software systems for dynamics and control of large space structures
Two key areas of crucial importance to the computer-based simulation of large space structures are discussed. The first area involves multibody dynamics (MBD) of flexible space structures, with applications directed to deployment, construction, and maneuvering. The second area deals with advanced software systems, with emphasis on parallel processing. The latest research thrust in the second area involves massively parallel computers
Asynchronous and corrected-asynchronous numerical solutions of parabolic PDES on MIMD multiprocessors
A major problem in achieving significant speed-up on parallel machines is the overhead involved with synchronizing the concurrent process. Removing the synchronization constraint has the potential of speeding up the computation. The authors present asynchronous (AS) and corrected-asynchronous (CA) finite difference schemes for the multi-dimensional heat equation. Although the discussion concentrates on the Euler scheme for the solution of the heat equation, it has the potential for being extended to other schemes and other parabolic partial differential equations (PDEs). These schemes are analyzed and implemented on the shared memory multi-user Sequent Balance machine. Numerical results for one and two dimensional problems are presented. It is shown experimentally that the synchronization penalty can be about 50 percent of run time: in most cases, the asynchronous scheme runs twice as fast as the parallel synchronous scheme. In general, the efficiency of the parallel schemes increases with processor load, with the time level, and with the problem dimension. The efficiency of the AS may reach 90 percent and over, but it provides accurate results only for steady-state values. The CA, on the other hand, is less efficient, but provides more accurate results for intermediate (non steady-state) values
Alternating-Direction Line-Relaxation Methods on Multicomputers
We study the multicom.puter performance of a three-dimensional Navier–Stokes solver based on alternating-direction line-relaxation methods. We compare several multicomputer implementations, each of which combines a particular line-relaxation method and a particular distributed block-tridiagonal solver. In our experiments, the problem size was determined by resolution requirements of the application. As a result, the granularity of the computations of our study is finer than is customary in the performance analysis of concurrent block-tridiagonal solvers. Our best results were obtained with a modified half-Gauss–Seidel line-relaxation method implemented by means of a new iterative block-tridiagonal solver that is developed here. Most computations were performed on the Intel Touchstone Delta, but we also used the Intel Paragon XP/S, the Parsytec SC-256, and the Fujitsu S-600 for comparison
A simple parallel prefix algorithm for compact finite-difference schemes
A compact scheme is a discretization scheme that is advantageous in obtaining highly accurate solutions. However, the resulting systems from compact schemes are tridiagonal systems that are difficult to solve efficiently on parallel computers. Considering the almost symmetric Toeplitz structure, a parallel algorithm, simple parallel prefix (SPP), is proposed. The SPP algorithm requires less memory than the conventional LU decomposition and is highly efficient on parallel machines. It consists of a prefix communication pattern and AXPY operations. Both the computation and the communication can be truncated without degrading the accuracy when the system is diagonally dominant. A formal accuracy study was conducted to provide a simple truncation formula. Experimental results were measured on a MasPar MP-1 SIMD machine and on a Cray 2 vector machine. Experimental results show that the simple parallel prefix algorithm is a good algorithm for the compact scheme on high-performance computers
STOCHSIMGPU Parallel stochastic simulation for the Systems\ud Biology Toolbox 2 for MATLAB
Motivation: The importance of stochasticity in biological systems is becoming increasingly recognised and the computational cost of biologically realistic stochastic simulations urgently requires development of efficient software. We present a new software tool STOCHSIMGPU which exploits graphics processing units (GPUs)for parallel stochastic simulations of biological/chemical reaction systems and show that significant gains in efficiency can be made. It is integrated into MATLAB and works with the Systems Biology Toolbox 2 (SBTOOLBOX2) for MATLAB.\ud
\ud
Results: The GPU-based parallel implementation of the Gillespie stochastic simulation algorithm (SSA), the logarithmic direct method (LDM), and the next reaction method (NRM) is approximately 85 times faster than the sequential implementation of the NRM on a central processing unit (CPU). Using our software does not require any changes to the user’s models, since it acts as a direct replacement of the stochastic simulation software of the SBTOOLBOX2
Hierarchical fractional-step approximations and parallel kinetic Monte Carlo algorithms
We present a mathematical framework for constructing and analyzing parallel
algorithms for lattice Kinetic Monte Carlo (KMC) simulations. The resulting
algorithms have the capacity to simulate a wide range of spatio-temporal scales
in spatially distributed, non-equilibrium physiochemical processes with complex
chemistry and transport micro-mechanisms. The algorithms can be tailored to
specific hierarchical parallel architectures such as multi-core processors or
clusters of Graphical Processing Units (GPUs). The proposed parallel algorithms
are controlled-error approximations of kinetic Monte Carlo algorithms,
departing from the predominant paradigm of creating parallel KMC algorithms
with exactly the same master equation as the serial one.
Our methodology relies on a spatial decomposition of the Markov operator
underlying the KMC algorithm into a hierarchy of operators corresponding to the
processors' structure in the parallel architecture. Based on this operator
decomposition, we formulate Fractional Step Approximation schemes by employing
the Trotter Theorem and its random variants; these schemes, (a) determine the
communication schedule} between processors, and (b) are run independently on
each processor through a serial KMC simulation, called a kernel, on each
fractional step time-window.
Furthermore, the proposed mathematical framework allows us to rigorously
justify the numerical and statistical consistency of the proposed algorithms,
showing the convergence of our approximating schemes to the original serial
KMC. The approach also provides a systematic evaluation of different processor
communicating schedules.Comment: 34 pages, 9 figure
Solving Grid Equations Using the Alternating-triangular Method on a Graphics Accelerator
The paper describes a parallel-pipeline implementation of solving grid equations using the modified alternating-triangular iterative method (MATM), obtained by numerically solving the equations of mathematical physics. The greatest computational costs at using this method are on the stages of solving a system of linear algebraic equations (SLAE) with lower triangular and upper non-triangular matrices. An algorithm for solving the SLAE with a lower triangular matrix on a graphics accelerator using NVIDIA CUDA technology is presented. To implement the parallel-pipeline method, a three-dimensional decomposition of the computational domain was used. It is divided into blocks along the y coordinate, the number of which corresponds to the number of GPU streaming multiprocessors involved in the calculations. In turn, the blocks are divided into fragments according to two spatial coordinates — x and z. The presented graph model describes the relationship between adjacent fragments of the computational grid and the pipeline calculation process. Based on the results of computational experiments, a regression model was obtained that describes the dependence of the time for calculation one MATM step on the GPU, the acceleration and efficiency for SLAE solution with a lower triangular matrix by the parallel-pipeline method on the GPU were calculated using the different number of streaming multiprocessors.The paper describes a parallel-pipeline implementation of solving grid equations using the modified alternating-triangular iterative method (MATM), obtained by numerically solving the equations of mathematical physics. The greatest computational costs at using this method are on the stages of solving a system of linear algebraic equations (SLAE) with lower triangular and upper non-triangular matrices. An algorithm for solving the SLAE with a lower triangular matrix on a graphics accelerator using NVIDIA CUDA technology is presented. To implement the parallel-pipeline method, a three-dimensional decomposition of the computational domain was used. It is divided into blocks along the y coordinate, the number of which corresponds to the number of GPU streaming multiprocessors involved in the calculations. In turn, the blocks are divided into fragments according to two spatial coordinates — x and z. The presented graph model describes the relationship between adjacent fragments of the computational grid and the pipeline calculation process. Based on the results of computational experiments, a regression model was obtained that describes the dependence of the time for calculation one MATM step on the GPU, the acceleration and efficiency for SLAE solution with a lower triangular matrix by the parallel-pipeline method on the GPU were calculated using the different number of streaming multiprocessors
Optimal pre-scheduling of problem remappings
A large class of scientific computational problems can be characterized as a sequence of steps where a significant amount of computation occurs each step, but the work performed at each step is not necessarily identical. Two good examples of this type of computation are: (1) regridding methods which change the problem discretization during the course of the computation, and (2) methods for solving sparse triangular systems of linear equations. Recent work has investigated a means of mapping such computations onto parallel processors; the method defines a family of static mappings with differing degrees of importance placed on the conflicting goals of good load balance and low communication/synchronization overhead. The performance tradeoffs are controllable by adjusting the parameters of the mapping method. To achieve good performance it may be necessary to dynamically change these parameters at run-time, but such changes can impose additional costs. If the computation's behavior can be determined prior to its execution, it can be possible to construct an optimal parameter schedule using a low-order-polynomial-time dynamic programming algorithm. Since the latter can be expensive, the performance is studied of the effect of a linear-time scheduling heuristic on one of the model problems, and it is shown to be effective and nearly optimal
- …