12,454 research outputs found
Parallel implementation of the TRANSIMS micro-simulation
This paper describes the parallel implementation of the TRANSIMS traffic
micro-simulation. The parallelization method is domain decomposition, which
means that each CPU of the parallel computer is responsible for a different
geographical area of the simulated region. We describe how information between
domains is exchanged, and how the transportation network graph is partitioned.
An adaptive scheme is used to optimize load balancing. We then demonstrate how
computing speeds of our parallel micro-simulations can be systematically
predicted once the scenario and the computer architecture are known. This makes
it possible, for example, to decide if a certain study is feasible with a
certain computing budget, and how to invest that budget. The main ingredients
of the prediction are knowledge about the parallel implementation of the
micro-simulation, knowledge about the characteristics of the partitioning of
the transportation network graph, and knowledge about the interaction of these
quantities with the computer system. In particular, we investigate the
differences between switched and non-switched topologies, and the effects of 10
Mbit, 100 Mbit, and Gbit Ethernet. keywords: Traffic simulation, parallel
computing, transportation planning, TRANSIM
Achieving Efficient Strong Scaling with PETSc using Hybrid MPI/OpenMP Optimisation
The increasing number of processing elements and decreas- ing memory to core
ratio in modern high-performance platforms makes efficient strong scaling a key
requirement for numerical algorithms. In order to achieve efficient scalability
on massively parallel systems scientific software must evolve across the entire
stack to exploit the multiple levels of parallelism exposed in modern
architectures. In this paper we demonstrate the use of hybrid MPI/OpenMP
parallelisation to optimise parallel sparse matrix-vector multiplication in
PETSc, a widely used scientific library for the scalable solution of partial
differential equations. Using large matrices generated by Fluidity, an open
source CFD application code which uses PETSc as its linear solver engine, we
evaluate the effect of explicit communication overlap using task-based
parallelism and show how to further improve performance by explicitly load
balancing threads within MPI processes. We demonstrate a significant speedup
over the pure-MPI mode and efficient strong scaling of sparse matrix-vector
multiplication on Fujitsu PRIMEHPC FX10 and Cray XE6 systems
One machine, one minute, three billion tetrahedra
This paper presents a new scalable parallelization scheme to generate the 3D
Delaunay triangulation of a given set of points. Our first contribution is an
efficient serial implementation of the incremental Delaunay insertion
algorithm. A simple dedicated data structure, an efficient sorting of the
points and the optimization of the insertion algorithm have permitted to
accelerate reference implementations by a factor three. Our second contribution
is a multi-threaded version of the Delaunay kernel that is able to concurrently
insert vertices. Moore curve coordinates are used to partition the point set,
avoiding heavy synchronization overheads. Conflicts are managed by modifying
the partitions with a simple rescaling of the space-filling curve. The
performances of our implementation have been measured on three different
processors, an Intel core-i7, an Intel Xeon Phi and an AMD EPYC, on which we
have been able to compute 3 billion tetrahedra in 53 seconds. This corresponds
to a generation rate of over 55 million tetrahedra per second. We finally show
how this very efficient parallel Delaunay triangulation can be integrated in a
Delaunay refinement mesh generator which takes as input the triangulated
surface boundary of the volume to mesh
A Functional Architecture Approach to Neural Systems
The technology for the design of systems to perform extremely complex combinations of real-time functionality has developed over a long period. This technology is based on the use of a hardware architecture with a physical separation into memory and processing, and a software architecture which divides functionality into a disciplined hierarchy of software components which exchange unambiguous information. This technology experiences difficulty in design of systems to perform parallel processing, and extreme difficulty in design of systems which can heuristically change their own functionality. These limitations derive from the approach to information exchange between functional components. A design approach in which functional components can exchange ambiguous information leads to systems with the recommendation architecture which are less subject to these limitations. Biological brains have been constrained by natural pressures to adopt functional architectures with this different information exchange approach. Neural networks have not made a complete shift to use of ambiguous information, and do not address adequate management of context for ambiguous information exchange between modules. As a result such networks cannot be scaled to complex functionality. Simulations of systems with the recommendation architecture demonstrate the capability to heuristically organize to perform complex functionality
Parallel processors and nonlinear structural dynamics algorithms and software
The adaptation of a finite element program with explicit time integration to a massively parallel SIMD (single instruction multiple data) computer, the CONNECTION Machine is described. The adaptation required the development of a new algorithm, called the exchange algorithm, in which all nodal variables are allocated to the element with an exchange of nodal forces at each time step. The architectural and C* programming language features of the CONNECTION Machine are also summarized. Various alternate data structures and associated algorithms for nonlinear finite element analysis are discussed and compared. Results are presented which demonstrate that the CONNECTION Machine is capable of outperforming the CRAY XMP/14
Empowering parallel computing with field programmable gate arrays
After more than 30 years, reconfigurable computing has grown from a concept to a mature field of science and technology. The cornerstone of this evolution is the field programmable gate array, a building block enabling the configuration of a custom hardware architecture. The departure from static von Neumannlike architectures opens the way to eliminate the instruction overhead and to optimize the execution speed and power consumption. FPGAs now live in a growing ecosystem of development tools, enabling software programmers to map algorithms directly onto hardware. Applications abound in many directions, including data centers, IoT, AI, image processing and space exploration. The increasing success of FPGAs is largely due to an improved toolchain with solid high-level synthesis support as well as a better integration with processor and memory systems. On the other hand, long compile times and complex design exploration remain areas for improvement. In this paper we address the evolution of FPGAs towards advanced multi-functional accelerators, discuss different programming models and their HLS language implementations, as well as high-performance tuning of FPGAs integrated into a heterogeneous platform. We pinpoint fallacies and pitfalls, and identify opportunities for language enhancements and architectural refinements
Distributed memory compiler methods for irregular problems: Data copy reuse and runtime partitioning
Outlined here are two methods which we believe will play an important role in any distributed memory compiler able to handle sparse and unstructured problems. We describe how to link runtime partitioners to distributed memory compilers. In our scheme, programmers can implicitly specify how data and loop iterations are to be distributed between processors. This insulates users from having to deal explicitly with potentially complex algorithms that carry out work and data partitioning. We also describe a viable mechanism for tracking and reusing copies of off-processor data. In many programs, several loops access the same off-processor memory locations. As long as it can be verified that the values assigned to off-processor memory locations remain unmodified, we show that we can effectively reuse stored off-processor data. We present experimental data from a 3-D unstructured Euler solver run on iPSC/860 to demonstrate the usefulness of our methods
- …