4,973 research outputs found
Parallel Sort-Based Matching for Data Distribution Management on Shared-Memory Multiprocessors
In this paper we consider the problem of identifying intersections between
two sets of d-dimensional axis-parallel rectangles. This is a common problem
that arises in many agent-based simulation studies, and is of central
importance in the context of High Level Architecture (HLA), where it is at the
core of the Data Distribution Management (DDM) service. Several realizations of
the DDM service have been proposed; however, many of them are either
inefficient or inherently sequential. These are serious limitations since
multicore processors are now ubiquitous, and DDM algorithms -- being
CPU-intensive -- could benefit from additional computing power. We propose a
parallel version of the Sort-Based Matching algorithm for shared-memory
multiprocessors. Sort-Based Matching is one of the most efficient serial
algorithms for the DDM problem, but is quite difficult to parallelize due to
data dependencies. We describe the algorithm and compute its asymptotic running
time; we complete the analysis by assessing its performance and scalability
through extensive experiments on two commodity multicore systems based on a
dual socket Intel Xeon processor, and a single socket Intel Core i7 processor.Comment: Proceedings of the 21-th ACM/IEEE International Symposium on
Distributed Simulation and Real Time Applications (DS-RT 2017). Best Paper
Award @DS-RT 201
The FORCE: A highly portable parallel programming language
Here, it is explained why the FORCE parallel programming language is easily portable among six different shared-memory microprocessors, and how a two-level macro preprocessor makes it possible to hide low level machine dependencies and to build machine-independent high level constructs on top of them. These FORCE constructs make it possible to write portable parallel programs largely independent of the number of processes and the specific shared memory multiprocessor executing them
Cache Equalizer: A Cache Pressure Aware Block Placement Scheme for Large-Scale Chip Multiprocessors
This paper describes Cache Equalizer (CE), a novel distributed cache management scheme for large scale chip multiprocessors (CMPs). Our work is motivated by large asymmetry in cache sets usages. CE decouples the physical locations of cache blocks from their addresses for the sake of reducing misses caused by destructive interferences. Temporal pressure at the on-chip last-level cache, is continuously collected at a group (comprised of cache sets) granularity, and periodically recorded at the memory controller to guide the placement process. An incoming block is consequently placed at a cache group that exhibits the minimum pressure. CE provides Quality of Service (QoS) by robustly offering better performance than the baseline shared NUCA cache. Simulation results using a full-system simulator demonstrate that CE outperforms shared NUCA caches by an average of 15.5% and by as much as 28.5% for the benchmark programs we examined. Furthermore, evaluations manifested the outperformance of CE versus related CMP cache designs
Shared versus distributed memory multiprocessors
The question of whether multiprocessors should have shared or distributed memory has attracted a great deal of attention. Some researchers argue strongly for building distributed memory machines, while others argue just as strongly for programming shared memory multiprocessors. A great deal of research is underway on both types of parallel systems. Special emphasis is placed on systems with a very large number of processors for computation intensive tasks and considers research and implementation trends. It appears that the two types of systems will likely converge to a common form for large scale multiprocessors
CoreTSAR: Task Scheduling for Accelerator-aware Runtimes
Heterogeneous supercomputers that incorporate computational accelerators
such as GPUs are increasingly popular due to their high
peak performance, energy efficiency and comparatively low cost.
Unfortunately, the programming models and frameworks designed
to extract performance from all computational units still lack the
flexibility of their CPU-only counterparts. Accelerated OpenMP
improves this situation by supporting natural migration of OpenMP
code from CPUs to a GPU. However, these implementations currently
lose one of OpenMP’s best features, its flexibility: typical
OpenMP applications can run on any number of CPUs. GPU implementations
do not transparently employ multiple GPUs on a node
or a mix of GPUs and CPUs. To address these shortcomings, we
present CoreTSAR, our runtime library for dynamically scheduling
tasks across heterogeneous resources, and propose straightforward
extensions that incorporate this functionality into Accelerated
OpenMP. We show that our approach can provide nearly linear
speedup to four GPUs over only using CPUs or one GPU while
increasing the overall flexibility of Accelerated OpenMP
Parallel processing and expert systems
Whether it be monitoring the thermal subsystem of Space Station Freedom, or controlling the navigation of the autonomous rover on Mars, NASA missions in the 1990s cannot enjoy an increased level of autonomy without the efficient implementation of expert systems. Merely increasing the computational speed of uniprocessors may not be able to guarantee that real-time demands are met for larger systems. Speedup via parallel processing must be pursued alongside the optimization of sequential implementations. Prototypes of parallel expert systems have been built at universities and industrial laboratories in the U.S. and Japan. The state-of-the-art research in progress related to parallel execution of expert systems is surveyed. The survey discusses multiprocessors for expert systems, parallel languages for symbolic computations, and mapping expert systems to multiprocessors. Results to date indicate that the parallelism achieved for these systems is small. The main reasons are (1) the body of knowledge applicable in any given situation and the amount of computation executed by each rule firing are small, (2) dividing the problem solving process into relatively independent partitions is difficult, and (3) implementation decisions that enable expert systems to be incrementally refined hamper compile-time optimization. In order to obtain greater speedups, data parallelism and application parallelism must be exploited
Simulation of 1+1 dimensional surface growth and lattices gases using GPUs
Restricted solid on solid surface growth models can be mapped onto binary
lattice gases. We show that efficient simulation algorithms can be realized on
GPUs either by CUDA or by OpenCL programming. We consider a
deposition/evaporation model following Kardar-Parisi-Zhang growth in 1+1
dimensions related to the Asymmetric Simple Exclusion Process and show that for
sizes, that fit into the shared memory of GPUs one can achieve the maximum
parallelization speedup ~ x100 for a Quadro FX 5800 graphics card with respect
to a single CPU of 2.67 GHz). This permits us to study the effect of quenched
columnar disorder, requiring extremely long simulation times. We compare the
CUDA realization with an OpenCL implementation designed for processor clusters
via MPI. A two-lane traffic model with randomized turning points is also
realized and the dynamical behavior has been investigated.Comment: 20 pages 12 figures, 1 table, to appear in Comp. Phys. Com
An exploration of CUDA and CBEA for a gravitational wave data-analysis application (Einstein@Home)
We present a detailed approach for making use of two new computer hardware
architectures -- CBEA and CUDA -- for accelerating a scientific data-analysis
application (Einstein@Home). Our results suggest that both the architectures
suit the application quite well and the achievable performance in the same
software developmental time-frame, is nearly identical.Comment: Accepted for publication in International Conference on Parallel
Processing and Applied Mathematics (PPAM 2009
- …