15,859 research outputs found
Parallel and Distributed Simulation from Many Cores to the Public Cloud (Extended Version)
In this tutorial paper, we will firstly review some basic simulation concepts
and then introduce the parallel and distributed simulation techniques in view
of some new challenges of today and tomorrow. More in particular, in the last
years there has been a wide diffusion of many cores architectures and we can
expect this trend to continue. On the other hand, the success of cloud
computing is strongly promoting the everything as a service paradigm. Is
parallel and distributed simulation ready for these new challenges? The current
approaches present many limitations in terms of usability and adaptivity: there
is a strong need for new evaluation metrics and for revising the currently
implemented mechanisms. In the last part of the paper, we propose a new
approach based on multi-agent systems for the simulation of complex systems. It
is possible to implement advanced techniques such as the migration of simulated
entities in order to build mechanisms that are both adaptive and very easy to
use. Adaptive mechanisms are able to significantly reduce the communication
cost in the parallel/distributed architectures, to implement load-balance
techniques and to cope with execution environments that are both variable and
dynamic. Finally, such mechanisms will be used to build simulations on top of
unreliable cloud services.Comment: Tutorial paper published in the Proceedings of the International
Conference on High Performance Computing and Simulation (HPCS 2011). Istanbul
(Turkey), IEEE, July 2011. ISBN 978-1-61284-382-
HeTM: Transactional Memory for Heterogeneous Systems
Modern heterogeneous computing architectures, which couple multi-core CPUs
with discrete many-core GPUs (or other specialized hardware accelerators),
enable unprecedented peak performance and energy efficiency levels.
Unfortunately, though, developing applications that can take full advantage of
the potential of heterogeneous systems is a notoriously hard task. This work
takes a step towards reducing the complexity of programming heterogeneous
systems by introducing the abstraction of Heterogeneous Transactional Memory
(HeTM). HeTM provides programmers with the illusion of a single memory region,
shared among the CPUs and the (discrete) GPU(s) of a heterogeneous system, with
support for atomic transactions. Besides introducing the abstract semantics and
programming model of HeTM, we present the design and evaluation of a concrete
implementation of the proposed abstraction, which we named Speculative HeTM
(SHeTM). SHeTM makes use of a novel design that leverages on speculative
techniques and aims at hiding the inherently large communication latency
between CPUs and discrete GPUs and at minimizing inter-device synchronization
overhead. SHeTM is based on a modular and extensible design that allows for
easily integrating alternative TM implementations on the CPU's and GPU's sides,
which allows the flexibility to adopt, on either side, the TM implementation
(e.g., in hardware or software) that best fits the applications' workload and
the architectural characteristics of the processing unit. We demonstrate the
efficiency of the SHeTM via an extensive quantitative study based both on
synthetic benchmarks and on a porting of a popular object caching system.Comment: The current work was accepted in the 28th International Conference on
Parallel Architectures and Compilation Techniques (PACT'19
Multicore-aware parallel temporal blocking of stencil codes for shared and distributed memory
New algorithms and optimization techniques are needed to balance the
accelerating trend towards bandwidth-starved multicore chips. It is well known
that the performance of stencil codes can be improved by temporal blocking,
lessening the pressure on the memory interface. We introduce a new pipelined
approach that makes explicit use of shared caches in multicore environments and
minimizes synchronization and boundary overhead. For clusters of shared-memory
nodes we demonstrate how temporal blocking can be employed successfully in a
hybrid shared/distributed-memory environment.Comment: 9 pages, 6 figure
Architectural support for task dependence management with flexible software scheduling
The growing complexity of multi-core architectures has motivated a wide range of software mechanisms to improve the orchestration of parallel executions. Task parallelism has become a very attractive approach thanks to its programmability, portability and potential for optimizations. However, with the expected increase in core counts, finer-grained tasking will be required to exploit the available parallelism, which will increase the overheads introduced by the runtime system. This work presents Task Dependence Manager (TDM), a hardware/software co-designed mechanism to mitigate runtime system overheads. TDM introduces a hardware unit, denoted Dependence Management Unit (DMU), and minimal ISA extensions that allow the runtime system to offload costly dependence tracking operations to the DMU and to still perform task scheduling in software. With lower hardware cost, TDM outperforms hardware-based solutions and enhances the flexibility, adaptability and composability of the system. Results show that TDM improves performance by 12.3% and reduces EDP by 20.4% on average with respect to a software runtime system. Compared to a runtime system fully implemented in hardware, TDM achieves an average speedup of 4.2% with 7.3x less area requirements and significant EDP reductions. In addition, five different software schedulers are evaluated with TDM, illustrating its flexibility and performance gains.This work has been supported by the RoMoL ERC Advanced Grant (GA 321253), by the European HiPEAC Network of Excellence, by the Spanish Ministry of Science and
Innovation (contracts TIN2015-65316-P, TIN2016-76635-C2-2-R and TIN2016-81840-REDT), by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), and by the European Union’s Horizon 2020 research and innovation programme under grant agreement No 671697 and No. 671610. M. Moretó has been partially supported by the Ministry of Economy and Competitiveness under Juan de la Cierva postdoctoral fellowship number JCI-2012-15047.Peer ReviewedPostprint (author's final draft
Parallel Discrete Event Simulation with Erlang
Discrete Event Simulation (DES) is a widely used technique in which the state
of the simulator is updated by events happening at discrete points in time
(hence the name). DES is used to model and analyze many kinds of systems,
including computer architectures, communication networks, street traffic, and
others. Parallel and Distributed Simulation (PADS) aims at improving the
efficiency of DES by partitioning the simulation model across multiple
processing elements, in order to enabling larger and/or more detailed studies
to be carried out. The interest on PADS is increasing since the widespread
availability of multicore processors and affordable high performance computing
clusters. However, designing parallel simulation models requires considerable
expertise, the result being that PADS techniques are not as widespread as they
could be. In this paper we describe ErlangTW, a parallel simulation middleware
based on the Time Warp synchronization protocol. ErlangTW is entirely written
in Erlang, a concurrent, functional programming language specifically targeted
at building distributed systems. We argue that writing parallel simulation
models in Erlang is considerably easier than using conventional programming
languages. Moreover, ErlangTW allows simulation models to be executed either on
single-core, multicore and distributed computing architectures. We describe the
design and prototype implementation of ErlangTW, and report some preliminary
performance results on multicore and distributed architectures using the well
known PHOLD benchmark.Comment: Proceedings of ACM SIGPLAN Workshop on Functional High-Performance
Computing (FHPC 2012) in conjunction with ICFP 2012. ISBN: 978-1-4503-1577-
Virtual Machine Support for Many-Core Architectures: Decoupling Abstract from Concrete Concurrency Models
The upcoming many-core architectures require software developers to exploit
concurrency to utilize available computational power. Today's high-level
language virtual machines (VMs), which are a cornerstone of software
development, do not provide sufficient abstraction for concurrency concepts. We
analyze concrete and abstract concurrency models and identify the challenges
they impose for VMs. To provide sufficient concurrency support in VMs, we
propose to integrate concurrency operations into VM instruction sets.
Since there will always be VMs optimized for special purposes, our goal is to
develop a methodology to design instruction sets with concurrency support.
Therefore, we also propose a list of trade-offs that have to be investigated to
advise the design of such instruction sets.
As a first experiment, we implemented one instruction set extension for
shared memory and one for non-shared memory concurrency. From our experimental
results, we derived a list of requirements for a full-grown experimental
environment for further research
Assessing load-sharing within optimistic simulation platforms
The advent of multi-core machines has lead to the need for revising the architecture of modern simulation platforms. One recent proposal we made attempted to explore the viability of load-sharing for optimistic simulators run on top of these types of machines. In this article, we provide an extensive experimental study for an assessment of the effects on run-time dynamics by a load-sharing architecture that has been implemented within the ROOT-Sim package, namely an open source simulation platform adhering to the optimistic synchronization paradigm. This experimental study is essentially aimed at evaluating possible sources of overheads when supporting load-sharing. It has been based on differentiated workloads allowing us to generate different execution profiles in terms of, e.g., granularity/locality of the simulation events. © 2012 IEEE
- …