51,907 research outputs found
Developing EfïŹcient Discrete Simulations on Multicore and GPU Architectures
In this paper we show how to efïŹciently implement parallel discrete simulations on multicoreandGPUarchitecturesthrougharealexampleofanapplication: acellularautomatamodel of laser dynamics. We describe the techniques employed to build and optimize the implementations using OpenMP and CUDA frameworks. We have evaluated the performance on two different hardware platforms that represent different target market segments: high-end platforms for scientiïŹc computing, using an Intel Xeon Platinum 8259CL server with 48 cores, and also an NVIDIA Tesla V100GPU,bothrunningonAmazonWebServer(AWS)Cloud;and on a consumer-oriented platform, using an Intel Core i9 9900k CPU and an NVIDIA GeForce GTX 1050 TI GPU. Performance results were compared and analyzed in detail. We show that excellent performance and scalability can be obtained in both platforms, and we extract some important issues that imply a performance degradation for them. We also found that current multicore CPUs with large core numbers can bring a performance very near to that of GPUs, and even identical in some cases.Ministerio de EconomĂa, Industria y Competitividad, Gobierno de España (MINECO), and the Agencia Estatal de InvestigaciĂłn (AEI) of Spain, coïŹnanced by FEDER funds (EU) TIN2017-89842
GeantV: Results from the prototype of concurrent vector particle transport simulation in HEP
Full detector simulation was among the largest CPU consumer in all CERN
experiment software stacks for the first two runs of the Large Hadron Collider
(LHC). In the early 2010's, the projections were that simulation demands would
scale linearly with luminosity increase, compensated only partially by an
increase of computing resources. The extension of fast simulation approaches to
more use cases, covering a larger fraction of the simulation budget, is only
part of the solution due to intrinsic precision limitations. The remainder
corresponds to speeding-up the simulation software by several factors, which is
out of reach using simple optimizations on the current code base. In this
context, the GeantV R&D project was launched, aiming to redesign the legacy
particle transport codes in order to make them benefit from fine-grained
parallelism features such as vectorization, but also from increased code and
data locality. This paper presents extensively the results and achievements of
this R&D, as well as the conclusions and lessons learnt from the beta
prototype.Comment: 34 pages, 26 figures, 24 table
Quantum Monte Carlo for large chemical systems: Implementing efficient strategies for petascale platforms and beyond
Various strategies to implement efficiently QMC simulations for large
chemical systems are presented. These include: i.) the introduction of an
efficient algorithm to calculate the computationally expensive Slater matrices.
This novel scheme is based on the use of the highly localized character of
atomic Gaussian basis functions (not the molecular orbitals as usually done),
ii.) the possibility of keeping the memory footprint minimal, iii.) the
important enhancement of single-core performance when efficient optimization
tools are employed, and iv.) the definition of a universal, dynamic,
fault-tolerant, and load-balanced computational framework adapted to all kinds
of computational platforms (massively parallel machines, clusters, or
distributed grids). These strategies have been implemented in the QMC=Chem code
developed at Toulouse and illustrated with numerical applications on small
peptides of increasing sizes (158, 434, 1056 and 1731 electrons). Using 10k-80k
computing cores of the Curie machine (GENCI-TGCC-CEA, France) QMC=Chem has been
shown to be capable of running at the petascale level, thus demonstrating that
for this machine a large part of the peak performance can be achieved.
Implementation of large-scale QMC simulations for future exascale platforms
with a comparable level of efficiency is expected to be feasible
Parallel implementation of the TRANSIMS micro-simulation
This paper describes the parallel implementation of the TRANSIMS traffic
micro-simulation. The parallelization method is domain decomposition, which
means that each CPU of the parallel computer is responsible for a different
geographical area of the simulated region. We describe how information between
domains is exchanged, and how the transportation network graph is partitioned.
An adaptive scheme is used to optimize load balancing. We then demonstrate how
computing speeds of our parallel micro-simulations can be systematically
predicted once the scenario and the computer architecture are known. This makes
it possible, for example, to decide if a certain study is feasible with a
certain computing budget, and how to invest that budget. The main ingredients
of the prediction are knowledge about the parallel implementation of the
micro-simulation, knowledge about the characteristics of the partitioning of
the transportation network graph, and knowledge about the interaction of these
quantities with the computer system. In particular, we investigate the
differences between switched and non-switched topologies, and the effects of 10
Mbit, 100 Mbit, and Gbit Ethernet. keywords: Traffic simulation, parallel
computing, transportation planning, TRANSIM
Energy Detection UWB Receiver Design using a Multi-resolution VHDL-AMS Description
Ultra Wide Band (UWB) impulse radio systems are appealing for location-aware applications. There is a growing interest in the design of UWB transceivers with reduced complexity and power consumption. Non-coherent approaches for the design of the receiver based on energy detection schemes seem suitable to this aim and have been adopted in the project the preliminary results of which are reported in this paper. The objective is the design of a UWB receiver with a top-down methodology, starting from Matlab-like models and refining the description down to the final transistor level. This goal will be achieved with an integrated use of VHDL for the digital blocks and VHDL-AMS for the mixed-signal and analog circuits. Coherent results are obtained using VHDL-AMS and Matlab. However, the CPU time cost strongly depends on the description used in the VHDL-AMS models. In order to show the functionality of the UWB architecture, the receiver most critical functions are simulated showing results in good agreement with the expectations
Harvesting graphics power for MD simulations
We discuss an implementation of molecular dynamics (MD) simulations on a
graphic processing unit (GPU) in the NVIDIA CUDA language. We tested our code
on a modern GPU, the NVIDIA GeForce 8800 GTX. Results for two MD algorithms
suitable for short-ranged and long-ranged interactions, and a congruential
shift random number generator are presented. The performance of the GPU's is
compared to their main processor counterpart. We achieve speedups of up to 80,
40 and 150 fold, respectively. With newest generation of GPU's one can run
standard MD simulations at 10^7 flops/$.Comment: 12 pages, 5 figures. Submitted to Mol. Si
- âŠ