Overview of Hydra: a concurrent language for synchronous digital circuit design
Hydra is a computer hardware description language that integrates several kinds of software tool (simulation, netlist generation and timing analysis) within a single circuit specification. The design language is inherently concurrent, and it offers black-box abstraction and general design patterns that simplify the design of circuits with regular structure. Hydra specifications are concise, allowing the complete design of a computer system as a digital circuit within a few pages. This paper discusses the motivations behind Hydra and illustrates the system with a significant portion of the design of a basic RISC processor.
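As a rough illustration of what a "general design pattern" for a regular circuit can look like, the following Python sketch builds a ripple-carry adder by chaining a black-box cell. It is not Hydra syntax: the names full_adder, ripple and add_words are hypothetical, and bits are modelled as plain booleans rather than clocked signals.

```python
from typing import Callable, List, Sequence, Tuple

Bit = bool
# A cell takes two data bits and a carry-in and returns (sum, carry-out).
Cell = Callable[[Bit, Bit, Bit], Tuple[Bit, Bit]]


def full_adder(a: Bit, b: Bit, cin: Bit) -> Tuple[Bit, Bit]:
    """Black-box building block: one-bit full adder."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple(cell: Cell, carry: Bit,
           pairs: Sequence[Tuple[Bit, Bit]]) -> Tuple[List[Bit], Bit]:
    """Design pattern (hypothetical name): chain `cell` along a word,
    feeding each carry-out into the next cell's carry-in."""
    outs: List[Bit] = []
    for a, b in pairs:
        s, carry = cell(a, b, carry)
        outs.append(s)
    return outs, carry


def add_words(xs: Sequence[Bit], ys: Sequence[Bit]) -> Tuple[List[Bit], Bit]:
    """Ripple-carry adder; both words are given least-significant bit first."""
    return ripple(full_adder, False, list(zip(xs, ys)))
```

The pattern is independent of the cell it chains, so the same ripple function would serve for any cell with a carry-like signal passed between neighbouring bit positions.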
A Parallel Adaptive P3M code with Hierarchical Particle Reordering
We discuss the design and implementation of HYDRA_OMP, a parallel
implementation of the Smoothed Particle Hydrodynamics-Adaptive P3M (SPH-AP3M)
code HYDRA. The code is designed primarily for conducting cosmological
hydrodynamic simulations and is written in Fortran77+OpenMP. A number of
optimizations for RISC processors and SMP-NUMA architectures have been
implemented, the most important optimization being hierarchical reordering of
particles within chaining cells, which greatly improves data locality, thereby
removing the cache misses typically associated with linked lists. Parallel
scaling is good, with a minimum parallel scaling of 73% achieved on 32 nodes
for a variety of modern SMP architectures. We give performance data in terms of
the number of particle updates per second, which is a more useful performance
metric than raw MFlops. A basic version of the code will be made available to
the community in the near future.
Comment: 34 pages, 12 figures, accepted for publication in Computer Physics
Communications
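The locality argument can be made concrete with a small sketch. The Python below is a flat, single-level version of the idea only and does not reproduce the hierarchical scheme of the paper: particles are sorted by chaining-cell index so that each cell's members are contiguous and can be addressed by offsets instead of a linked list. The function names and the unit-cube domain are assumptions made for illustration.

```python
from typing import List, Tuple

Particle = Tuple[float, float, float]  # position, assumed to lie in [0, 1)^3


def cell_index(p: Particle, n_cells: int) -> int:
    """Map a position to a flattened chaining-cell index."""
    ix, iy, iz = (min(int(c * n_cells), n_cells - 1) for c in p)
    return (ix * n_cells + iy) * n_cells + iz


def reorder_by_cell(particles: List[Particle],
                    n_cells: int) -> Tuple[List[Particle], List[int]]:
    """Sort particles so that members of one cell are stored contiguously.

    Returns the reordered list and an offset table: the particles of
    cell c are reordered[starts[c]:starts[c + 1]].
    """
    order = sorted(range(len(particles)),
                   key=lambda i: cell_index(particles[i], n_cells))
    reordered = [particles[i] for i in order]

    # Count particles per cell, then prefix-sum the counts into offsets.
    starts = [0] * (n_cells ** 3 + 1)
    for p in reordered:
        starts[cell_index(p, n_cells) + 1] += 1
    for c in range(n_cells ** 3):
        starts[c + 1] += starts[c]
    return reordered, starts
```

After reordering, a loop over one cell walks a contiguous slice of memory, which is the property the abstract credits with removing linked-list cache misses.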
Optimisation and parallelism in synchronous digital circuit simulators
Digital circuit simulation often requires a large amount of computation, resulting in long run times. We consider several techniques for optimising a brute-force synchronous circuit simulator: an algorithm using an event queue that avoids recalculating quiescent parts of the circuit, a marking algorithm that is similar to the event queue but that avoids a central data structure, and a lazy algorithm that avoids calculating signals whose values are not needed. Two target architectures for the simulator are used: a sequential CPU, and a parallel GPGPU. The interactions between the different optimisations are discussed, and the performance is measured while the algorithms are simulating a simple but realistic scalable circuit.
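To show the difference between brute-force and event-queue evaluation, here is a minimal Python sketch for the combinational logic of a single clock cycle. It is not the paper's implementation: the netlist representation and the names evaluate_all, build_fanout and settle are assumptions, and the gate dictionary is assumed to be acyclic and listed in topological order.

```python
from collections import deque
from typing import Callable, Dict, List, Tuple

# A gate is (function, input signal names); the dict key is its output signal.
Gate = Tuple[Callable[..., bool], List[str]]


def evaluate_all(gates: Dict[str, Gate], values: Dict[str, bool]) -> int:
    """Brute force: recompute every gate (dict assumed topologically ordered)."""
    for out, (fn, ins) in gates.items():
        values[out] = fn(*(values[s] for s in ins))
    return len(gates)


def build_fanout(gates: Dict[str, Gate]) -> Dict[str, List[str]]:
    """For each signal, list the gates that read it."""
    fanout: Dict[str, List[str]] = {}
    for out, (_, ins) in gates.items():
        for s in ins:
            fanout.setdefault(s, []).append(out)
    return fanout


def settle(gates: Dict[str, Gate], fanout: Dict[str, List[str]],
           values: Dict[str, bool], changed_inputs: Dict[str, bool]) -> int:
    """Event queue: re-evaluate only gates downstream of a changed signal.

    Returns the number of gate evaluations performed.
    """
    queue = deque()
    for name, v in changed_inputs.items():
        if values.get(name) != v:
            values[name] = v
            queue.extend(fanout.get(name, []))
    evaluations = 0
    while queue:
        out = queue.popleft()
        fn, ins = gates[out]
        new = fn(*(values[s] for s in ins))
        evaluations += 1
        if values[out] != new:  # only a changed output creates new events
            values[out] = new
            queue.extend(fanout.get(out, []))
    return evaluations
```

A call to evaluate_all is needed once to establish consistent initial values; after that, settle leaves quiescent gates untouched, at the cost of maintaining the central queue that the marking algorithm in the abstract is said to avoid.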
A holistic scalable implementation approach of the lattice Boltzmann method for CPU/GPU heterogeneous clusters
This is the author accepted manuscript. The final version is available from MDPI via the DOI in this record.
Heterogeneous clusters are a widely utilized class of supercomputers assembled from
different types of computing devices, for instance CPUs and GPUs, providing a huge computational
potential. Programming them in a scalable way exploiting the maximal performance introduces
numerous challenges such as optimizations for different computing devices, dealing with multiple
levels of parallelism, the application of different programming models, work distribution, and hiding
of communication with computation. We utilize the lattice Boltzmann method for fluid flow as
a representative of a scientific computing application and develop a holistic implementation for
large-scale CPU/GPU heterogeneous clusters. We review and combine a set of best practices and
techniques ranging from optimizations for the particular computing devices to the orchestration
of tens of thousands of CPU cores and thousands of GPUs. Finally, we arrive at
an implementation using all the available computational resources for the lattice Boltzmann
method operators. Our approach shows excellent scalability behavior making it future-proof for
heterogeneous clusters of the upcoming architectures on the exaFLOPS scale. Parallel efficiencies of
more than 90% are achieved, leading to 2,604.72 GLUPS utilizing 24,576 CPU cores and 2,048 GPUs of
the CPU/GPU heterogeneous cluster Piz Daint and computing more than 6.8 · 10^9
lattice cells.
This work was supported by the German Research Foundation (DFG) as part of the
Transregional Collaborative Research Centre “Invasive Computing” (SFB/TR 89). In addition, this work was
supported by a grant from the Swiss National Supercomputing Centre (CSCS) under project ID d68. We further
thank the Max Planck Computing & Data Facility (MPCDF) and the Global Scientific Information and Computing
Center (GSIC) for providing computational resources.