44 research outputs found
A Comparison of some recent Task-based Parallel Programming Models
The need for parallel programming models that are simple to use and at the same time efficient on current and future parallel platforms has led to recent attention to task-based models such as Cilk++, Intel TBB, and the task concept in OpenMP version 3.0. The choice of model and implementation can have a major impact on final performance. To understand some of the trade-offs, we have made a quantitative study comparing four implementations of OpenMP (gcc, Intel icc, Sun Studio, and the research compiler Mercurium/nanos mcc), Cilk++, and Wool, a high-performance task-based library developed at SICS.
We use microbenchmarks to characterize the costs of task creation and stealing, and the Barcelona OpenMP Tasks Suite to characterize application performance. Wool and Cilk++ have by far the lowest overhead for both spawning and stealing tasks. This is reflected in application performance when many tasks of small granularity are spawned, where Cilk++ and Wool in particular achieve the highest performance. For coarse-granularity applications, the OpenMP implementations perform quite similarly to the more lightweight Cilk++ and Wool, except for one application where mcc is superior thanks to a better task scheduler. The OpenMP implementations are generally not yet ready for use when the task granularity becomes very small. There is no inherent reason for this, so we expect future implementations of OpenMP to address this issue.
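The granularity trade-off described above can be sketched in a short Python example (illustrative only, not from the paper's benchmarks): a parallel sum is split into chunks, and the chunk size controls how many tasks the runtime must create and schedule. Smaller chunks mean more tasks, and hence more task-creation and stealing overhead of the kind the microbenchmarks measure. The function name and parameters are hypothetical.

```python
# Illustrative sketch of task granularity: one task per chunk of work.
# A smaller chunk_size creates more tasks, so per-task overhead dominates;
# a larger chunk_size creates fewer, coarser tasks.
from concurrent.futures import ThreadPoolExecutor

def parallel_sum_squares(n, chunk_size, workers=4):
    """Sum i*i for i in [0, n), submitting one task per chunk."""
    def chunk_sum(lo, hi):
        return sum(i * i for i in range(lo, hi))

    ranges = [(lo, min(lo + chunk_size, n)) for lo in range(0, n, chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(chunk_sum, lo, hi) for lo, hi in ranges]
        # Return the result and the number of tasks that were created.
        return sum(f.result() for f in futures), len(futures)

total, num_tasks = parallel_sum_squares(10_000, chunk_size=100)
```

With `chunk_size=100` this spawns 100 tasks; with `chunk_size=5000` it would spawn only 2. In a task-stealing runtime such as Wool or Cilk++, the per-task spawn/steal cost determines how small `chunk_size` can get before overhead outweighs the parallelism.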
The Mentality of the Eastern Slavs and Sociocultural Aspects of Integration Processes in the Belarusian-Russian-Ukrainian Borderland
The article presents the results of a study of the mental characteristics of Russians, Ukrainians, and Belarusians. A high degree of closeness between self-assessments and mutual assessments of their characterological traits is recorded. It is concluded that the kinship of these peoples creates favorable preconditions for the development of integration processes in the Russian-Belarusian-Ukrainian borderland.
tf-Darshan: Understanding Fine-grained I/O Performance in Machine Learning Workloads
Machine Learning applications on HPC systems have been gaining popularity in
recent years. The upcoming large scale systems will offer tremendous
parallelism for training through GPUs. However, another heavy aspect of Machine
Learning is I/O, and this can potentially be a performance bottleneck.
TensorFlow, one of the most popular Deep-Learning platforms, now offers a new
profiler interface and allows instrumentation of TensorFlow operations.
However, the current profiler only enables analysis at the TensorFlow platform
level and does not provide system-level information. In this paper, we extend
TensorFlow Profiler and introduce tf-Darshan, both a profiler and tracer, that
performs instrumentation through Darshan. We use the same Darshan shared
instrumentation library and implement a runtime attachment without using a
system preload. We can extract Darshan profiling data structures during
TensorFlow execution to enable analysis through the TensorFlow profiler. We
visualize the performance results through TensorBoard, the web-based TensorFlow
visualization tool. At the same time, we do not alter Darshan's existing
implementation. We illustrate tf-Darshan by performing two case studies on
ImageNet image and Malware classification. We show that by guiding optimization
using data from tf-Darshan, we increase POSIX I/O bandwidth by up to 19% by
selecting data for staging on fast tier storage. We also show that Darshan has
the potential of being used as a runtime library for profiling and providing
information for future optimization.
Comment: Accepted for publication at the 2020 International Conference on Cluster Computing (CLUSTER 2020).
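The kind of per-file POSIX I/O statistics the abstract describes (bytes moved, call counts, time spent in I/O) can be sketched with a minimal Python wrapper around file reads. This is an illustration of the concept only, not tf-Darshan's or Darshan's actual implementation; the class and method names are hypothetical.

```python
# Illustrative sketch: collecting per-read POSIX I/O statistics,
# loosely analogous to the counters a Darshan instrumentation
# library records for each file.
import os
import tempfile
import time

class IOProfile:
    def __init__(self):
        self.bytes_read = 0      # total bytes returned by read()
        self.read_calls = 0      # number of non-empty read() calls
        self.read_seconds = 0.0  # wall time spent inside read()

    def read_file(self, path, block_size=4096):
        data = bytearray()
        with open(path, "rb") as f:
            while True:
                t0 = time.perf_counter()
                chunk = f.read(block_size)
                self.read_seconds += time.perf_counter() - t0
                if not chunk:
                    break
                self.read_calls += 1
                self.bytes_read += len(chunk)
                data.extend(chunk)
        return bytes(data)

# Example: profile reading a small temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 10_000)
prof = IOProfile()
payload = prof.read_file(tmp.name, block_size=4096)
os.unlink(tmp.name)
```

Aggregating such counters per file and per rank is what lets a tool report effective bandwidth (bytes / seconds) and identify files worth staging on faster storage, as the paper does for its ImageNet and malware datasets.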
Optimization of Tensor-product Operations in Nekbone on GPUs
In the CFD solver Nek5000, the computation is dominated by the evaluation of
small tensor operations. Nekbone is a proxy app for Nek5000 and has previously
been ported to GPUs with a mixed OpenACC and CUDA approach. In this work, we
continue this effort and optimize the main tensor-product operation in Nekbone
further. Our optimization is done in CUDA and uses a different, 2D, thread
structure to make the computations layer by layer. This enables us to use loop
unrolling as well as utilize registers and shared memory efficiently. Our
implementation is then compared on both the Pascal and Volta GPU architectures
to previous GPU versions of Nekbone as well as a measured roofline. The results
show that our implementation outperforms previous GPU Nekbone implementations
by 6-10%. Compared to the measured roofline, we obtain 77-92% of the peak
performance for both Nvidia P100 and V100 GPUs for inputs with 1024-4096
elements and polynomial degree 9.
Comment: 4 pages, 4 figures.
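The small tensor-product operation at the core of Nekbone can be written out explicitly: a 1D operator A of size (p+1) x (p+1) is applied along one dimension of a (p+1)^3 element field U. The plain-Python sketch below (illustrative, not the paper's CUDA kernel) shows the access pattern; in the optimized GPU version, each (j, k) pair in the outer loops corresponds to one thread in the 2D thread structure, processing the element "layer by layer".

```python
# Illustrative sketch of the tensor-product kernel:
#   V[i][j][k] = sum_l A[i][l] * U[l][j][k]
# i.e. the 1D operator A applied along the x-dimension of U.

def apply_along_x(A, U):
    n = len(A)  # n = p + 1 for polynomial degree p
    V = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for j in range(n):          # in the 2D CUDA layout, one thread
        for k in range(n):      # handles each (j, k) column
            for i in range(n):
                s = 0.0
                for l in range(n):
                    s += A[i][l] * U[l][j][k]
                V[i][j][k] = s
    return V
```

Because A and a layer of U are small, they fit in registers and shared memory, which is why loop unrolling and on-chip reuse pay off in the CUDA implementation.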
sputniPIC: an Implicit Particle-in-Cell Code for Multi-GPU Systems
Large-scale simulations of plasmas are essential for advancing our
understanding of fusion devices, space, and astrophysical systems.
Particle-in-Cell (PIC) codes have demonstrated their success in simulating
numerous plasma phenomena on HPC systems. Today, flagship supercomputers
feature multiple GPUs per compute node to achieve unprecedented computing power
at high power efficiency. PIC codes require new algorithm design and
implementation for exploiting such accelerated platforms. In this work, we
design and optimize a three-dimensional implicit PIC code, called sputniPIC, to
run on a general multi-GPU compute node. We introduce a particle decomposition
data layout, in contrast to domain decomposition on CPU-based implementations,
to use particle batches for overlapping communication and computation on GPUs.
sputniPIC also natively supports different precision representations to achieve
speed up on hardware that supports reduced precision. We validate sputniPIC
through the well-known GEM challenge and provide performance analysis. We test
sputniPIC on three multi-GPU platforms and report a 200-800x performance
improvement with respect to the sputniPIC CPU OpenMP version performance. We
show that reduced precision could further improve performance by 45% to 80% on
the three platforms. Because of these performance improvements, on a single
node with multiple GPUs, sputniPIC enables large-scale three-dimensional PIC
simulations that were previously only possible on clusters.
Comment: Accepted for publication at the 32nd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2020).
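The particle-decomposition idea can be sketched in a few lines of Python (illustrative only, not sputniPIC's code): instead of partitioning the simulation domain, the particle array is split into fixed-size batches, so that on a GPU the transfer of batch b+1 can overlap with the computation on batch b. A trivial position update stands in for the full implicit particle mover, and all names are hypothetical.

```python
# Illustrative sketch of particle-decomposition batching: particles are
# processed in fixed-size batches. On a GPU, copying one batch to the
# device can overlap with pushing the previous batch; here the batches
# are simply processed in sequence.

def push_batched(positions, velocities, dt, batch_size):
    new_positions = []
    for start in range(0, len(positions), batch_size):
        batch_x = positions[start:start + batch_size]  # "transfer" this batch
        batch_v = velocities[start:start + batch_size]
        # "compute": push the batch (a stand-in for the implicit mover)
        new_positions.extend(x + v * dt for x, v in zip(batch_x, batch_v))
    return new_positions
```

Because each batch is independent, the batch size can also be tuned to the device memory, and each batch's arithmetic can be done in reduced precision where the hardware supports it, which is the source of the additional 45-80% speedup the paper reports.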