Distributed N-body Simulation on the Grid Using Dedicated Hardware
We present performance measurements of direct gravitational N-body
simulation on the grid, with and without specialized (GRAPE-6) hardware. Our
inter-continental virtual organization consists of three sites, one in Tokyo,
one in Philadelphia and one in Amsterdam. We run simulations with up to 196608
particles for a variety of topologies. In many cases, high performance
simulations over the entire planet are dominated by network bandwidth rather
than latency. With this global grid of GRAPEs our calculation time remains
dominated by communication over the entire range of N, which was limited due to
the use of three sites. Increasing the number of particles will result in a
more efficient execution. Based on these timings we construct and calibrate a
model to predict the performance of our simulation on any grid infrastructure
with or without GRAPE. We apply this model to predict the simulation
performance on the Netherlands DAS-3 wide area computer. Equipping the DAS-3
with GRAPE-6Af hardware would achieve break-even between calculation and
communication at a few million particles, resulting in a compute time of just
over ten hours for 1 N-body time unit. Key words: high-performance computing,
grid, N-body simulation, performance modelling. Comment: (in press) New Astronomy, 24 pages, 5 figure
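For readers unfamiliar with the term, the "direct" method mentioned in the abstract above sums the pairwise gravitational force between every pair of particles, which costs O(N^2) operations per step. The following is a minimal sketch of that summation in N-body units (G = 1). It is not the authors' code; the function name `direct_accelerations` and the optional `softening` parameter are assumptions made for illustration.

```python
import math


def direct_accelerations(positions, masses, softening=0.0, G=1.0):
    """Acceleration on every particle by direct O(N^2) pairwise summation.

    positions: list of (x, y, z) tuples; masses: list of floats.
    softening is an illustrative assumption, not taken from the abstract.
    """
    n = len(positions)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    eps2 = softening * softening
    for i in range(n):
        xi, yi, zi = positions[i]
        for j in range(n):
            if i == j:
                continue  # a particle exerts no force on itself
            dx = positions[j][0] - xi
            dy = positions[j][1] - yi
            dz = positions[j][2] - zi
            r2 = dx * dx + dy * dy + dz * dz + eps2
            inv_r3 = 1.0 / (r2 * math.sqrt(r2))
            # Newtonian attraction of particle j on particle i
            acc[i][0] += G * masses[j] * dx * inv_r3
            acc[i][1] += G * masses[j] * dy * inv_r3
            acc[i][2] += G * masses[j] * dz * inv_r3
    return acc


if __name__ == "__main__":
    # Two unit masses one length unit apart attract each other equally.
    print(direct_accelerations([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [1.0, 1.0]))
```

The inner loop over all other particles is the part that special-purpose hardware such as GRAPE is built to accelerate, while the exchange of particle data between sites is the communication cost the abstract refers to.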
4.45 Pflops Astrophysical N-Body Simulation on K computer -- The Gravitational Trillion-Body Problem
As an entry for the 2012 Gordon-Bell performance prize, we report performance
results of astrophysical N-body simulations of one trillion particles performed
on the full system of K computer. This is the first gravitational trillion-body
simulation in the world. We describe the scientific motivation, the numerical
algorithm, the parallelization strategy, and the performance analysis. Unlike
many previous Gordon-Bell prize winners that used the tree algorithm for
astrophysical N-body simulations, we used the hybrid TreePM method, which reaches
a similar level of accuracy: the short-range force is calculated by the tree
algorithm, and the long-range force is solved by the particle-mesh algorithm.
We developed a highly-tuned gravity kernel for short-range forces, and a novel
communication algorithm for long-range forces. The average performance on 24576
and 82944 nodes of K computer is 1.53 and 4.45 Pflops, respectively, corresponding to 49%
and 42% of the peak speed. Comment: 10 pages, 6 figures, Proceedings of Supercomputing 2012
(http://sc12.supercomputing.org/), Gordon Bell Prize Winner. Additional
information is http://www.ccs.tsukuba.ac.jp/CCS/eng/gbp201
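The short-range/long-range split behind a TreePM method can be illustrated with a small sketch. The abstract above does not state which cutoff function was used, so this example assumes the Gaussian (erfc-based) split found in several TreePM codes; the names `short_range_factor`, `long_range_factor`, `split_force`, and the split scale `r_split` are hypothetical.

```python
import math


def short_range_factor(r, r_split):
    """Fraction of the Newtonian force assigned to the tree (short-range) part.

    Assumes a Gaussian split: erfc(u) + (2u / sqrt(pi)) * exp(-u^2),
    with u = r / (2 * r_split). This choice is an assumption.
    """
    u = r / (2.0 * r_split)
    return math.erfc(u) + (2.0 * u / math.sqrt(math.pi)) * math.exp(-u * u)


def long_range_factor(r, r_split):
    """Complementary fraction, handled by the particle-mesh (long-range) part."""
    return 1.0 - short_range_factor(r, r_split)


def split_force(mass, r, r_split, G=1.0):
    """Return (short, long) force magnitudes; their sum is the Newtonian force."""
    newton = G * mass / (r * r)
    return newton * short_range_factor(r, r_split), newton * long_range_factor(r, r_split)


if __name__ == "__main__":
    for r in (0.1, 1.0, 5.0):
        print(r, split_force(1.0, r, r_split=1.0))
```

Because the short-range factor decays rapidly beyond a few times `r_split`, the tree walk only needs to visit nearby particles, and the smooth remainder is well suited to a mesh solver.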
Benchmarking Edge Computing Devices for Grape Bunches and Trunks Detection using Accelerated Object Detection Single Shot MultiBox Deep Learning Models
Purpose: Visual perception enables robots to perceive the environment. Visual
data is processed using computer vision algorithms that are usually
time-expensive and require powerful devices to process the visual data in
real-time, which is unfeasible for open-field robots with limited energy. This
work benchmarks the performance of different heterogeneous platforms for object
detection in real-time. This research benchmarks three architectures: embedded
GPU -- Graphical Processing Units (such as NVIDIA Jetson Nano 2 GB and 4 GB,
and NVIDIA Jetson TX2), TPU -- Tensor Processing Unit (such as Coral Dev Board
TPU), and DPU -- Deep Learning Processor Unit (such as in AMD-Xilinx ZCU104
Development Board, and AMD-Xilinx Kria KV260 Starter Kit). Method: The authors
used RetinaNet ResNet-50 fine-tuned on the natural VineSet dataset.
The trained model was then converted and compiled into target-specific hardware
formats to improve execution efficiency. Conclusions and Results: The
platforms were assessed in terms of performance of the evaluation metrics and
efficiency (time of inference). Graphical Processing Units (GPUs) were the
slowest devices, running at 3 FPS to 5 FPS, and Field Programmable Gate Arrays
(FPGAs) were the fastest devices, running at 14 FPS to 25 FPS. The efficiency
of the Tensor Processing Unit (TPU) is unremarkable and similar to that of the NVIDIA Jetson
TX2. TPU and GPU are the most power-efficient, consuming about 5 W. The
differences in the evaluation metrics across devices are negligible:
all devices reach an F1 of about 70% and a mean Average Precision (mAP) of
about 60%.
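The two kinds of figures reported in the abstract above (F1 and frames per second) follow from simple definitions. The sketch below computes precision, recall, and F1 from detection counts, and converts a per-frame inference time to FPS. The function names and the counts in the usage example are hypothetical and are not data from the study.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Return (precision, recall, F1) from detection counts."""
    detected = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / detected if detected else 0.0
    recall = true_pos / actual if actual else 0.0
    total = precision + recall
    # F1 is the harmonic mean of precision and recall
    f1 = 2.0 * precision * recall / total if total else 0.0
    return precision, recall, f1


def frames_per_second(inference_seconds):
    """Convert the time to process one frame into a frame rate."""
    return 1.0 / inference_seconds


if __name__ == "__main__":
    # Hypothetical counts chosen only to illustrate the arithmetic.
    print(precision_recall_f1(true_pos=70, false_pos=30, false_neg=30))
    print(frames_per_second(0.04))
```

A device that needs 40 ms per frame therefore runs at about 25 FPS, the upper end of the FPGA range quoted above.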
GADGET: A code for collisionless and gasdynamical cosmological simulations
We describe the newly written code GADGET which is suitable both for
cosmological simulations of structure formation and for the simulation of
interacting galaxies. GADGET evolves self-gravitating collisionless fluids with
the traditional N-body approach, and a collisional gas by smoothed particle
hydrodynamics. Along with the serial version of the code, we discuss a parallel
version that has been designed to run on massively parallel supercomputers with
distributed memory. While both versions use a tree algorithm to compute
gravitational forces, the serial version of GADGET can optionally employ the
special-purpose hardware GRAPE instead of the tree. Periodic boundary
conditions are supported by means of an Ewald summation technique. The code
uses individual and adaptive timesteps for all particles, and it combines this
with a scheme for dynamic tree updates. Due to its Lagrangian nature, GADGET
thus allows a very large dynamic range to be bridged, both in space and time.
So far, GADGET has been successfully used to run simulations with up to 7.5e7
particles, including cosmological studies of large-scale structure formation,
high-resolution simulations of the formation of clusters of galaxies, as well
as workstation-sized problems of interacting galaxies. In this study, we detail
the numerical algorithms employed, and show various tests of the code. We
publicly release both the serial and the massively parallel version of the
code. Comment: 32 pages, 14 figures, replaced to match published version in New
Astronomy. For download of the code, see
http://www.mpa-garching.mpg.de/gadget (new version 1.1 available
Scaling Hierarchical N-body Simulations on GPU Clusters
This paper focuses on the use of GPGPU-based clusters for hierarchical N-body simulations. Whereas the behavior of these hierarchical methods has been studied in the past on CPU-based architectures, we investigate key performance issues in the context of clusters of GPUs. These include kernel organization and efficiency, the balance between tree traversal and force computation work, grain size selection through the tuning of offloaded work request sizes, and the reduction of sequential bottlenecks. The effects of various application parameters are studied and experiments done to quantify gains in performance. Our studies are carried out in the context of a production-quality parallel cosmological simulator called ChaNGa. We highlight the re-engineering of the application to make it more suitable for GPU-based environments. Finally, we present performance results from experiments on the NCSA Lincoln GPU cluster, including a note on GPU use in multistepped simulations.
The gravitational billion body problem : Het miljard deeltjes probleem
The increased availability of accelerator technology in modern supercomputers forces users to redesign their algorithms. These accelerators are specifically designed to offer huge amounts of parallel compute power. In this thesis I show how to harness the power of these parallel processors for astrophysical simulations. I start with an introduction that presents the developments in astrophysical algorithms and the hardware used from the 1960s until today. In the following scientific chapters I discuss the use of GPU accelerator technology for direct N-body methods and for the more advanced hierarchical algorithms. These advanced algorithms are more complex to implement on large parallel architectures, but by redesigning the algorithms it is possible to take advantage of the GPU. The developed algorithms are applied to simulate galaxy mergers to explain discrepancies in observational results. In the simulations we test different merger configurations and try to match the results with observational data. The final chapter shows how to scale the developed software code to thousands of GPUs as available in the Titan supercomputer. The algorithms developed and presented in this thesis allow astronomers to take advantage of the new GPU technology and thereby run simulations that contain a thousand times more particles than was possible before.
3D segmentation and localization using visual cues in uncontrolled environments
3D scene understanding is an important area in robotics, autonomous vehicles, and virtual reality. The goal of scene understanding is to recognize and localize all the objects around the agent. This is done through semantic segmentation and depth estimation. Current approaches focus on improving the robustness of each task but fail to make them efficient enough for real-time use. This thesis presents four efficient methods for scene understanding that work in real environments. The methods also aim to provide a solution for 2D and 3D data.
The first approach presents a pipeline that combines the block matching algorithm for disparity estimation, an encoder-decoder neural network for semantic segmentation, and a refinement step that uses both outputs to complete the regions that were not labelled or did not have any disparity assigned to them. This method provides accurate results in 3D reconstruction and morphology estimation of complex structures like rose bushes. Due to the lack of datasets of rose bushes and their segmentation, we also made three large datasets. Two of them have real roses that were manually labelled, and the third one was created using a scene modeler and 3D rendering software. The last dataset aims to capture diversity, realism and obtain different types of labelling.
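The block matching step named in the first approach can be sketched on a single image row: for a pixel in the left image, slide a small window along the same row of the right image and keep the shift (disparity) with the lowest sum of absolute differences (SAD). This is a generic illustration under stated assumptions, not the thesis pipeline; the function names, the SAD cost, and the toy rows are hypothetical.

```python
def sad(a, b):
    """Sum of absolute differences between two equally sized windows."""
    return sum(abs(p - q) for p, q in zip(a, b))


def best_disparity(left_row, right_row, x, half_window=1, max_disparity=4):
    """Return (disparity, cost) for pixel x of the left row.

    A feature at left position x is searched at right position x - d.
    """
    lo, hi = x - half_window, x + half_window + 1
    if lo < 0 or hi > len(left_row):
        raise ValueError("window does not fit inside the row")
    reference = left_row[lo:hi]
    best_d, best_cost = 0, None
    for d in range(max_disparity + 1):
        if lo - d < 0:
            break  # candidate window would leave the image
        cost = sad(reference, right_row[lo - d:hi - d])
        if best_cost is None or cost < best_cost:
            best_d, best_cost = d, cost
    return best_d, best_cost


if __name__ == "__main__":
    # Toy rows: the same three-pixel feature appears two pixels further left
    # in the right image.
    left = [0, 0, 0, 0, 0, 9, 5, 7, 0, 0, 0, 0]
    right = [0, 0, 0, 9, 5, 7, 0, 0, 0, 0, 0, 0]
    print(best_disparity(left, right, x=6))
```

In textureless regions many shifts give the same cost, which leaves pixels without a reliable disparity; filling such regions is what the refinement step described above addresses.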
The second contribution provides a strategy for real-time rose pruning using visual servoing of a robotic arm and our previous approach. Current methods obtain the structure of the plant and plan the cutting trajectory using only a global planner and assume a constant background. Our method works in real environments and uses visual feedback to refine the location of the cutting targets and modify the planned trajectory. The proposed visual servoing allows the robot to reach the cutting points 94% of the time. This is an improvement compared to only using a global planner without visual feedback, which reaches the targets 50% of the time. To the best of our knowledge, this is the first robot able to prune a complete rose bush in a natural environment.
Recent deep learning image segmentation and disparity estimation networks provide accurate results. However, most of these methods are computationally expensive, which makes them impractical for real-time tasks. Our third contribution uses multi-task learning to learn the image segmentation and disparity estimation together end-to-end. The experiments show that our network has at most 1/3 of the parameters of the state-of-the-art of each individual task and still provides competitive results.
The last contribution explores the area of scene understanding using 3D data. Recent approaches use point-based networks to do point cloud segmentation and find local relations between points using only the latent features provided by the network, omitting the geometric information from the point clouds. Our approach aggregates the geometric information into the network. Given that the geometric and latent features are different, our network also uses a two-headed attention mechanism to do local aggregation at the latent and geometric level. This additional information helps the network to obtain a more accurate semantic segmentation, in real point cloud data, using fewer parameters than current methods. Overall, the method achieves state-of-the-art segmentation on the real-world S3DIS dataset with 69.2% and competitive results on the ModelNet40 and ShapeNetPart datasets.