7,635 research outputs found
Real-Time Marker Level Set on GPU
International audienceLevel set methods have been extensively used to track the dynamical interfaces between different materials for physically based simulation, geometry modeling, oceanic modeling and other scientific and engineering applications. Due to the inherent Eulerian characteristics, interface evolution based on level set usually suffers from numerical diffusion, sharp feature missing and mass loss. Although some effective methods such as Particle Level Set (PLS) and Marker Level Set (MLS) have been proposed to tackle these difficulties, the complicated correction process and the high computational cost pose severe limitations for real-time applications. In this paper we provide an efficient parallel implementation of the Marker Level Set method on latest graphics hardware. Each step of the MLS method is fully mapped on GPU with an innovative combination of different computation techniques. Relying on GPU's parallelism and flexible programmability, the method provides real-time performance for large size 2D examples and moderate 3D examples, which is significantly faster than previous CPU based methods
Evaluation of GPU/CPU Co-Processing Models for JPEG 2000 Packetization
With the bottom-line goal of increasing the
throughput of a GPU-accelerated JPEG 2000 encoder, this paper
evaluates whether the post-compression rate control and
packetization routines should be carried out on the CPU or on
the GPU. Three co-processing models that differ in how the
workload is split among the CPU and GPU are introduced. Both
routines are discussed and algorithms for executing them in
parallel are presented. Experimental results for compressing a
detail-rich UHD sequence to 4 bits/sample indicate speed-ups of
200x for the rate control and 100x for the packetization
compared to the single-threaded implementation in the
commercial Kakadu library. These two routines executed on the
CPU take 4x as long as all remaining coding steps on the GPU
and therefore present a bottleneck. Even if the CPU bottleneck
could be avoided with multi-threading, it is still beneficial to
execute all coding steps on the GPU as this minimizes the
required device-to-host transfer and thereby speeds up the
critical path from 17.2 fps to 19.5 fps for 4 bits/sample and to
22.4 fps for 0.16 bits/sample
Efficient Irregular Wavefront Propagation Algorithms on Hybrid CPU-GPU Machines
In this paper, we address the problem of efficient execution of a computation
pattern, referred to here as the irregular wavefront propagation pattern
(IWPP), on hybrid systems with multiple CPUs and GPUs. The IWPP is common in
several image processing operations. In the IWPP, data elements in the
wavefront propagate waves to their neighboring elements on a grid if a
propagation condition is satisfied. Elements receiving the propagated waves
become part of the wavefront. This pattern results in irregular data accesses
and computations. We develop and evaluate strategies for efficient computation
and propagation of wavefronts using a multi-level queue structure. This queue
structure improves the utilization of fast memories in a GPU and reduces
synchronization overheads. We also develop a tile-based parallelization
strategy to support execution on multiple CPUs and GPUs. We evaluate our
approaches on a state-of-the-art GPU accelerated machine (equipped with 3 GPUs
and 2 multicore CPUs) using the IWPP implementations of two widely used image
processing operations: morphological reconstruction and euclidean distance
transform. Our results show significant performance improvements on GPUs. The
use of multiple CPUs and GPUs cooperatively attains speedups of 50x and 85x
with respect to single core CPU executions for morphological reconstruction and
euclidean distance transform, respectively.Comment: 37 pages, 16 figure
A Reduction of the Elastic Net to Support Vector Machines with an Application to GPU Computing
The past years have witnessed many dedicated open-source projects that built
and maintain implementations of Support Vector Machines (SVM), parallelized for
GPU, multi-core CPUs and distributed systems. Up to this point, no comparable
effort has been made to parallelize the Elastic Net, despite its popularity in
many high impact applications, including genetics, neuroscience and systems
biology. The first contribution in this paper is of theoretical nature. We
establish a tight link between two seemingly different algorithms and prove
that Elastic Net regression can be reduced to SVM with squared hinge loss
classification. Our second contribution is to derive a practical algorithm
based on this reduction. The reduction enables us to utilize prior efforts in
speeding up and parallelizing SVMs to obtain a highly optimized and parallel
solver for the Elastic Net and Lasso. With a simple wrapper, consisting of only
11 lines of MATLAB code, we obtain an Elastic Net implementation that naturally
utilizes GPU and multi-core CPUs. We demonstrate on twelve real world data
sets, that our algorithm yields identical results as the popular (and highly
optimized) glmnet implementation but is one or several orders of magnitude
faster.Comment: 10 page
Remote-scope Promotion: Clarified, Rectified, and Verified
Modern accelerator programming frameworks, such as OpenCL, organise threads into work-groups. Remote-scope promotion (RSP) is a language extension recently proposed by AMD researchers that is designed to enable applications, for the first time, both to optimise for the common case of intra-work-group communication (using memory scopes to provide consistency only within a work-group) and to allow occasional inter-work-group communication (as required, for instance, to support the popular load-balancing idiom of work stealing). We present the first formal, axiomatic memory model of OpenCL extended with RSP. We have extended the Herd memory model simulator with support for OpenCL kernels that exploit RSP, and used it to discover bugs in several litmus tests and a work-stealing queue, that have been used previously in the study of RSP. We have also formalised the proposed GPU implementation of RSP. The formalisation process allowed us to identify bugs in the description of RSP that could result in well-synchronised programs experiencing memory inconsistencies. We present and prove sound a new implementation of RSP that incorporates bug fixes and requires less non-standard hardware than the original implementation. This work, a collaboration between academia and industry, clearly demonstrates how, when designing hardware support for a new concurrent language feature, the early application of formal tools and techniques can help to prevent errors, such as those we have found, from making it into silicon
FlightGoggles: A Modular Framework for Photorealistic Camera, Exteroceptive Sensor, and Dynamics Simulation
FlightGoggles is a photorealistic sensor simulator for perception-driven
robotic vehicles. The key contributions of FlightGoggles are twofold. First,
FlightGoggles provides photorealistic exteroceptive sensor simulation using
graphics assets generated with photogrammetry. Second, it provides the ability
to combine (i) synthetic exteroceptive measurements generated in silico in real
time and (ii) vehicle dynamics and proprioceptive measurements generated in
motio by vehicle(s) in a motion-capture facility. FlightGoggles is capable of
simulating a virtual-reality environment around autonomous vehicle(s). While a
vehicle is in flight in the FlightGoggles virtual reality environment,
exteroceptive sensors are rendered synthetically in real time while all complex
extrinsic dynamics are generated organically through the natural interactions
of the vehicle. The FlightGoggles framework allows for researchers to
accelerate development by circumventing the need to estimate complex and
hard-to-model interactions such as aerodynamics, motor mechanics, battery
electrochemistry, and behavior of other agents. The ability to perform
vehicle-in-the-loop experiments with photorealistic exteroceptive sensor
simulation facilitates novel research directions involving, e.g., fast and
agile autonomous flight in obstacle-rich environments, safe human interaction,
and flexible sensor selection. FlightGoggles has been utilized as the main test
for selecting nine teams that will advance in the AlphaPilot autonomous drone
racing challenge. We survey approaches and results from the top AlphaPilot
teams, which may be of independent interest.Comment: Initial version appeared at IROS 2019. Supplementary material can be
found at https://flightgoggles.mit.edu. Revision includes description of new
FlightGoggles features, such as a photogrammetric model of the MIT Stata
Center, new rendering settings, and a Python AP
A GPU based real-time software correlation system for the Murchison Widefield Array prototype
Modern graphics processing units (GPUs) are inexpensive commodity hardware
that offer Tflop/s theoretical computing capacity. GPUs are well suited to many
compute-intensive tasks including digital signal processing.
We describe the implementation and performance of a GPU-based digital
correlator for radio astronomy. The correlator is implemented using the NVIDIA
CUDA development environment. We evaluate three design options on two
generations of NVIDIA hardware. The different designs utilize the internal
registers, shared memory and multiprocessors in different ways. We find that
optimal performance is achieved with the design that minimizes global memory
reads on recent generations of hardware.
The GPU-based correlator outperforms a single-threaded CPU equivalent by a
factor of 60 for a 32 antenna array, and runs on commodity PC hardware. The
extra compute capability provided by the GPU maximises the correlation
capability of a PC while retaining the fast development time associated with
using standard hardware, networking and programming languages. In this way, a
GPU-based correlation system represents a middle ground in design space between
high performance, custom built hardware and pure CPU-based software
correlation.
The correlator was deployed at the Murchison Widefield Array 32 antenna
prototype system where it ran in real-time for extended periods. We briefly
describe the data capture, streaming and correlation system for the prototype
array.Comment: 11 pages, to appear in PAS
Speed/accuracy trade-offs for modern convolutional object detectors
The goal of this paper is to serve as a guide for selecting a detection
architecture that achieves the right speed/memory/accuracy balance for a given
application and platform. To this end, we investigate various ways to trade
accuracy for speed and memory usage in modern convolutional object detection
systems. A number of successful systems have been proposed in recent years, but
apples-to-apples comparisons are difficult due to different base feature
extractors (e.g., VGG, Residual Networks), different default image resolutions,
as well as different hardware and software platforms. We present a unified
implementation of the Faster R-CNN [Ren et al., 2015], R-FCN [Dai et al., 2016]
and SSD [Liu et al., 2015] systems, which we view as "meta-architectures" and
trace out the speed/accuracy trade-off curve created by using alternative
feature extractors and varying other critical parameters such as image size
within each of these meta-architectures. On one extreme end of this spectrum
where speed and memory are critical, we present a detector that achieves real
time speeds and can be deployed on a mobile device. On the opposite end in
which accuracy is critical, we present a detector that achieves
state-of-the-art performance measured on the COCO detection task.Comment: Accepted to CVPR 201
- …