Exploration of Optimization Options for Increasing Performance of a GPU Implementation of a Three-dimensional Bilateral Filter
This report explores using GPUs as a platform for performing high-performance medical image data processing, specifically smoothing using a 3D bilateral filter, which performs anisotropic, edge-preserving smoothing. The algorithm consists of running a specialized 3D convolution kernel over a source volume to produce an output volume. Overall, our objective is to understand what algorithmic design choices and configuration options lead to optimal performance of this algorithm on the GPU. We explore the performance impact of using different memory access patterns, of using different types of device/on-chip memories, of using strictly aligned and unaligned memory, and of varying the size/shape of thread blocks. Our results reveal optimal configuration parameters for our algorithm when executed on a sample 3D medical data set, and show performance gains ranging from 30x to over 200x compared to a single-threaded CPU implementation.
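As a rough illustration of the computation being accelerated, the following is a minimal sketch of a single-threaded CPU reference for a 3D bilateral filter; the volume layout, window radius, and parameter names are our own assumptions, not the authors' GPU code.

#include <cmath>
#include <cstddef>
#include <vector>

// Minimal single-threaded 3D bilateral filter sketch (illustrative only).
// vol: input volume of size nx*ny*nz, linearized as x + nx*(y + ny*z).
// r: half-width of the filter window; sigmaS/sigmaR: spatial and range widths.
std::vector<float> bilateral3d(const std::vector<float>& vol,
                               int nx, int ny, int nz,
                               int r, float sigmaS, float sigmaR) {
    std::vector<float> out(vol.size());
    auto idx = [&](int x, int y, int z) { return x + nx * (y + (std::size_t)ny * z); };
    for (int z = 0; z < nz; ++z)
    for (int y = 0; y < ny; ++y)
    for (int x = 0; x < nx; ++x) {
        float center = vol[idx(x, y, z)];
        float sum = 0.0f, wsum = 0.0f;
        for (int dz = -r; dz <= r; ++dz)
        for (int dy = -r; dy <= r; ++dy)
        for (int dx = -r; dx <= r; ++dx) {
            int xx = x + dx, yy = y + dy, zz = z + dz;
            if (xx < 0 || yy < 0 || zz < 0 || xx >= nx || yy >= ny || zz >= nz) continue;
            float v = vol[idx(xx, yy, zz)];
            float ds2 = float(dx*dx + dy*dy + dz*dz);                // spatial distance^2
            float dr  = v - center;                                  // intensity difference
            float w = std::exp(-ds2 / (2.0f * sigmaS * sigmaS)       // spatial weight
                               - dr*dr / (2.0f * sigmaR * sigmaR));  // range weight -> edge preserving
            sum  += w * v;
            wsum += w;
        }
        out[idx(x, y, z)] = sum / wsum;
    }
    return out;
}

A GPU version maps the three outer loops onto thread blocks; the paper's study concerns how that mapping, the memory types used, and the block shape affect performance.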
Towards a Scalable In Situ Fast Fourier Transform
The Fast Fourier Transform (FFT) is a numerical operation that transforms a function into a form composed of its constituent frequencies and is an integral part of scientific computation and data analysis. The objective of our work is to enable use of the FFT as part of a scientific in situ processing chain to facilitate the analysis of data in the spectral regime. We describe the implementation of an FFT endpoint for the transformation of multi-dimensional data within the SENSEI infrastructure. Our results show its use on a sample problem in the context of a multi-stage in situ processing workflow. Comment: 5 pages, 2 figures. Submitted to the ISAV workshop at the SC23 conference.
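To make the definitional part concrete, here is a minimal sketch of the underlying discrete transform itself; this is not the SENSEI endpoint, and the function and variable names are ours.

#include <cmath>
#include <complex>
#include <vector>

// Naive O(N^2) discrete Fourier transform: expresses the sampled signal x as
// the amplitudes of its constituent frequencies. An FFT computes the same
// result in O(N log N), which is what makes in situ use at scale practical.
std::vector<std::complex<double>> dft(const std::vector<std::complex<double>>& x) {
    const std::size_t n = x.size();
    const double pi = 3.14159265358979323846;
    std::vector<std::complex<double>> X(n);
    for (std::size_t k = 0; k < n; ++k) {
        std::complex<double> acc(0.0, 0.0);
        for (std::size_t j = 0; j < n; ++j) {
            double angle = -2.0 * pi * double(k) * double(j) / double(n);
            acc += x[j] * std::complex<double>(std::cos(angle), std::sin(angle));
        }
        X[k] = acc;
    }
    return X;
}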
DPP-PMRF: Rethinking Optimization for a Probabilistic Graphical Model Using Data-Parallel Primitives
We present a new parallel algorithm for probabilistic graphical model optimization. The algorithm relies on data-parallel primitives (DPPs), which provide portable performance across hardware architectures. We evaluate results on CPUs and GPUs for an image segmentation problem. Compared to a serial baseline, we observe runtime speedups of up to 13X (CPU) and 44X (GPU). We also compare our performance to a reference, OpenMP-based algorithm, and find speedups of up to 7X (CPU). Comment: LDAV 2018, October 2018.
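As a loose illustration of what programming against data-parallel primitives looks like, the sketch below uses standard C++ parallel algorithms rather than the paper's VTK-m code; the thresholding example and names are ours.

#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>

// Express work as "map" and "reduce" primitives; the runtime can then schedule
// the per-element work across CPU cores (or, in a DPP library such as VTK-m,
// across GPU threads) without changing the algorithm's expression.
int countForeground(const std::vector<float>& pixels, float threshold) {
    std::vector<int> labels(pixels.size());
    // map: threshold every pixel independently
    std::transform(std::execution::par, pixels.begin(), pixels.end(), labels.begin(),
                   [threshold](float p) { return p > threshold ? 1 : 0; });
    // reduce: sum the labels to count foreground pixels
    return std::reduce(std::execution::par, labels.begin(), labels.end(), 0);
}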
Improving Performance of M-to-N Processing and Data Redistribution in In Transit Analysis and Visualization
In an in transit setting, a parallel data producer, such as a numerical simulation, runs on one set of M ranks, while a data consumer, such as a parallel visualization application, runs on a different set of N ranks. One of the central challenges in this in transit setting is to determine the mapping of data from the set of M producer ranks to the set of N consumer ranks. This is a challenging problem for several reasons, such as the producer and consumer codes potentially having different scaling characteristics and different data models. The resulting mapping from M to N ranks can have a significant impact on aggregate application performance. In this work, we present an approach for performing this M-to-N mapping in a way that has broad applicability across a diversity of data producer and consumer applications. We evaluate its design and performance with a study that runs at high concurrency on a modern HPC platform. By leveraging design characteristics that facilitate an “intelligent” mapping from M to N, we observe that significant performance gains are possible in terms of several different metrics, including time-to-solution and amount of data moved.
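As an illustrative baseline for the mapping problem, a contiguous block assignment is sketched below; this is not the "intelligent" mapping evaluated in the paper, and the names are ours.

// Assign each of M producer ranks to one of N consumer ranks (assuming M >= N)
// by grouping contiguous producers; each consumer then receives data from
// roughly M/N producers. Topology- or data-aware mappings refine this idea to
// reduce data movement and balance load.
int consumerForProducer(int producerRank, int M, int N) {
    int perConsumer = (M + N - 1) / N;  // ceil(M / N)
    return producerRank / perConsumer;  // contiguous block assignment
}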
Hybrid Parallelism for Volume Rendering on Large, Multi- and Many-core Systems
With the computing industry trending towards multi- and many-core processors, we study how a standard visualization algorithm, ray-casting volume rendering, can benefit from a hybrid parallelism approach. Hybrid parallelism provides the best of both worlds: using distributed-memory parallelism across a large number of nodes increases available FLOPs and memory, while exploiting shared-memory parallelism among the cores within each node ensures that each node performs its portion of the larger calculation as efficiently as possible. We demonstrate results from weak and strong scaling studies, at levels of concurrency ranging up to 216,000, and with datasets as large as 12.2 trillion cells. The greatest benefit from hybrid parallelism lies in the communication portion of the algorithm, the dominant cost at higher levels of concurrency. We show that reducing the number of participants with a hybrid approach significantly improves performance.
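A minimal hybrid-parallel skeleton is sketched below to show the pattern of distributed-memory parallelism across nodes combined with shared-memory parallelism within each node; it is a generic MPI+OpenMP example, not the authors' ray-casting code.

#include <mpi.h>
#include <omp.h>
#include <cstdio>

int main(int argc, char** argv) {
    int provided = 0;
    // One MPI rank per node (distributed memory); OpenMP threads within the node (shared memory).
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    int rank = 0, nranks = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double localResult = 0.0;
    #pragma omp parallel for reduction(+ : localResult)
    for (int i = 0; i < 1000000; ++i) {
        localResult += 1.0;  // stand-in for per-node rendering work
    }

    // Communication involves only the (fewer) MPI ranks, not every core,
    // which is where the hybrid approach helps at high concurrency.
    double globalResult = 0.0;
    MPI_Reduce(&localResult, &globalResult, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) std::printf("total = %f across %d ranks\n", globalResult, nranks);
    MPI_Finalize();
    return 0;
}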
Interactive, Internet Delivery of Visualization via Structured, Prerendered Multiresolution Imagery
One of the fundamental problems in remote visualization -- where I/O and data-intensive visualization activities take place at a centrally located supercomputer center and resulting imagery is delivered to a remotely located user -- is reduced interactivity resulting from the combination of high network latency and relatively low network bandwidth. This research project has produced a novel approach for latency-tolerant delivery of visualization and rendering results where client-side frame rate display performance is independent of source dataset size, image size, visualization technique or rendering complexity. As such, it is a suitable solution for remote visualization image delivery for any visualization or rendering application that can generate image frames in an ordered fashion. This new capability is suitable for use in addressing many of ASCR's remote visualization needs, particularly deployment at open computing facilities to provide remote visualization capabilities to teams of scientific researchers.
Performance Analysis of Traditional and Data-Parallel Primitive Implementations of Visualization and Analysis Kernels
Measurements of absolute runtime are useful as a summary of performance when studying parallel visualization and analysis methods on computational platforms of increasing concurrency and complexity. We can obtain even more insight by measuring and examining more detailed measures from hardware performance counters, such as the number of instructions executed by an algorithm implemented in a particular way, the amount of data moved to/from memory, memory hierarchy utilization levels via cache hit/miss ratios, and so forth. This work focuses on performance analysis, on modern multi-core platforms, of three different visualization and analysis kernels that are implemented in two different ways: one "traditional", using combinations of C++ and VTK, and the other data-parallel, using VTK-m. Our performance study consists of measurement and reporting of several different hardware performance counters on two different multi-core CPU platforms. The results reveal interesting performance differences between these two approaches for implementing these kernels, differences that would not be apparent using runtime as the only metric.
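A hedged sketch of how such counters can be collected is given below, using the PAPI library's low-level API as one possible route; the event choices, the measured loop, and the omitted error handling are illustrative, not the paper's measurement harness.

#include <papi.h>
#include <cstdio>

// Count instructions and L2 cache misses around a region of interest.
// Assumes PAPI is installed and these preset events are available on the CPU.
void measureRegion() {
    int eventSet = PAPI_NULL;
    long long counts[2] = {0, 0};
    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_create_eventset(&eventSet);
    PAPI_add_event(eventSet, PAPI_TOT_INS);  // total instructions executed
    PAPI_add_event(eventSet, PAPI_L2_TCM);   // L2 total cache misses
    PAPI_start(eventSet);
    volatile double sink = 0.0;
    for (int i = 0; i < 1000000; ++i) sink += i * 0.5;  // stand-in for the kernel under study
    PAPI_stop(eventSet, counts);
    std::printf("instructions=%lld  L2 misses=%lld\n", counts[0], counts[1]);
}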
Federal Market Information Technology in the Post Flash Crash Era: Roles for Supercomputing
This paper describes collaborative work between active traders, regulators, economists, and supercomputing researchers to replicate and extend investigations of the Flash Crash and other market anomalies in a National Laboratory HPC environment. Our work suggests that supercomputing tools and methods will be valuable to market regulators in achieving the goal of market safety, stability, and security. Research results using high-frequency data and analytics are described, and directions for future development are discussed. Currently, the key mechanisms for preventing catastrophic market action are “circuit breakers.” We believe a more graduated approach, similar to the “yellow light” approach in motorsports to slow down traffic, might be a better way to achieve the same goal. To enable this objective, we study a number of indicators that could foresee hazards in market conditions and explore options to confirm such predictions. Our tests confirm that Volume Synchronized Probability of Informed Trading (VPIN) and a version of the volume Herfindahl-Hirschman Index (HHI) for measuring market fragmentation can indeed give strong signals ahead of the Flash Crash event on May 6, 2010. This is a preliminary step toward a full-fledged early-warning system for unusual market conditions.
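For reference, the Herfindahl-Hirschman fragmentation measure mentioned above is simple to state; the sketch below computes a generic HHI over per-venue volume shares and is not the authors' exact variant.

#include <vector>

// Herfindahl-Hirschman Index over per-venue trading volumes:
// HHI = sum over venues of (venue's share of total volume)^2.
// Values near 1 mean volume is concentrated at one venue; values near 1/n
// mean volume is fragmented evenly across n venues.
double volumeHHI(const std::vector<double>& venueVolumes) {
    double total = 0.0;
    for (double v : venueVolumes) total += v;
    if (total <= 0.0) return 0.0;
    double hhi = 0.0;
    for (double v : venueVolumes) {
        double share = v / total;
        hhi += share * share;
    }
    return hhi;
}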
Deploying Web-based Visual Exploration Tools on the Grid
We discuss a web-based portal for the exploration, encapsulation, and dissemination of visualization results over the Grid. This portal integrates three components: an interface client for structured visualization exploration, a visualization web application to manage the generation and capture of the visualization results, and a centralized portal application server to access and manage grid resources. Our approach uses standard web technologies to make the system accessible with minimal user setup. We demonstrate the usefulness of the developed system using an example for Adaptive Mesh Refinement (AMR) data visualization.
Bin-Hash Indexing: A Parallel Method for Fast Query Processing
This paper presents a new parallel indexing data structure for answering queries. The index, called Bin-Hash, offers extremely high levels of concurrency, and is therefore well-suited for emerging commodity parallel processors, such as multi-core CPUs, Cell processors, and general-purpose graphics processing units (GPUs). The Bin-Hash approach first bins the base data, and then partitions and separately stores the values in each bin as a perfect spatial hash table. To answer a query, we first determine whether or not a record satisfies the query conditions based on the bin boundaries. For the bins with records that cannot be resolved, we examine the spatial hash tables. The procedures for examining the bin numbers and the spatial hash tables offer the maximum possible level of concurrency; all records can be evaluated by our procedure independently in parallel. Additionally, our Bin-Hash procedures access much smaller amounts of data than similar parallel methods, such as the projection index. This smaller data footprint is critical for certain parallel processors, like GPUs, where memory resources are limited. To demonstrate the effectiveness of Bin-Hash, we implement it on a GPU using the data-parallel programming language CUDA. The concurrency offered by the Bin-Hash index allows us to fully utilize the GPU's massive parallelism in our work; over 12,000 records can be evaluated simultaneously at any one time. We show that our new query processing method is an order of magnitude faster than current state-of-the-art CPU-based indexing technologies. Additionally, we compare our performance to existing GPU-based projection index strategies.
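A rough sketch of the two-phase query evaluation described above appears below, simplified to a one-dimensional range query with uniform bins; the data structures and names are ours (a plain per-bin vector stands in for the perfect spatial hash tables), not the paper's CUDA implementation.

#include <cstddef>
#include <vector>

// Phase 1: use bin boundaries alone to resolve records whose bin lies entirely
// inside or outside the query range [lo, hi).
// Phase 2: only records in "boundary" bins need their exact values, which the
// Bin-Hash index stores per bin; each bin (and each record within it) can be
// processed independently, which is the source of the index's concurrency.
struct BinHash1D {
    double minVal, binWidth;                     // uniform bins over the value domain
    std::vector<std::vector<double>> binValues;  // exact values per bin (stand-in for hash tables)
};

long countInRange(const BinHash1D& idx, double lo, double hi) {
    long count = 0;
    for (std::size_t b = 0; b < idx.binValues.size(); ++b) {
        double binLo = idx.minVal + b * idx.binWidth;
        double binHi = binLo + idx.binWidth;
        if (binLo >= lo && binHi <= hi) {
            count += (long)idx.binValues[b].size();  // bin fully inside: resolved by boundaries
        } else if (binHi <= lo || binLo >= hi) {
            // bin fully outside: nothing to do
        } else {
            for (double v : idx.binValues[b])        // boundary bin: consult exact values
                if (v >= lo && v < hi) ++count;
        }
    }
    return count;
}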