Exploration of Optimization Options for Increasing Performance of a GPU Implementation of a Three-dimensional Bilateral Filter
This report explores using GPUs as a platform for performing high-performance medical image data processing, specifically smoothing using a 3D bilateral filter, which performs anisotropic, edge-preserving smoothing. The algorithm consists of running a specialized 3D convolution kernel over a source volume to produce an output volume. Overall, our objective is to understand what algorithmic design choices and configuration options lead to optimal performance of this algorithm on the GPU. We explore the performance impact of using different memory access patterns, of using different types of device/on-chip memories, of using strictly aligned and unaligned memory, and of varying the size/shape of thread blocks. Our results reveal optimal configuration parameters for our algorithm when executed on a sample 3D medical data set, and show performance gains ranging from 30x to over 200x compared to a single-threaded CPU implementation.
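As a rough illustration of the computation being accelerated, the following is a minimal sketch of a single-threaded CPU reference for a 3D bilateral filter; the volume layout, window radius, and parameter names are our own assumptions, not the authors' GPU code.

#include <cmath>
#include <cstddef>
#include <vector>

// Minimal single-threaded 3D bilateral filter sketch (illustrative only).
// vol: input volume of size nx*ny*nz, linearized as x + nx*(y + ny*z).
// r: half-width of the filter window; sigmaS/sigmaR: spatial and range widths.
std::vector<float> bilateral3d(const std::vector<float>& vol,
                               int nx, int ny, int nz,
                               int r, float sigmaS, float sigmaR) {
    std::vector<float> out(vol.size());
    auto idx = [&](int x, int y, int z) { return x + nx * (y + (std::size_t)ny * z); };
    for (int z = 0; z < nz; ++z)
    for (int y = 0; y < ny; ++y)
    for (int x = 0; x < nx; ++x) {
        float center = vol[idx(x, y, z)];
        float sum = 0.0f, wsum = 0.0f;
        for (int dz = -r; dz <= r; ++dz)
        for (int dy = -r; dy <= r; ++dy)
        for (int dx = -r; dx <= r; ++dx) {
            int xx = x + dx, yy = y + dy, zz = z + dz;
            if (xx < 0 || yy < 0 || zz < 0 || xx >= nx || yy >= ny || zz >= nz) continue;
            float v = vol[idx(xx, yy, zz)];
            float ds2 = float(dx*dx + dy*dy + dz*dz);                // spatial distance^2
            float dr  = v - center;                                  // intensity difference
            float w = std::exp(-ds2 / (2.0f * sigmaS * sigmaS)       // spatial weight
                               - dr*dr / (2.0f * sigmaR * sigmaR));  // range weight -> edge preserving
            sum  += w * v;
            wsum += w;
        }
        out[idx(x, y, z)] = sum / wsum;
    }
    return out;
}

A GPU version maps the three outer loops onto thread blocks; the paper's study concerns how that mapping, the memory types used, and the block shape affect performance.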
Towards a Scalable In Situ Fast Fourier Transform
The Fast Fourier Transform (FFT) is a numerical operation that transforms a function into a form composed of its constituent frequencies and is an integral part of scientific computation and data analysis. The objective of our work is to enable use of the FFT as part of a scientific in situ processing chain to facilitate the analysis of data in the spectral regime. We describe the implementation of an FFT endpoint for the transformation of multi-dimensional data within the SENSEI infrastructure. Our results show its use on a sample problem in the context of a multi-stage in situ processing workflow. Comment: 5 pages, 2 figures. Submitted to the ISAV workshop at the SC23 conference.
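To make the definitional part concrete, here is a minimal sketch of the underlying discrete transform itself; this is not the SENSEI endpoint, and the function and variable names are ours.

#include <cmath>
#include <complex>
#include <vector>

// Naive O(N^2) discrete Fourier transform: expresses the sampled signal x as
// the amplitudes of its constituent frequencies. An FFT computes the same
// result in O(N log N), which is what makes in situ use at scale practical.
std::vector<std::complex<double>> dft(const std::vector<std::complex<double>>& x) {
    const std::size_t n = x.size();
    const double pi = 3.14159265358979323846;
    std::vector<std::complex<double>> X(n);
    for (std::size_t k = 0; k < n; ++k) {
        std::complex<double> acc(0.0, 0.0);
        for (std::size_t j = 0; j < n; ++j) {
            double angle = -2.0 * pi * double(k) * double(j) / double(n);
            acc += x[j] * std::complex<double>(std::cos(angle), std::sin(angle));
        }
        X[k] = acc;
    }
    return X;
}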
DPP-PMRF: Rethinking Optimization for a Probabilistic Graphical Model Using Data-Parallel Primitives
We present a new parallel algorithm for probabilistic graphical model optimization. The algorithm relies on data-parallel primitives (DPPs), which provide portable performance across hardware architectures. We evaluate results on CPUs and GPUs for an image segmentation problem. Compared to a serial baseline, we observe runtime speedups of up to 13X (CPU) and 44X (GPU). We also compare our performance to a reference, OpenMP-based algorithm, and find speedups of up to 7X (CPU). Comment: LDAV 2018, October 2018.
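As a loose illustration of what programming against data-parallel primitives looks like, the sketch below uses standard C++ parallel algorithms rather than the paper's VTK-m code; the thresholding example and names are ours.

#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>

// Express work as "map" and "reduce" primitives; the runtime can then schedule
// the per-element work across CPU cores (or, in a DPP library such as VTK-m,
// across GPU threads) without changing the algorithm's expression.
int countForeground(const std::vector<float>& pixels, float threshold) {
    std::vector<int> labels(pixels.size());
    // map: threshold every pixel independently
    std::transform(std::execution::par, pixels.begin(), pixels.end(), labels.begin(),
                   [threshold](float p) { return p > threshold ? 1 : 0; });
    // reduce: sum the labels to count foreground pixels
    return std::reduce(std::execution::par, labels.begin(), labels.end(), 0);
}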
Improving Performance of M-to-N Processing and Data Redistribution in In Transit Analysis and Visualization
In an in transit setting, a parallel data producer, such as a numerical simulation, runs on one set of M ranks, while a data consumer, such as a parallel visualization application, runs on a different set of N ranks. One of the central challenges in this in transit setting is to determine the mapping of data from the set of M producer ranks to the set of N consumer ranks. This is a challenging problem for several reasons, such as the producer and consumer codes potentially having different scaling characteristics and different data models. The resulting mapping from M to N ranks can have a significant impact on aggregate application performance. In this work, we present an approach for performing this M-to-N mapping in a way that has broad applicability across a diversity of data producer and consumer applications. We evaluate its design and performance with a study that runs at high concurrency on a modern HPC platform. By leveraging design characteristics that facilitate an “intelligent” mapping from M to N, we observe that significant performance gains are possible in terms of several different metrics, including time-to-solution and amount of data moved.
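As an illustrative baseline for the mapping problem, a contiguous block assignment is sketched below; this is not the "intelligent" mapping evaluated in the paper, and the names are ours.

// Assign each of M producer ranks to one of N consumer ranks (assuming M >= N)
// by grouping contiguous producers; each consumer then receives data from
// roughly M/N producers. Topology- or data-aware mappings refine this idea to
// reduce data movement and balance load.
int consumerForProducer(int producerRank, int M, int N) {
    int perConsumer = (M + N - 1) / N;  // ceil(M / N)
    return producerRank / perConsumer;  // contiguous block assignment
}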
Hybrid Parallelism for Volume Rendering on Large, Multi- and Many-core Systems
With the computing industry trending towards multi- and many-core processors, we study how a standard visualization algorithm, ray-casting volume rendering, can benefit from a hybrid parallelism approach. Hybrid parallelism provides the best of both worlds: using distributed-memory parallelism across a large number of nodes increases available FLOPs and memory, while exploiting shared-memory parallelism among the cores within each node ensures that each node performs its portion of the larger calculation as efficiently as possible. We demonstrate results from weak and strong scaling studies, at levels of concurrency ranging up to 216,000, and with datasets as large as 12.2 trillion cells. The greatest benefit from hybrid parallelism lies in the communication portion of the algorithm, the dominant cost at higher levels of concurrency. We show that reducing the number of participants with a hybrid approach significantly improves performance.
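A minimal hybrid-parallel skeleton is sketched below to show the pattern of distributed-memory parallelism across nodes combined with shared-memory parallelism within each node; it is a generic MPI+OpenMP example, not the authors' ray-casting code.

#include <mpi.h>
#include <omp.h>
#include <cstdio>

int main(int argc, char** argv) {
    int provided = 0;
    // One MPI rank per node (distributed memory); OpenMP threads within the node (shared memory).
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    int rank = 0, nranks = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double localResult = 0.0;
    #pragma omp parallel for reduction(+ : localResult)
    for (int i = 0; i < 1000000; ++i) {
        localResult += 1.0;  // stand-in for per-node rendering work
    }

    // Communication involves only the (fewer) MPI ranks, not every core,
    // which is where the hybrid approach helps at high concurrency.
    double globalResult = 0.0;
    MPI_Reduce(&localResult, &globalResult, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) std::printf("total = %f across %d ranks\n", globalResult, nranks);
    MPI_Finalize();
    return 0;
}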
Interactive, Internet Delivery of Visualization via Structured, Prerendered Multiresolution Imagery
One of the fundamental problems in remote visualization -- where I/O and data-intensive visualization activities take place at a centrally located supercomputer center and resulting imagery is delivered to a remotely located user -- is reduced interactivity resulting from the combination of high network latency and relatively low network bandwidth. This research project has produced a novel approach for latency-tolerant delivery of visualization and rendering results where client-side frame rate display performance is independent of source dataset size, image size, visualization technique or rendering complexity. As such, it is a suitable solution for remote visualization image delivery for any visualization or rendering application that can generate image frames in an ordered fashion. This new capability is suitable for use in addressing many of ASCR's remote visualization needs, particularly deployment at open computing facilities to provide remote visualization capabilities to teams of scientific researchers.
Performance Analysis of Traditional and Data-Parallel Primitive Implementations of Visualization and Analysis Kernels
Measurements of absolute runtime are useful as a summary of performance when studying parallel visualization and analysis methods on computational platforms of increasing concurrency and complexity. We can obtain even more insight by measuring and examining more detailed measures from hardware performance counters, such as the number of instructions executed by an algorithm implemented in a particular way, the amount of data moved to/from memory, memory hierarchy utilization levels via cache hit/miss ratios, and so forth. This work focuses on performance analysis, on modern multi-core platforms, of three different visualization and analysis kernels that are implemented in two different ways: one "traditional", using combinations of C++ and VTK, and the other data-parallel, using VTK-m. Our performance study consists of measurement and reporting of several different hardware performance counters on two different multi-core CPU platforms. The results reveal interesting performance differences between these two approaches for implementing these kernels, differences that would not be apparent using runtime as the only metric.
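A hedged sketch of how such counters can be collected is given below, using the PAPI library's low-level API as one possible route; the event choices, the measured loop, and the omitted error handling are illustrative, not the paper's measurement harness.

#include <papi.h>
#include <cstdio>

// Count instructions and L2 cache misses around a region of interest.
// Assumes PAPI is installed and these preset events are available on the CPU.
void measureRegion() {
    int eventSet = PAPI_NULL;
    long long counts[2] = {0, 0};
    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_create_eventset(&eventSet);
    PAPI_add_event(eventSet, PAPI_TOT_INS);  // total instructions executed
    PAPI_add_event(eventSet, PAPI_L2_TCM);   // L2 total cache misses
    PAPI_start(eventSet);
    volatile double sink = 0.0;
    for (int i = 0; i < 1000000; ++i) sink += i * 0.5;  // stand-in for the kernel under study
    PAPI_stop(eventSet, counts);
    std::printf("instructions=%lld  L2 misses=%lld\n", counts[0], counts[1]);
}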
Federal Market Information Technology in the Post Flash Crash Era: Roles for Supercomputing
This paper describes collaborative work between active traders, regulators, economists, and supercomputing researchers to replicate and extend investigations of the Flash Crash and other market anomalies in a National Laboratory HPC environment. Our work suggests that supercomputing tools and methods will be valuable to market regulators in achieving the goal of market safety, stability, and security. Research results using high-frequency data and analytics are described, and directions for future development are discussed. Currently, the key mechanisms for preventing catastrophic market action are “circuit breakers.” We believe a more graduated approach, similar to the “yellow light” approach in motorsports to slow down traffic, might be a better way to achieve the same goal. To enable this objective, we study a number of indicators that could foresee hazards in market conditions and explore options to confirm such predictions. Our tests confirm that Volume Synchronized Probability of Informed Trading (VPIN) and a version of the volume Herfindahl-Hirschman Index (HHI) for measuring market fragmentation can indeed give strong signals ahead of the Flash Crash event on May 6, 2010. This is a preliminary step toward a full-fledged early-warning system for unusual market conditions.
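For reference, the Herfindahl-Hirschman fragmentation measure mentioned above is simple to state; the sketch below computes a generic HHI over per-venue volume shares and is not the authors' exact variant.

#include <vector>

// Herfindahl-Hirschman Index over per-venue trading volumes:
// HHI = sum over venues of (venue's share of total volume)^2.
// Values near 1 mean volume is concentrated at one venue; values near 1/n
// mean volume is fragmented evenly across n venues.
double volumeHHI(const std::vector<double>& venueVolumes) {
    double total = 0.0;
    for (double v : venueVolumes) total += v;
    if (total <= 0.0) return 0.0;
    double hhi = 0.0;
    for (double v : venueVolumes) {
        double share = v / total;
        hhi += share * share;
    }
    return hhi;
}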
Deploying Web-based Visual Exploration Tools on the Grid
We discuss a web-based portal for the exploration, encapsulation, and dissemination of visualization results over the Grid. This portal integrates three components: an interface client for structured visualization exploration, a visualization web application to manage the generation and capture of the visualization results, and a centralized portal application server to access and manage grid resources. Our approach uses standard web technologies to make the system accessible with minimal user setup. We demonstrate the usefulness of the developed system using an example for Adaptive Mesh Refinement (AMR) data visualization.
Bin-Hash Indexing: A Parallel Method for Fast Query Processing
This paper presents a new parallel indexing data structure for answering queries. The index, called Bin-Hash, offers extremely high levels of concurrency, and is therefore well-suited for emerging commodity parallel processors, such as multi-core CPUs, Cell processors, and general-purpose graphics processing units (GPUs). The Bin-Hash approach first bins the base data, and then partitions and separately stores the values in each bin as a perfect spatial hash table. To answer a query, we first determine whether or not a record satisfies the query conditions based on the bin boundaries. For the bins with records that cannot be resolved, we examine the spatial hash tables. The procedures for examining the bin numbers and the spatial hash tables offer the maximum possible level of concurrency; all records can be evaluated by our procedure independently in parallel. Additionally, our Bin-Hash procedures access much smaller amounts of data than similar parallel methods, such as the projection index. This smaller data footprint is critical for certain parallel processors, like GPUs, where memory resources are limited. To demonstrate the effectiveness of Bin-Hash, we implement it on a GPU using the data-parallel programming language CUDA. The concurrency offered by the Bin-Hash index allows us to fully utilize the GPU's massive parallelism in our work; over 12,000 records can be evaluated simultaneously at any one time. We show that our new query processing method is an order of magnitude faster than current state-of-the-art CPU-based indexing technologies. Additionally, we compare our performance to existing GPU-based projection index strategies.
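A rough sketch of the two-phase query evaluation described above appears below, simplified to a one-dimensional range query with uniform bins; the data structures and names are ours (a plain per-bin vector stands in for the perfect spatial hash tables), not the paper's CUDA implementation.

#include <cstddef>
#include <vector>

// Phase 1: use bin boundaries alone to resolve records whose bin lies entirely
// inside or outside the query range [lo, hi).
// Phase 2: only records in "boundary" bins need their exact values, which the
// Bin-Hash index stores per bin; each bin (and each record within it) can be
// processed independently, which is the source of the index's concurrency.
struct BinHash1D {
    double minVal, binWidth;                     // uniform bins over the value domain
    std::vector<std::vector<double>> binValues;  // exact values per bin (stand-in for hash tables)
};

long countInRange(const BinHash1D& idx, double lo, double hi) {
    long count = 0;
    for (std::size_t b = 0; b < idx.binValues.size(); ++b) {
        double binLo = idx.minVal + b * idx.binWidth;
        double binHi = binLo + idx.binWidth;
        if (binLo >= lo && binHi <= hi) {
            count += (long)idx.binValues[b].size();  // bin fully inside: resolved by boundaries
        } else if (binHi <= lo || binLo >= hi) {
            // bin fully outside: nothing to do
        } else {
            for (double v : idx.binValues[b])        // boundary bin: consult exact values
                if (v >= lo && v < hi) ++count;
        }
    }
    return count;
}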