184 research outputs found
Exploration of Optimization Options for Increasing Performance of a GPU Implementation of a Three-dimensional Bilateral Filter
This report explores using GPUs as a platform for performing high performance medical image data processing, specifically smoothing using a 3D bilateral filter, which performs anisotropic, edge-preserving smoothing. The algorithm consists of running a specialized 3D convolution kernel over a source volume to produce an output volume. Overall, our objective is to understand what algorithmic design choices and configuration options lead to optimal performance of this algorithm on the GPU. We explore the performance impact of using different memory access patterns, of using different types of device/on-chip memories, of using strictly aligned and unaligned memory, and of varying the size/shape of thread blocks. Our results reveal optimal configuration parameters for our algorithm when executed on a sample 3D medical data set, and show performance gains ranging from 30x to over 200x as compared to a single-threaded CPU implementation.
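As a rough illustration of the kind of kernel this abstract describes, the following is a minimal single-threaded sketch of a 3D bilateral filter in plain Python. It is not the report's GPU implementation: the function name, the Gaussian form of both weights, and the parameters `radius`, `sigma_s`, and `sigma_r` are assumptions chosen for illustration, and none of the memory-layout or thread-block choices studied in the report are represented.

```python
import math

def bilateral_filter_3d(vol, radius=1, sigma_s=1.0, sigma_r=0.1):
    """Naive 3D bilateral filter over a nested-list volume vol[z][y][x].

    Hypothetical reference sketch: each output voxel is a weighted average
    of its neighbors, where the weight is the product of a spatial Gaussian
    (distance in voxels) and a range Gaussian (difference in intensity).
    """
    nz, ny, nx = len(vol), len(vol[0]), len(vol[0][0])
    out = [[[0.0] * nx for _ in range(ny)] for _ in range(nz)]
    offsets = range(-radius, radius + 1)
    for z in range(nz):
        for y in range(ny):
            for x in range(nx):
                center = vol[z][y][x]
                num = 0.0
                den = 0.0
                for dz in offsets:
                    for dy in offsets:
                        for dx in offsets:
                            zz, yy, xx = z + dz, y + dy, x + dx
                            # Skip neighbors that fall outside the volume.
                            if not (0 <= zz < nz and 0 <= yy < ny and 0 <= xx < nx):
                                continue
                            v = vol[zz][yy][xx]
                            d2 = dz * dz + dy * dy + dx * dx
                            w_s = math.exp(-d2 / (2.0 * sigma_s ** 2))
                            w_r = math.exp(-((v - center) ** 2) / (2.0 * sigma_r ** 2))
                            num += w_s * w_r * v
                            den += w_s * w_r
                # den > 0 always: the center voxel contributes weight 1.
                out[z][y][x] = num / den
    return out
```

Because the range weight collapses toward zero across a large intensity jump, voxels on either side of a sharp edge barely influence each other, which is the edge-preserving behavior the abstract refers to.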
DPP-PMRF: Rethinking Optimization for a Probabilistic Graphical Model Using Data-Parallel Primitives
We present a new parallel algorithm for probabilistic graphical model
optimization. The algorithm relies on data-parallel primitives (DPPs), which
provide portable performance over hardware architecture. We evaluate results on
CPUs and GPUs for an image segmentation problem. Compared to a serial baseline,
we observe runtime speedups of up to 13X (CPU) and 44X (GPU). We also compare
our performance to a reference, OpenMP-based algorithm, and find speedups of up
to 7X (CPU).
Comment: LDAV 2018, October 201
Recommended from our members
Hybrid Parallelism for Volume Rendering on Large, Multi- and Many-core Systems
With the computing industry trending towards multi- and many-core processors, we study how a standard visualization algorithm, ray-casting volume rendering, can benefit from a hybrid parallelism approach. Hybrid parallelism provides the best of both worlds: using distributed-memory parallelism across a large number of nodes increases available FLOPs and memory, while exploiting shared-memory parallelism among the cores within each node ensures that each node performs its portion of the larger calculation as efficiently as possible. We demonstrate results from weak and strong scaling studies, at levels of concurrency ranging up to 216,000, and with datasets as large as 12.2 trillion cells. The greatest benefit from hybrid parallelism lies in the communication portion of the algorithm, the dominant cost at higher levels of concurrency. We show that reducing the number of participants with a hybrid approach significantly improves performance.
Interactive, Internet Delivery of Visualization via Structured, Prerendered Multiresolution Imagery
One of the fundamental problems in remote visualization -- where I/O and data intensive visualization activities take place at a centrally located supercomputer center and resulting imagery is delivered to a remotely located user -- is reduced interactivity resulting from the combination of high network latency and relatively low network bandwidth. This research project has produced a novel approach for latency-tolerant delivery of visualization and rendering results where client-side frame rate display performance is independent of source dataset size, image size, visualization technique or rendering complexity. As such, it is a suitable solution for remote visualization image delivery for any visualization or rendering application that can generate image frames in an ordered fashion. This new capability is suitable for use in addressing many of ASCR's remote visualization needs, particularly deployment at open computing facilities to provide remote visualization capabilities to teams of scientific researchers.
Improving Performance of M-to-N Processing and Data Redistribution in In Transit Analysis and Visualization
In an in transit setting, a parallel data producer, such as a numerical simulation, runs on one set of ranks M, while a data consumer, such as a parallel visualization application, runs on a different set of ranks N. One of the central challenges in this in transit setting is to determine the mapping of data from the set of M producer ranks to the set of N consumer ranks. This is a challenging problem for several reasons, such as the producer and consumer codes potentially having different scaling characteristics and different data models. The resulting mapping from M to N ranks can have a significant impact on aggregate application performance. In this work, we present an approach for performing this M-to-N mapping in a way that has broad applicability across a diversity of data producer and consumer applications. We evaluate its design and performance with a study that runs at high concurrency on a modern HPC platform. By leveraging design characteristics that facilitate an “intelligent” mapping from M to N, we observe that significant performance gains are possible in terms of several different metrics, including time-to-solution and amount of data moved.
Performance Analysis of Traditional and Data-Parallel Primitive Implementations of Visualization and Analysis Kernels
Measurements of absolute runtime are useful as a summary of performance when
studying parallel visualization and analysis methods on computational platforms
of increasing concurrency and complexity. We can obtain even more insights by
measuring and examining more detailed measures from hardware performance
counters, such as the number of instructions executed by an algorithm
implemented in a particular way, the amount of data moved to/from memory,
memory hierarchy utilization levels via cache hit/miss ratios, and so forth.
This work focuses on performance analysis on modern multi-core platforms of
three different visualization and analysis kernels that are implemented in
two different ways: one is "traditional", using combinations of C++ and VTK, and
the other uses a data-parallel approach using VTK-m. Our performance study
consists of measurement and reporting of several different hardware performance
counters on two different multi-core CPU platforms. The results reveal
interesting performance differences between these two different approaches for
implementing these kernels, results that would not be apparent using runtime as
the only metric.
Federal Market Information Technology in the Post Flash Crash Era: Roles for Supercomputing
This paper describes collaborative work between active traders, regulators, economists, and supercomputing researchers to replicate and extend investigations of the Flash Crash and other market anomalies in a National Laboratory HPC environment. Our work suggests that supercomputing tools and methods will be valuable to market regulators in achieving the goal of market safety, stability, and security. Research results using high frequency data and analytics are described, and directions for future development are discussed. Currently, the key mechanisms for preventing catastrophic market action are “circuit breakers.” We believe a more graduated approach, similar to the “yellow light” approach in motorsports to slow down traffic, might be a better way to achieve the same goal. To enable this objective, we study a number of indicators that could foresee hazards in market conditions and explore options to confirm such predictions. Our tests confirm that Volume Synchronized Probability of Informed Trading (VPIN) and a version of the volume Herfindahl-Hirschman Index (HHI) for measuring market fragmentation can indeed give strong signals ahead of the Flash Crash event on May 6, 2010. This is a preliminary step toward a full-fledged early-warning system for unusual market conditions.
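For readers unfamiliar with the HHI, the sketch below computes the textbook form of the index from traded volume per venue: the sum of squared volume shares. The abstract refers only to "a version of" the volume HHI, so the paper's exact variant may differ; the function name `volume_hhi` and the per-venue input are assumptions made for illustration.

```python
def volume_hhi(volumes):
    """Textbook Herfindahl-Hirschman Index over traded volume shares.

    `volumes` is a hypothetical list of traded volume per venue within one
    time window. The result lies in (0, 1]: 1.0 means all volume is on a
    single venue, and lower values indicate a more fragmented market.
    """
    total = sum(volumes)
    if total <= 0:
        raise ValueError("total volume must be positive")
    return sum((v / total) ** 2 for v in volumes)
```

With four venues trading equal volume the index is 0.25, and with a single venue it is 1.0, so a falling value over successive windows corresponds to increasing fragmentation.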
Accelerating Network Traffic Analytics Using Query-Driven Visualization
Realizing operational analytics solutions where large and complex data must be analyzed in a time-critical fashion entails integrating many different types of technology. This paper focuses on an interdisciplinary combination of scientific data management and visualization/analysis technologies targeted at reducing the time required for data filtering, querying, hypothesis testing and knowledge discovery in the domain of network connection data analysis. We show that use of compressed bitmap indexing can quickly answer queries in an interactive visual data analysis application, and compare its performance with two alternatives for serial and parallel filtering/querying on 2.5 billion records worth of network connection data collected over a period of 42 weeks. Our approach to visual network connection data exploration centers on two primary factors: interactive ad-hoc and multiresolution query formulation and execution over n dimensions and visual display of the n-dimensional histogram results. This combination is applied in a case study to detect a distributed network scan and to then identify the set of remote hosts participating in the attack. Our approach is sufficiently general to be applied to a diverse set of data understanding problems as well as used in conjunction with a diverse set of analysis and visualization tools.
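The idea behind bitmap indexing can be sketched in a few lines: each distinct value of a column gets a bit vector marking the rows where it occurs, so a multi-attribute query reduces to bitwise operations. The example below is an uncompressed toy version that uses Python integers as bit vectors; the compression scheme the paper relies on is not shown, and the column names and records are hypothetical.

```python
from collections import defaultdict

def build_bitmap_index(values):
    """Map each distinct value to an int whose bit i is set when values[i] == value."""
    index = defaultdict(int)
    for row, value in enumerate(values):
        index[value] |= 1 << row
    return index

def rows_of(bitmap):
    """Return the row numbers whose bits are set in `bitmap`, in ascending order."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical connection records, stored column-wise.
ports = [22, 80, 22, 443, 22]
states = ["SYN", "SYN", "ACK", "SYN", "SYN"]

port_index = build_bitmap_index(ports)
state_index = build_bitmap_index(states)

# Query "port == 22 AND state == 'SYN'" becomes one bitwise AND.
hits = rows_of(port_index[22] & state_index["SYN"])
```

Here `hits` contains rows 0 and 4, the only records that satisfy both conditions. The raw columns are never rescanned at query time, which is what makes this style of index attractive for interactive filtering.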
Query-Driven Visualization of Time-Varying Adaptive Mesh Refinement Data
The visualization and analysis of AMR-based simulations is integral to the process of obtaining new insight in scientific research. We present a new method for performing query-driven visualization and analysis on AMR data, with specific emphasis on time-varying AMR data. Our work introduces a new method that directly addresses the dynamic spatial and temporal properties of AMR grids which challenge many existing visualization techniques. Further, we present the first implementation of query-driven visualization on the GPU that uses a GPU-based indexing structure to both answer queries and efficiently utilize GPU memory. We apply our method to two different science domains to demonstrate its broad applicability.
Visualization and Analysis of 3D Gene Expression Data
Recent methods for extracting precise measurements of spatial gene expression patterns from three-dimensional (3D) image data open the way for new analysis of the complex gene regulatory networks controlling animal development. To support analysis of this novel and highly complex data we developed PointCloudXplore (PCX), an integrated visualization framework that supports dedicated multi-modal, physical and information visualization views along with algorithms to aid in analyzing the relationships between gene expression levels. Using PCX, we helped our science stakeholders to address many questions in 3D gene expression research, e.g., to objectively define spatial pattern boundaries and temporal profiles of genes and to analyze how mRNA patterns are controlled by their regulatory transcription factors.