
    DPP-PMRF: Rethinking Optimization for a Probabilistic Graphical Model Using Data-Parallel Primitives

    We present a new parallel algorithm for probabilistic graphical model optimization. The algorithm relies on data-parallel primitives (DPPs), which provide portable performance across hardware architectures. We evaluate results on CPUs and GPUs for an image segmentation problem. Compared to a serial baseline, we observe runtime speedups of up to 13X (CPU) and 44X (GPU). We also compare our performance to a reference, OpenMP-based algorithm, and find speedups of up to 7X (CPU). Comment: LDAV 2018, October 2018
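
As a rough illustration of the data-parallel-primitive style the abstract alludes to, the sketch below phrases one label-update sweep of a toy MRF segmentation as whole-array "map" and "reduce" operations in NumPy. The energy terms, neighborhood, and function names are illustrative assumptions, not the paper's DPP-PMRF algorithm.

```python
import numpy as np

def dpp_style_label_update(unary, labels, smoothness=1.0):
    """One label-update sweep phrased as whole-array ("map"/"reduce") operations.

    unary  : (H, W, L) per-pixel, per-label data costs (illustrative).
    labels : (H, W) current integer labeling.
    """
    num_labels = unary.shape[-1]
    cost = unary.copy()
    # "Map": for every pixel and candidate label, add a penalty for each
    # disagreeing 4-neighbor (periodic boundary via np.roll, for brevity).
    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        neighbor = np.roll(labels, shift=(dy, dx), axis=(0, 1))
        cost += smoothness * (neighbor[..., None] != np.arange(num_labels))
    # "Reduce" over the label axis: every pixel picks its cheapest label.
    return np.argmin(cost, axis=-1)

# Usage on a random toy problem with 3 labels.
rng = np.random.default_rng(0)
unary = rng.random((64, 64, 3))
labels = np.argmin(unary, axis=-1)
labels = dpp_style_label_update(unary, labels)
```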

    Fast Collision Culling in Large-Scale Environments Using GPU Mapping Function

    This paper presents a novel and efficient GPU-based parallel algorithm to cull non-colliding object pairs in very large-scale dynamic simulations. It can cull objects in less than 25 ms for more than 100K objects. It is designed for many-core GPUs and fully exploits multi-threaded capabilities and data parallelism. To take advantage of the high number of cores, a new mapping function is defined that enables GPU threads to determine the object pair to compute without any global memory access. These optimized GPU kernel functions take the thread indices and turn them into a unique pair of objects to test. A square-root approximation technique based on Newton's estimation is used, enabling the threads to perform only a few atomic operations. A first characterization of the approximation errors is presented, enabling incorrect computations to be fixed. The I/O GPU streams are optimized using binary masks. The implementation and evaluation are carried out on large-scale dynamic rigid-body simulations. The speedup over other recently proposed CPU- and GPU-based techniques is highlighted, and the comparison shows that our system is, in most cases, faster than previous approaches.
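
The mapping-function idea can be sketched as follows: a flat thread index is inverted through the triangular-number layout of pairs to recover a unique object pair, with a small correction step standing in for the paper's handling of square-root approximation errors. This is an assumed CPU-side illustration (using math.sqrt rather than a Newton-based approximation), not the authors' GPU kernel.

```python
import math

def thread_index_to_pair(k, n):
    """Map a flat index k to a unique object pair (i, j), i < j, among n objects,
    without any lookup table or global memory access (illustrative sketch)."""
    # Closed-form row index derived from the triangular-number layout of pairs.
    i = n - 2 - int(math.floor(math.sqrt(-8.0 * k + 4.0 * n * (n - 1) - 7.0) / 2.0 - 0.5))

    def row_start(r):
        # Index of the first pair whose smaller element is r.
        return r * n - r * (r + 1) // 2

    # "Fixing" step: nudge i until k falls inside row i's index range, which
    # absorbs any error introduced by the approximate square root.
    while i > 0 and k < row_start(i):
        i -= 1
    while k >= row_start(i + 1):
        i += 1
    j = k - row_start(i) + i + 1
    return i, j

# Usage: enumerate all pairs for a tiny object count.
n = 5
pairs = [thread_index_to_pair(k, n) for k in range(n * (n - 1) // 2)]
print(pairs)  # [(0, 1), (0, 2), (0, 3), (0, 4), (1, 2), ...]
```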

    Is Smaller Always Better? - Evaluating Video Compression Techniques for Simulation Ensembles

    We evaluate the applicability of video compression techniques for compressing visualization image databases that are often used for in situ visualization. Considering practical implementation aspects, we identify relevant compression parameters and evaluate video compression for several test cases involving different data sets and visualization methods, using three different video codecs. To quantify the benefits and drawbacks of video compression, we employ metrics for image quality, compression rate, and performance. The experiments provide insight into parameter choices that work well in the considered cases.
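
For reference, two of the kinds of metrics mentioned (image quality and compression rate) could be computed along the lines below; the exact metrics, codecs, and parameter values used in the paper are not reproduced here, and the frames are synthetic stand-ins for rendered visualization images.

```python
import numpy as np

def psnr(original, decoded, peak=255.0):
    """Peak signal-to-noise ratio between an original and a decoded frame
    (one common image-quality metric; the paper's exact metrics may differ)."""
    mse = np.mean((original.astype(np.float64) - decoded.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

def compression_rate(raw_bytes, compressed_bytes):
    """Ratio of uncompressed to compressed size, e.g. 40.0 means 40:1."""
    return raw_bytes / compressed_bytes

# Usage with synthetic frames standing in for visualization images.
rng = np.random.default_rng(1)
frame = rng.integers(0, 256, size=(1080, 1920, 3), dtype=np.uint8)
noisy = np.clip(frame.astype(np.int16) + rng.integers(-3, 4, frame.shape), 0, 255).astype(np.uint8)
print(f"PSNR: {psnr(frame, noisy):.2f} dB")
print(f"rate: {compression_rate(frame.nbytes, 150_000):.1f}:1")
```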

    Shift-Based Parallel Image Compositing on InfiniBand Fat-Trees

    Parallel image compositing has been widely studied over the past 20 years, as it is one of the most crucial elements in the implementation of a scalable parallel rendering system. Many algorithms have been proposed and implemented on a large variety of supercomputers. Among existing supercomputers, InfiniBand™ (IB) PC clusters, with their associated fat-tree topology, are clearly becoming the dominant architecture, as they provide the scalability, high bandwidth, and low latency required by the most demanding parallel applications. Surprisingly, very few efforts have been devoted to the implementation and performance evaluation of parallel image compositing algorithms on this kind of architecture. In this paper we propose a new parallel image compositing algorithm, called Shift-Based, relying on a well-known communication pattern called shift permutation. Shift permutation is one of the ways to obtain the maximum cross-bisection bandwidth provided by an IB fat-tree cluster. We show that our Shift-Based algorithm scales on any number of processing nodes (with peak performance on specific counts), allows overlapping communications with computations, and exhibits contention-free network communications. This is demonstrated by compositing very high resolution images at interactive frame rates.
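
The shift-permutation pattern itself is easy to sketch: in each communication round every node sends to a partner offset by a fixed shift, so each round is a full permutation of the nodes. The snippet below only illustrates that schedule; it is not the Shift-Based compositing algorithm or its InfiniBand-specific tuning.

```python
def shift_schedule(num_nodes):
    """Communication schedule built from shift permutations: in round r every
    node i sends one message to (i + r) % num_nodes and receives one from
    (i - r) % num_nodes, so each round is a permutation of the nodes.
    (A simplified illustration of the pattern, not the full algorithm.)"""
    rounds = []
    for r in range(1, num_nodes):
        rounds.append([(i, (i + r) % num_nodes) for i in range(num_nodes)])
    return rounds

# Usage: print the schedule for 4 compositing nodes.
for r, pairs in enumerate(shift_schedule(4), start=1):
    print(f"round {r}: " + ", ".join(f"{src}->{dst}" for src, dst in pairs))
# round 1: 0->1, 1->2, 2->3, 3->0
# round 2: 0->2, 1->3, 2->0, 3->1
# round 3: 0->3, 1->0, 2->1, 3->2
```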

    Spatial CPU-GPU data structures for interactive rendering of large particle data

    In this work, I investigate the interactive visualization of arbitrarily large particle data sets which fit into system memory, but not into GPU memory. With conventional rendering techniques, interactivity of visualizations is drastically reduced when rendering tens or hundreds of millions of objects. At the same time, graphics hardware memory capacities limit the size of data sets which can be placed in GPU memory for rendering. To circumvent these obstacles, a progressive rendering approach is employed, which gradually streams and renders all particle data to the GPU without reducing or altering the particle data itself. The particle data is rendered according to a visibility sorting derived from occlusion relations between different parts of the data set, leading to a rendering order of scene contents guided by their importance for the rendered image. I analyze and compare possible implementation choices for rendering particles as opaque spheres in OpenGL, which forms the basis of the particle rendering application developed within this work. The application utilizes a multi-threaded architecture, where data preprocessing on a CPU thread and a rendering algorithm on a GPU thread ensure that the user can interact with the application at any time. In particular, it is guaranteed that the user can explore the particle data interactively, by ensuring minimal latency from user input to seeing the effects of that input. This is achieved by favoring user inputs over completeness of the rendered image at all stages during rendering. At the same time, the user is provided with immediate feedback about interactions by re-projecting all currently visible particles into the next rendered image. The re-projection is realized with an on-GPU particle cache of visible particles that is built during particle data streaming and rendering, and drawn upon user interaction using the most recent camera configuration according to user inputs. The combination of the developed techniques allows interactive exploration of particle data sets with up to 1.5 billion particles on a commodity computer.
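
A schematic sketch of the two-thread streaming idea described above follows: a producer thread pushes particle chunks in visibility order while the render loop favors user input, redrawing a cache of already-uploaded particles before continuing the stream. All names, the chunking, and the callback interface are assumptions made for illustration; the thesis' actual OpenGL/GPU-cache implementation is considerably more involved.

```python
import queue
import threading

def stream_chunks(particles, chunk_size, out_queue):
    """CPU-side producer: push particle chunks in (precomputed) visibility order."""
    for start in range(0, len(particles), chunk_size):
        out_queue.put(particles[start:start + chunk_size])
    out_queue.put(None)  # signal end of stream

def render_loop(out_queue, poll_input, draw_chunk, redraw_cache):
    """GPU-side consumer: favor input handling over completing the image."""
    cache_size = 0
    done = False
    while not done:
        if poll_input():
            redraw_cache(cache_size)   # immediate feedback: redraw cached particles
            continue
        try:
            chunk = out_queue.get(timeout=0.01)
        except queue.Empty:
            continue
        if chunk is None:
            done = True
        else:
            draw_chunk(chunk)          # progressive refinement of the current image
            cache_size += len(chunk)

# Usage with stub callbacks standing in for the actual input and GPU code.
q = queue.Queue(maxsize=4)
particles = list(range(10_000))
threading.Thread(target=stream_chunks, args=(particles, 2_500, q), daemon=True).start()
render_loop(q, poll_input=lambda: False,
            draw_chunk=lambda c: None,
            redraw_cache=lambda n: None)
```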

    Efficient evaluation of functionally represented volumetric objects.

    There are several approaches to representing shapes in computer graphics. One way to describe objects and operations is Function Representation (FRep). In FRep, a geometric object is defined by a single continuous real-valued function of point coordinates. Generally, geometric modelling is conducted in order to achieve a visual outcome. In FRep, the transformation of a function into a visual representation relies on extensive sampling of the function. The computational cost of this sampling can cause adverse effects at application runtime. In this thesis, the problem of efficient evaluation of the defining function is discussed. An observation is made on a wide range of operations and primitives within FRep and their suitability for parallelization. Furthermore, a novel method is proposed to distribute FRep computational workloads on parallel hardware devices such as graphics processing units and multi-core processors.
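
To make the FRep idea concrete, the sketch below defines objects as real-valued functions with f >= 0 inside and builds set operations from min/max (one simple choice of R-functions); evaluating such functions over whole arrays hints at why the workload parallelizes well. The primitives and conventions shown are generic FRep illustrations, not the thesis' method.

```python
import numpy as np

def sphere(cx, cy, cz, r):
    # f(p) = r^2 - |p - c|^2 : positive inside, zero on the surface.
    return lambda x, y, z: r**2 - ((x - cx)**2 + (y - cy)**2 + (z - cz)**2)

def union(f, g):
    return lambda x, y, z: np.maximum(f(x, y, z), g(x, y, z))

def intersection(f, g):
    return lambda x, y, z: np.minimum(f(x, y, z), g(x, y, z))

# Usage: sample a union of two spheres on a 128^3 grid in one vectorized pass.
obj = union(sphere(-0.4, 0, 0, 0.6), sphere(0.4, 0, 0, 0.6))
axis = np.linspace(-1, 1, 128)
x, y, z = np.meshgrid(axis, axis, axis, indexing="ij")
inside = obj(x, y, z) >= 0.0
print(f"occupied voxels: {int(inside.sum())} of {inside.size}")
```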

    EXPLORING MULTIPLE LEVELS OF PERFORMANCE MODELING FOR HETEROGENEOUS SYSTEMS

    The current trend in High-Performance Computing (HPC) is to extract concurrency from clusters that include heterogeneous resources such as General-Purpose Graphics Processing Units (GPGPUs) and Field Programmable Gate Arrays (FPGAs). Although these heterogeneous systems can provide substantial performance for massively parallel applications, much of the available computing capacity is often under-utilized due to inefficient application mapping, load balancing, and tuning. While several performance prediction models exist to efficiently tune applications, they often require significant computing architecture knowledge for reliable prediction. In addition, they do not address multiple levels of design space abstraction, and it is often difficult to choose a reliable prediction model for a given design. In this research, we develop a multi-level suite of performance prediction models for heterogeneous systems that primarily targets Synchronous Iterative Algorithms (SIAs). The modeling suite aims to produce accurate and straightforward application runtime predictions prior to the actual large-scale implementation. This suite addresses two levels of system abstraction: 1) a low level, where partial knowledge of the application implementation is available along with the system specifications, and 2) a high level, where implementation details are minimal and only high-level computing system specifications are given. The performance prediction modeling suite is developed using our proposed Synchronous Iterative GPGPU Execution (SIGE) model for GPGPU clusters, motivated by the RC Amenability Test for Scalable Systems (RATSS) model for FPGA clusters. The low-level abstraction for GPGPU clusters consists of a regression-based performance prediction framework that statistically abstracts system architecture characteristics, enabling performance prediction without detailed architecture knowledge. In this framework, the overall execution time of an application is predicted using regression models developed for the host-device computations and network-level communications performed in the algorithm. We have used a family of Spiking Neural Network (SNN) models and an Anisotropic Diffusion Filter (ADF) algorithm as SIA case studies for verification of the regression-based framework, achieving over 90% prediction accuracy compared to the actual implementations for several GPGPU cluster configurations tested. The results establish the adequacy of the low-level abstraction model for advanced, fine-grained performance prediction and design space exploration (DSE). The high-level abstraction consists of two primary modeling approaches: qualitative modeling, which uses existing subjective-analytical models for computation and communication; and quantitative modeling, which predicts computation and communication performance by measuring hardware events associated with objective-analytical models using micro-benchmarks. The performance prediction provided by the high-level abstraction approaches, albeit coarse-grained, delivers useful insight into application performance on the chosen heterogeneous system. A blend of the two high-level modeling approaches, labeled hybrid modeling, is explored for insightful preliminary performance prediction. The performance prediction models in the multi-level suite are verified and compared for their accuracy and ease of use, allowing developers to choose a model that best fits their level of design space abstraction.
We also construct a roadmap that guides the user from optimal Application-to-Accelerator (A2A) mapping to fine-grained performance prediction, thereby providing a hierarchical approach to optimal application porting on the target heterogeneous system. The end goal of this dissertation research is to offer the HPC community a thorough, non-architecture-specific performance prediction framework in the form of a hierarchical modeling suite that enables it to optimally utilize heterogeneous resources.
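
As a toy illustration of the regression-based, low-level abstraction described above, runtime can be modeled as a linear combination of computation and communication features fitted from small pilot runs. The features, numbers, and model form below are made-up assumptions for illustration, not the SIGE model itself.

```python
import numpy as np

# Each row: [work items per device, bytes communicated per iteration]
features = np.array([
    [1e6, 2e5],
    [2e6, 4e5],
    [4e6, 8e5],
    [8e6, 1.6e6],
], dtype=float)
runtimes = np.array([0.9, 1.7, 3.4, 6.6])  # seconds, from synthetic pilot runs

# Fit runtime ≈ a * work + b * bytes + c by linear least squares.
design = np.column_stack([features, np.ones(len(features))])
coeffs, *_ = np.linalg.lstsq(design, runtimes, rcond=None)

# Predict the runtime of a larger, not-yet-run configuration.
new_config = np.array([1.6e7, 3.2e6, 1.0])
print(f"predicted runtime: {new_config @ coeffs:.2f} s")
```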