Search CORE

357 research outputs found

High-performance SIMT code generation in an active visual effects library

Author
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2009
Field of study

A metadata-enhanced framework for high performance visual effects

Author: Cornwall Jay L. T.
Cornwall Jay L. T.
Publication venue: Computing, Imperial College London
Publication date: 01/08/2010
Field of study

This thesis is devoted to reducing the interactive latency of image processing computations in visual effects. Film and television graphic artists depend upon low-latency feedback to receive a visual response to changes in effect parameters. We tackle latency with a domain-specific optimising compiler which leverages high-level program metadata to guide key computational and memory hierarchy optimisations. This metadata encodes static and dynamic information about data dependence and patterns of memory access in the algorithms constituting a visual effect – features that are typically difficult to extract through program analysis – and presents it to the compiler in an explicit form. By using domain-specific information as a substitute for program analysis, our compiler is able to target a set of complex source-level optimisations that a vendor compiler does not attempt, before passing the optimised source to the vendor compiler for lower-level optimisation. Three key metadata-supported optimisations are presented. The first is an adaptation of space and schedule optimisation – based upon well-known compositions of the loop fusion and array contraction transformations – to the dynamic working sets and schedules of a runtimeparameterised visual effect. This adaptation sidesteps the costly solution of runtime code generation by specialising static parameters in an offline process and exploiting dynamic metadata to adapt the schedule and contracted working sets at runtime to user-tunable parameters. The second optimisation comprises a set of transformations to generate SIMD ISA-augmented source code. Our approach differs from autovectorisation by using static metadata to identify parallelism, in place of data dependence analysis, and runtime metadata to tune the data layout to user-tunable parameters for optimal aligned memory access. The third optimisation comprises a related set of transformations to generate code for SIMT architectures, such as GPUs. Static dependence metadata is exploited to guide large-scale parallelisation for tens of thousands of in-flight threads. Optimal use of the alignment-sensitive, explicitly managed memory hierarchy is achieved by identifying inter-thread and intra-core data sharing opportunities in memory access metadata. A detailed performance analysis of these optimisations is presented for two industrially developed visual effects. In our evaluation we demonstrate up to 8.1x speed-ups on Intel and AMD multicore CPUs and up to 6.6x speed-ups on NVIDIA GPUs over our best hand-written implementations of these two effects. Programmability is enhanced by automating the generation of SIMD and SIMT implementations from a single programmer-managed scalar representation

Spiral - Imperial College Digital Repository

Design space exploration for GPU-based architecture

Author: Ye Z.
Publication venue
Publication date: 01/01/2009
Field of study

Repository TU/e

Pure OAI Repository

GPU accelerated Monte Carlo simulation of Brownian motors dynamics with CUDA

Author: Gardiner
Reimann
Hänggi
Jülicher
Kay
Hänggi
Binder
Kloeden
Platen
Januszewski
Seibert
Barros
Polyakov
Januszewski
Hänggi
Astumian
Hänggi
Risken
Łuczka
Hänggi
Spiechowicz
Spiechowicz
Spiechowicz
Czernik
Łuczka
Kula
Kula
Kostur
Łuczka
Kula
Kim
Grigoriu
Palleschi
Kim
Publication venue: 'Elsevier BV'
Publication date: 01/01/2004
Field of study

This work presents an updated and extended guide on methods of a proper acceleration of the Monte Carlo integration of stochastic differential equations with the commonly available NVIDIA Graphics Processing Units using the CUDA programming environment. We outline the general aspects of the scientific computing on graphics cards and demonstrate them with two models of a well known phenomenon of the noise induced transport of Brownian motors in periodic structures. As a source of fluctuations in the considered systems we selected the three most commonly occurring noises: the Gaussian white noise, the white Poissonian noise and the dichotomous process also known as a random telegraph signal. The detailed discussion on various aspects of the applied numerical schemes is also presented. The measured speedup can be of the astonishing order of about 3000 when compared to a typical CPU. This number significantly expands the range of problems solvable by use of stochastic simulations, allowing even an interactive research in some cases.Comment: 21 pages, 5 figures; Comput. Phys. Commun., accepted, 201

arXiv.org e-Print Archive

CiteSeerX

Crossref

University of Birmingham Research Portal

Field-Configurable GPU

Author: Pedro Rodrigues de Castro
Publication venue
Publication date: 09/07/2021
Field of study

Nesta dissertação pretende-se desenvolver uma arquitetura de processamento dedicada destinada à aceleração de aplicações específicas, inspirada na estrutura de unidades de processamento do tipo GPU. A unidade de processamento deverá ser programável e configurável para os requisitos de aplicações específicas, sendo adaptada aos tipos e à quantidade de recursos lógicos disponíveis num dispositivo FPGA selecionado. Pretende-se que o acelerador consiga tirar o máximo partido dos recursos disponíveis num determinado dispositivo FPGA (memória, unidades aritméticas, recursos lógicos) com o objetivo de maximizar o desempenho de aplicações selecionadas. Serão consideradas aplicações alvo no domínio do processamento de imagem e de "machine learning". Uma vez selecionada uma arquitetura base, a especialização para uma aplicação (ou classe de aplicações) terá por base a configuração de 3 componentes fundamentais: organização do sistema de memória distribuída (construído com os blocos de memória RAM internos da FPGA), organização das unidades de processamento aritmético (que podem ser heterogéneas) e dimensão dos caminhos de dados. O sistema a desenvolver deverá ser desenhado ao nível RTL, em Verilog, e contemplar um processo automatizado para personalizar o acelerador a partir de um conjunto de especificações definidas com base nas características da aplicação alvo. Esse processo de personalização poderá ser feito com base na definição de parâmetros em Verilog, ou também recorrendo a aplicações dedicadas, a desenvolver, para gerar diretamente código Verilog. Deverá também ser desenvolvido um conjunto elementar de ferramentas de suporte, nomeadamente para geração do código a executar pelo processador. Como validação final, pretende-se integrar e demonstrar o acelerador num sistema de processamento de imagem em tempo real

Repositório Aberto da Universidade do Porto

GPU optimizations for a production molecular docking code

Author: Landaverde Raphael J.
Publication venue: Boston University
Publication date: 01/01/2014
Field of study

Thesis (M.Sc.Eng.) -- Boston UniversityScientists have always felt the desire to perform computationally intensive tasks that surpass the capabilities of conventional single core computers. As a result of this trend, Graphics Processing Units (GPUs) have come to be increasingly used for general computation in scientific research. This field of GPU acceleration is now a vast and mature discipline. Molecular docking, the modeling of the interactions between two molecules, is a particularly computationally intensive task that has been the subject of research for many years. It is a critical simulation tool used for the screening of protein compounds for drug design and in research of the nature of life itself. The PIPER molecular docking program was previously accelerated using GPUs, achieving a notable speedup over conventional single core implementation. Since its original release the development of the CPU based PIPER has not ceased, and it is now a mature and fast parallel code. The GPU version, however, still contains many potential points for optimization. In the current work, we present a new version of GPU PIPER that attains a 3.3x speedup over a parallel MPI version of PIPER running on an 8 core machine and using the optimized Intel Math Kernel Library. We achieve this speedup by optimizing existing kernels for modern GPU architectures and migrating critical code segments to the GPU. In particular, we both improve the runtime of the filtering and scoring stages by more than an order of magnitude, and move all molecular data permanently to the GPU to improve data locality. This new speedup is obtained while retaining a computational accuracy virtually identical to the CPU based version. We also demonstrate that, due to the algorithmic dependencies of the PIPER algorithm on the 3D Fast Fourier Transform, our GPU PIPER will likely remain proportionally faster than equivalent CPU based implementations, and with little room for further optimizations. This new GPU accelerated version of PIPER is integrated as part of the ClusPro molecular docking and analysis server at Boston University. ClusPro has over 4000 registered users and more than 50000 jobs run over the past 4 years

Boston University Institutional Repository (OpenBU)

Recommended from our members

AN ARCHITECTURE EVALUATION AND IMPLEMENTATION OF A SOFT GPGPU FOR FPGAs

Author: Andryc Kevin
Publication venue: ScholarWorks@UMass Amherst
Publication date: 25/10/2018
Field of study

Embedded and mobile systems must be able to execute a variety of different types of code, often with minimal available hardware. Many embedded systems now come with a simple processor and an FPGA, but not more energy-hungry components, such as a GPGPU. In this dissertation we present FlexGrip, a soft architecture which allows for the execution of GPGPU code on an FPGA without the need to recompile the design. The architecture is optimized for FPGA implementation to effectively support the conditional and thread-based execution characteristics of GPGPU execution without FPGA design recompilation. This architecture supports direct CUDA compilation to a binary which is executable on the FPGA-based GPGPU. Our architecture is customizable, thus providing the FPGA designer with a selection of GPGPU cores which display performance versus area tradeoffs. This dissertation describes the FlexGrip architecture in detail and showcases the benefits by evaluating the design for a collection of five standard CUDA benchmarks which are compiled using standard GPGPU compilation tools. Speedups of 23x, on average, versus a MicroBlaze microprocessor are achieved for designs which take advantage of the conditional execution capabilities offered by FlexGrip. We also show FlexGrip can achieve an 80% average reduction of dynamic energy versus the MicroBlaze microprocessor. The dissertation furthers discussion by exploring application-customized versions of the soft GPGPU, thus exploiting the overlay architecture. We expand the architecture to multiple processors per GPGPU and optimizing away features which are not needed by certain classes of applications. These optimizations, which include the effective use of block RAMs and DSP blocks, are critical to the performance of FlexGrip. By implementing a 2 GPGPU design, we show speedups of 44x on average versus a MicroBlaze microprocessor. Application-customized versions of the soft GPGPU can be used to further reduce dynamic energy consumption by an average of 14%. To complete this thesis, we augmented a GPGPU cycle accurate simulator to emulate FlexGrip and evaluate different levels of cache design spaces. We show performance increases for select benchmarks, however, we also show that 64% and 45% of benchmarks exhibited performance decreases when L1D cache was enabled for the 1 SMP and 2 SMP configurations, and only one benchmark showed performance improvement when the L2 cache was enabled

ScholarWorks@UMass Amherst

Working With Incremental Spatial Data During Parallel (GPU) Computation

Author: Chisholm Robert
Publication venue: 'University of Sheffield Conference Proceedings'
Publication date: 29/10/2019
Field of study

Central to many complex systems, spatial actors require an awareness of their local environment to enable behaviours such as communication and navigation. Complex system simulations represent this behaviour with Fixed Radius Near Neighbours (FRNN) search. This algorithm allows actors to store data at spatial locations and then query the data structure to find all data stored within a fixed radius of the search origin. The work within this thesis answers the question: What techniques can be used for improving the performance of FRNN searches during complex system simulations on Graphics Processing Units (GPUs)? It is generally agreed that Uniform Spatial Partitioning (USP) is the most suitable data structure for providing FRNN search on GPUs. However, due to the architectural complexities of GPUs, the performance is constrained such that FRNN search remains one of the most expensive common stages between complex systems models. Existing innovations to USP highlight a need to take advantage of recent GPU advances, reducing the levels of divergence and limiting redundant memory accesses as viable routes to improve the performance of FRNN search. This thesis addresses these with three separate optimisations that can be used simultaneously. Experiments have assessed the impact of optimisations to the general case of FRNN search found within complex system simulations and demonstrated their impact in practice when applied to full complex system models. Results presented show the performance of the construction and query stages of FRNN search can be improved by over 2x and 1.3x respectively. These improvements allow complex system simulations to be executed faster, enabling increases in scale and model complexity

White Rose E-theses Online