Scalable Real-Time Rendering for Extremely Complex 3D Environments Using Multiple GPUs
In 3D visualization, real-time rendering of high-quality meshes in complex 3D environments remains one of the major challenges in computer graphics. New data acquisition techniques such as 3D modeling and scanning have drastically increased the complexity of available models and the demand for higher display resolutions in recent years. Most existing acceleration techniques that render on a single GPU suffer from the limited GPU memory budget, time-consuming sequential execution, and finite display resolution. Recently, commodity workstations with multiple GPUs and multiple displays have become available. As a result, more GPU memory is available across a distributed cluster of GPUs, more computational power is provided through the combination of multiple GPUs, and a higher display resolution can be achieved by connecting each GPU to a display monitor (resulting in a large tiled display configuration). However, a multi-GPU workstation may not always deliver the desired rendering performance due to imbalanced rendering workloads among GPUs and the overhead of inter-GPU communication.
In this dissertation, I contribute a multi-GPU multi-display parallel rendering approach for complex 3D environments. The approach supports high-performance, high-quality rendering of static and dynamic 3D environments. A novel parallel load balancing algorithm, based on a screen partitioning strategy, dynamically balances the number of vertices and triangles rendered by each GPU. The overhead of inter-GPU communication is minimized by transferring only a small number of image pixels rather than chunks of 3D primitives, using a novel frame exchanging algorithm. State-of-the-art parallel mesh simplification and GPU out-of-core techniques are integrated into the multi-GPU multi-display system to accelerate the rendering process.
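A screen partitioning strategy of the kind described above can be sketched as a prefix-sum cut of per-column primitive counts. The function name and the per-column workload model are illustrative assumptions, not the dissertation's actual algorithm:

```python
# Hypothetical sketch of dynamic screen partitioning for multi-GPU load
# balancing: split the screen into vertical strips so each GPU renders
# roughly the same number of primitives.
from itertools import accumulate

def partition_screen(column_costs, num_gpus):
    """Return strip boundaries [(start, end), ...], one per GPU, balancing
    the summed per-column primitive counts via prefix sums."""
    prefix = list(accumulate(column_costs))   # cumulative workload per column
    total = prefix[-1]
    bounds, start = [], 0
    for gpu in range(1, num_gpus):
        target = total * gpu / num_gpus       # ideal cumulative cut point
        # first column whose cumulative cost reaches the target; the strip
        # ends just after it so that column's cost is included
        end = next(i for i, c in enumerate(prefix) if c >= target) + 1
        bounds.append((start, end))
        start = end
    bounds.append((start, len(column_costs)))
    return bounds
```

Recomputing the cuts each frame from the previous frame's per-strip counts is one simple way to keep the partition adaptive as the view changes.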
Optimization techniques for a 2D Engine
This document describes the development of a small 2D engine written in
C++. The focus of this project is to implement various optimization
techniques to achieve a performant application.
The result is a small 2D engine that can run on any Windows
machine and can handle more than 10,000 entities interacting with each
other in real time.
The source code for this project is public under the MIT License and
can be found in the following GitHub repository:
https://github.com/DavidTello1/2D-Rendere
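One standard optimization for handling thousands of interacting entities, a uniform-grid broad phase, can be sketched as follows. This is an illustrative example, not code from the project's repository, and the cell size is an assumed parameter:

```python
# Illustrative uniform-grid broad phase: instead of testing all O(n^2)
# entity pairs for interaction, bucket entities by grid cell and only
# consider pairs that share a cell.
from collections import defaultdict
from itertools import combinations

CELL = 32.0  # cell size in world units (assumed)

def candidate_pairs(positions):
    """positions: {entity_id: (x, y)} -> set of id pairs sharing a cell."""
    grid = defaultdict(list)
    for eid, (x, y) in positions.items():
        grid[(int(x // CELL), int(y // CELL))].append(eid)
    pairs = set()
    for bucket in grid.values():
        pairs.update(combinations(sorted(bucket), 2))
    return pairs
```

A production version would also check the neighboring cells of each entity, but the core idea is the same: spatial bucketing turns a quadratic pair test into one that is near-linear for sparse scenes.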
3DRepo4Unity: Dynamic Loading of Version Controlled 3D Assets into the Unity Game Engine
In recent years, Unity has become a popular platform for the development of a broad range of visualization and VR applications. This is due to its ease of use, cross-platform compatibility and accessibility to independent developers. Despite such applications being cross-platform, their assets are generally bundled with executables, or streamed at runtime in a highly optimised, proprietary format. In this paper, we present a novel system for dynamically populating a Unity environment at runtime using open Web3D standards. Our system generates dynamic resources at runtime from a remote 3D Repo repository. This enables us to build a viewer which can easily visualize X3D-based revisions from a version controlled database in the cloud without any compile-time knowledge of the assets. We motivate the work and introduce the high-level architecture of our solution. We describe our new dynamic transcoding library with an emphasis on scalability and 3D rendering. We then perform a comparative evaluation between 3drepo.io, a state-of-the-art X3DOM-based renderer, and the new 3DRepo4Unity library on web browser platforms. Finally, we present a number of different applications that demonstrate the practicality of our chosen approach. By building on previous Web3D functionality and standards, our hope is to stimulate further discussion around and research into web formats that would enable incremental loading on other platforms.
Space-optimized texture atlases
Texture atlas parameterization provides an effective way to map a variety of colour and data attributes from 2D texture domains onto polygonal surface meshes. Most of the existing literature focuses on how
to build seamless texture atlases for continuous photometric detail, but little effort has been devoted
to devising efficient techniques for encoding self-repeating, discontinuous signals such as building facades.
We present a perception-based scheme for generating space-optimized texture atlases specifically
designed for intentionally non-bijective parameterizations. Our scheme combines within-chart tiling
support with intelligent packing and perceptual measures for assigning texture space according to
the amount of information content of the image and its saliency. We demonstrate our optimization
scheme in the context of real-time navigation through a gigatexel urban model of a European city.
Our scheme achieves significant compression ratios and speed-up factors with visually indistinguishable
results.
We developed a technique that generates space-optimized texture atlases for the particular encoding
of discontinuous signals projected onto geometry. The scene is partitioned using a texture atlas tree
that contains a texture atlas for each node. The leaf nodes of the tree contain scene geometry. The
level of detail is controlled by traversing the tree and selecting the appropriate texture atlas for a given
viewer position and orientation. In a preprocessing step, the textures associated with each texture atlas node
of the tree are packed. Textures are resized according to a given user-defined texel size and the size
of the geometry they are projected onto. We also use perceptual measures to assign texture space
according to image detail.
We also explore different techniques for supporting texture wrapping of discontinuous signals, which
involved the development of efficient techniques for compressing texture coordinates on the GPU. Our
approach supports texture filtering and DXTC compression without noticeable artifacts.
We have implemented a prototype version of our space-optimized texture atlases technique and
used it to render the 3D city model of Barcelona at interactive frame rates. The
whole model comprised more than three million triangles and contained more than twenty thousand different textures representing the building facades, with an average original resolution of 512 pixels per texture. Our scheme achieves up to 100:1 compression ratios and speed-up factors
of 20 with visually indistinguishable results.
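The perceptual space-assignment idea described above can be sketched as a proportional allocation: each texture receives atlas area in proportion to a saliency weight, subject to a total texel budget. The weighting scheme here is an illustrative assumption, not the paper's exact perceptual measure:

```python
# Hypothetical sketch: distribute a fixed atlas texel budget across
# textures in proportion to a per-texture saliency weight.
import math

def assign_resolutions(saliency, budget_texels):
    """saliency: {tex_id: weight} -> {tex_id: side length in texels},
    with total allocated area roughly equal to budget_texels."""
    total = sum(saliency.values())
    sides = {}
    for tex, w in saliency.items():
        area = budget_texels * w / total           # proportional area share
        sides[tex] = max(1, int(math.sqrt(area)))  # assume square sub-textures
    return sides
```

In a full pipeline the resulting sizes would additionally be snapped to power-of-two dimensions before packing, so that mipmapping and DXTC block compression remain straightforward.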
Data compression and transmission aspects of panoramic videos
Panoramic videos are effective means for representing static or dynamic scenes along predefined paths. They allow users to change their viewpoints interactively at points in time or space defined by the paths. High-resolution panoramic videos, while desirable, consume a significant amount of storage and bandwidth for transmission. They also make real-time decoding computationally very intensive. This paper proposes efficient data compression and transmission techniques for panoramic videos. A high-performance MPEG-2-like compression algorithm, which takes into account the random access requirements and the redundancies of panoramic videos, is proposed. The transmission aspects of panoramic videos over cable networks, local area networks (LANs), and the Internet are also discussed. In particular, an efficient advanced delivery sharing scheme (ADSS) for reducing repeated transmission and retrieval of frequently requested video segments is introduced. This protocol was verified by constructing an experimental VOD system consisting of a video server and eight Pentium 4 computers. Using the synthetic panoramic video Village at a rate of 197 kb/s and 7 f/s, nearly two-thirds of the memory access and transmission bandwidth of the video server were saved under normal network traffic.
Coprocessor integration for real-time event processing in particle physics detectors
High-energy physics experiments today have higher energies, more accurate sensors, and more flexible means of data collection than ever before. Their rapid progress requires ever more computational power; massively parallel hardware, such as graphics cards, holds the promise of providing this power at a much lower cost than traditional CPUs. Yet using this hardware requires new algorithms and new approaches to organizing data that can be difficult to integrate with existing software.
In this work, I explore the problem of using parallel algorithms within existing CPU-oriented frameworks and propose a compromise between the different trade-offs. The solution is a service that communicates with multiple event-processing pipelines, gathers data into batches, and submits them to hardware-accelerated parallel algorithms.
I integrate this service with Gaudi, a framework underlying the software environments of two of the four major experiments at the Large Hadron Collider. I examine the overhead the service adds to parallel algorithms. I perform a case study of using the service to run a parallel track reconstruction algorithm for the LHCb experiment's prospective VELO Pixel subdetector and look at the performance characteristics of using different data batch sizes. Finally, I put the findings into perspective within the context of the LHCb trigger's requirements.
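The batching idea described above can be sketched as a small service that collects events from many pipelines and hands fixed-size batches to a parallel algorithm. Class and method names are illustrative assumptions; the real service is integrated into Gaudi:

```python
# Minimal sketch of an event-batching service: pipelines submit events
# one at a time, and a (here simulated) hardware-accelerated algorithm
# receives them in fixed-size batches.
class BatchingService:
    def __init__(self, batch_size, run_parallel):
        self.batch_size = batch_size
        self.run_parallel = run_parallel   # e.g. a GPU kernel launcher
        self.pending = []                  # events waiting for a full batch
        self.results = {}                  # event_id -> algorithm output

    def submit(self, event_id, data):
        """Called by a pipeline; events wait until a full batch forms."""
        self.pending.append((event_id, data))
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self):
        """Send the accumulated batch to the parallel algorithm."""
        if not self.pending:
            return
        ids, payload = zip(*self.pending)
        for eid, out in zip(ids, self.run_parallel(list(payload))):
            self.results[eid] = out
        self.pending = []
```

The batch size trades latency against throughput: larger batches keep the accelerator busier but make each pipeline wait longer, which is exactly the characteristic the case study measures.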
GPU-optimized approaches to molecular docking-based virtual screening in drug discovery: A comparative analysis
Finding a novel drug is a very long and complex procedure. Using computer simulations, it is possible to accelerate the preliminary phases by performing a virtual screening that filters a large set of drug candidates down to a manageable number. This paper presents and compares two GPU-optimized implementations of a virtual screening algorithm targeting novel GPU architectures. The work focuses on the analysis of parallel computation patterns and their mapping onto the target architecture. The first method adopts a traditional approach that spreads the computation for a single molecule across the entire GPU. The second uses a novel batched approach that exploits the parallel architecture of the GPU to evaluate more molecules in parallel. Experimental results showed different behavior depending on the size of the database to be screened: one method reaches its performance plateau sooner, while the other has a longer initial transient period but achieves a higher throughput (up to 5x), making it more suitable for extreme-scale virtual screening campaigns.
High performance bioinformatics and computational biology on general-purpose graphics processing units
Bioinformatics and Computational Biology (BCB) is a relatively new
multidisciplinary field which brings together many aspects of
biology, computer science, statistics, and engineering. Bioinformatics extracts
useful information from biological data and makes it more intuitive and
understandable by applying principles of information science, while
computational biology harnesses computational approaches and technologies
to answer biological questions conveniently. Recent years have seen an
explosion in the size of biological data at a rate that outpaces the
growth in computational power of mainstream computer technologies,
namely general-purpose processors (GPPs). The aim of this thesis is to explore
the use of off-the-shelf Graphics Processing Unit (GPU) technology in the high-performance
and efficient implementation of BCB applications, in order to meet
the demands of growing biological data at an affordable cost.
The thesis presents detailed design and implementations of GPU solutions for
a number of BCB algorithms in two widely used BCB applications, namely
biological sequence alignment and phylogenetic analysis. Biological sequence
alignment can be used to determine the potential information about a newly
discovered biological sequence from other well-known sequences through
similarity comparison. On the other hand, phylogenetic analysis is concerned
with the investigation of the evolution of and relationships among organisms,
and has many uses in the fields of systems biology and comparative genomics.
In molecular-based phylogenetic analysis, the relationship between species is
estimated by inferring the common history of their genes, and
phylogenetic trees are then constructed to illustrate evolutionary relationships
among genes and organisms. However, both biological sequence alignment
and phylogenetic analysis are computationally expensive, as their computing and memory requirements grow polynomially or worse with
the size of sequence databases.
The thesis first presents a multi-threaded parallel design of the Smith-Waterman (SW) algorithm alongside an implementation on NVIDIA GPUs. A
novel technique is put forward to remove the restriction on the length of the
query sequence found in previous GPU-based implementations of the SW algorithm.
Based on this implementation, the difference between the two main task
parallelization approaches (inter-task and intra-task parallelization) is
presented. The resulting GPU implementation matches the speed of existing
GPU implementations while providing more flexibility, i.e. flexible sequence
lengths in real-world applications. It also outperforms an equivalent GPP-based
implementation by 15x-20x. After this, the thesis presents the first
reported multi-threaded design and GPU implementation of the Gapped
BLAST with Two-Hit method algorithm, which is widely used for aligning
biological sequences heuristically. This achieved speed-ups of up to 3x
compared to the most optimised GPP implementations.
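For reference, the Smith-Waterman local alignment recurrence that the thesis parallelizes can be written as a plain CPU version (linear gap penalty; the scoring values here are illustrative assumptions):

```python
# Serial Smith-Waterman: H[i][j] is the best score of a local alignment
# ending at a[i-1], b[j-1]; scores are clamped at zero so alignments can
# restart anywhere.
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
            best = max(best, H[i][j])
    return best
```

The GPU designs exploit the fact that all cells on the same anti-diagonal of H depend only on earlier anti-diagonals and can therefore be computed in parallel, which is the basis of the intra-task parallelization mentioned above.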
The thesis then presents a multi-threaded design and GPU implementation of
a Neighbor-Joining (NJ)-based method for phylogenetic tree construction and
multiple sequence alignment (MSA). This achieves an 8x-20x speed-up compared
to an equivalent GPP implementation based on the widely used ClustalW
software. The NJ method, however, only gives one possible tree, which strongly
depends on the evolutionary model used. A more advanced method uses
maximum likelihood (ML) for scoring phylogenies with Markov Chain Monte
Carlo (MCMC)-based Bayesian inference. The latter was the subject of another
multi-threaded design and GPU implementation presented in this thesis,
which achieved a 4x-8x speed-up compared to an equivalent GPP
implementation based on the widely used MrBayes software.
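The computational kernel of one Neighbor-Joining step, selecting the pair of taxa that minimizes the Q-criterion, is the kind of all-pairs computation that maps naturally onto a GPU. A serial sketch, for illustration only:

```python
# One NJ step: pick the pair (i, j) minimizing
# Q(i, j) = (n - 2) * D[i][j] - r_i - r_j,
# where r_i is the total distance from taxon i to all others.
def nj_pick_pair(D):
    """D: symmetric distance matrix (list of lists). Return the (i, j)
    pair to be joined next."""
    n = len(D)
    r = [sum(row) for row in D]            # total distance per taxon
    best, pair = float("inf"), None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * D[i][j] - r[i] - r[j]
            if q < best:
                best, pair = q, (i, j)
    return pair
```

Each Q(i, j) entry is independent of the others, so on a GPU every thread can evaluate one pair and a parallel reduction can find the minimum, which is the pattern a multi-threaded NJ design exploits.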
Finally, the thesis presents a general evaluation of the designs and
implementations achieved in this work as a step towards the evaluation of
GPU technology in BCB computing, in the context of other computer technologies including GPPs and Field-Programmable Gate Array (FPGA) technology.