
    Foveated Path Tracing with Fast Reconstruction and Efficient Sample Distribution

    Photo-realistic offline rendering is currently done with path tracing, because it naturally produces many real-life light effects such as reflections, refractions and caustics. These effects are hard to achieve with other rendering techniques. However, path tracing in real time is difficult due to its high computational demand. Therefore, current real-time path tracing systems can only generate a very noisy estimate of the final frame, which is then denoised with a post-processing reconstruction filter. A path tracing-based rendering system capable of fulfilling the high-resolution, low-latency requirements of mixed reality devices would enable a very immersive user experience. One possible solution for fulfilling these requirements could be foveated path tracing, wherein the rendering resolution is reduced in the periphery of the human visual system. The key challenge is that the foveated path-traced image in the periphery is both sparse and noisy, placing high demands on the reconstruction filter. This thesis proposes the first regression-based reconstruction filter for path tracing that runs in real time. The filter is designed for highly noisy one-sample-per-pixel inputs. The fast execution is accomplished with blockwise processing and a fast implementation of the regression. In addition, a novel Visual-Polar coordinate space, which distributes the samples according to the contrast sensitivity model of the human visual system, is proposed. The specialty of the Visual-Polar space is that it reduces both path tracing and reconstruction work, because both can be done at a smaller resolution. These techniques enable a working prototype of a foveated path tracing system and may serve as a stepping stone towards wider commercial adoption of photo-realistic real-time path tracing.
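
    The eccentricity-driven sample distribution summarized above can be illustrated with a short sketch. The snippet below is only a toy model, assuming a simple power-law density falloff around the gaze point instead of the thesis's actual Visual-Polar mapping and contrast-sensitivity model; the name foveated_samples and the falloff parameter are purely illustrative.

```python
import numpy as np

def foveated_samples(width, height, gaze, n_samples, falloff=2.0, rng=None):
    """Toy foveated sample distribution: sample density decays with distance
    from the gaze point (a stand-in for an eccentricity-based model)."""
    rng = np.random.default_rng() if rng is None else rng
    r_max = np.hypot(width, height)
    u = rng.random(n_samples)
    radii = r_max * u ** falloff            # power-law falloff concentrates samples near the gaze
    angles = rng.random(n_samples) * 2.0 * np.pi
    x = gaze[0] + radii * np.cos(angles)
    y = gaze[1] + radii * np.sin(angles)
    on_screen = (x >= 0) & (x < width) & (y >= 0) & (y < height)
    return np.stack([x[on_screen], y[on_screen]], axis=1)
```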

    Asynchronous and Multiprecision Linear Solvers - Scalable and Fault-Tolerant Numerics for Energy Efficient High Performance Computing

    Asynchronous methods minimize idle times by removing synchronization barriers and therefore allow the efficient use of computer systems. The implied high tolerance with respect to communication latencies improves fault tolerance. As asynchronous methods also enable the use of the power and energy saving mechanisms provided by the hardware, they are suitable candidates for the highly parallel and heterogeneous hardware platforms that are expected in the near future.
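
    As a rough illustration of what removing synchronization barriers can look like in an iterative linear solver, the sketch below runs a block-asynchronous Jacobi relaxation in which each block of unknowns is repeatedly updated against a possibly stale copy of the iterate and the blocks are visited in random order. This is a minimal toy model under those assumptions, not one of the solvers developed in the thesis; all names are illustrative.

```python
import numpy as np

def async_jacobi_sketch(A, b, n_sweeps=50, n_blocks=4, local_iters=3, rng=None):
    """Toy block-asynchronous Jacobi: blocks relax against stale data and are
    visited in random order, mimicking workers without a global barrier."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(b)
    x = np.zeros(n)
    D = np.diag(A)                                  # Jacobi uses the diagonal only
    blocks = np.array_split(np.arange(n), n_blocks)
    for _ in range(n_sweeps):
        for blk in rng.permutation(n_blocks):       # random visiting order ~ no barrier
            idx = blocks[blk]
            stale = x.copy()                        # possibly outdated view of the other blocks
            for _ in range(local_iters):
                r = b[idx] - A[idx] @ stale         # local residual
                stale[idx] += r / D[idx]            # diagonal (Jacobi) correction
            x[idx] = stale[idx]                     # publish only the local block
    return x
```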

    Media gateway using a GPU

    Master's degree in Computer and Telematics Engineering

    Robust computational intelligence techniques for visual information processing

    This Ph.D. thesis is about image processing by computational intelligence techniques. Firstly, a general overview of this book is carried out, where the motivation, the hypothesis, the objectives, and the methodology employed are described. The use and analysis of different mathematical norms will be our goal. After that, the state of the art focused on the applications of the image processing proposals is presented. In addition, the fundamentals of the image modalities, with particular attention to magnetic resonance, and the learning techniques used in this research, mainly based on neural networks, are summarized. To end up, the mathematical framework on which this work is based, ℓp-norms, is defined. Three different parts associated with image processing techniques follow. The first non-introductory part of this book collects the developments which are about image segmentation. Two of them are applications for video surveillance tasks and try to model the background of a scenario using a specific camera. The other work is centered on the medical field, where the goal of segmenting diabetic wounds in a very heterogeneous dataset is addressed. The second part is focused on the optimization and implementation of new models for curve and surface fitting in two and three dimensions, respectively. The first work presents a parabola fitting algorithm based on measuring the distances of the interior and exterior points to the focus and the directrix. The second work changes to an ellipse shape, and it ensembles the information of multiple fitting methods. Last, the ellipsoid problem is addressed in a similar way to the parabola. The third part is exclusively dedicated to the super-resolution of Magnetic Resonance Images. In one of these works, an algorithm based on the random shifting technique is developed. Besides, we studied noise removal and resolution enhancement simultaneously. To end, the cost function of deep networks has been modified with different combinations of norms in order to improve their training. Finally, the general conclusions of the research are presented and discussed, as well as the possible future research lines that are able to make use of the results obtained in this Ph.D. thesis.
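
    The parabola-fitting idea in the second part lends itself to a short worked sketch: a point lies on a parabola exactly when its distance to the focus equals its distance to the directrix, so focus and directrix can be estimated by minimizing the squared mismatch between those two distances. The code below is a simplified illustration under that assumption only; it does not reproduce the thesis's interior/exterior-point treatment, and fit_parabola and its parametrisation are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def fit_parabola(points):
    """Fit a focus (fx, fy) and a directrix x*cos(t) + y*sin(t) = d to 2-D
    points by minimizing sum((dist to focus - dist to directrix)^2)."""
    pts = np.asarray(points, dtype=float)

    def cost(params):
        fx, fy, t, d = params
        to_focus = np.hypot(pts[:, 0] - fx, pts[:, 1] - fy)
        to_directrix = np.abs(pts[:, 0] * np.cos(t) + pts[:, 1] * np.sin(t) - d)
        return np.sum((to_focus - to_directrix) ** 2)

    # Crude initial guess: focus at the centroid, horizontal directrix below the data.
    x0 = np.array([pts[:, 0].mean(), pts[:, 1].mean(), np.pi / 2, pts[:, 1].min() - 1.0])
    return minimize(cost, x0, method="Nelder-Mead")
```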

    Artificial Intelligence with Light Supervision: Application to Neuroimaging

    Recent developments in artificial intelligence research have resulted in tremendous success in computer vision, natural language processing and medical imaging tasks, often reaching human or superhuman performance. In this thesis, I further developed artificial intelligence methods based on convolutional neural networks with a special focus on the automated analysis of brain magnetic resonance imaging scans (MRI). I showed that efficient artificial intelligence systems can be created using only minimal supervision, by reducing the quantity and quality of annotations used for training. I applied those methods to the automated assessment of the burden of enlarged perivascular spaces, brain structural changes that may be related to dementia, stroke, mult

    Parallel Asynchronous Matrix Multiplication for a Distributed Pipelined Neural Network

    Machine learning is an approach to devise algorithms that compute an output without a given rule set but based on a self-learning concept. This approach is of great importance for several fields of application in science and industry where traditional programming methods are not sufficient. In neural networks, a popular subclass of machine learning algorithms, previous experience is commonly used to train the network so that it produces good outputs for newly introduced inputs. By increasing the size of the network, more complex problems can be solved, which in turn relies on a huge amount of training data. Increasing the complexity also leads to higher computational demand, larger storage requirements, and the need for parallelization. Several parallelization approaches for neural networks have already been considered. Most approaches use special-purpose hardware, whilst other work focuses on using standard hardware. Often these approaches target the problem by parallelizing over the training data. In this work a new parallelization method named poadSGD is proposed for the parallelization of fully-connected, large-scale feedforward networks on a compute cluster with standard hardware. poadSGD is based on the stochastic gradient descent algorithm. A block-wise distribution of the network's layers to groups of processes and a pipelining scheme for batches of the training samples are used. The network is updated asynchronously without interrupting ongoing computations of subsequent batches. For this task a one-sided communication scheme is used. A main algorithmic part of the batch-wise pipelined version consists of matrix multiplications which occur in a special distributed setup, where each matrix is held by a different process group. GASPI, a parallel programming model from the field of "Partitioned Global Address Space" (PGAS) models, is introduced and compared to other models from this class. As it mainly relies on one-sided and asynchronous communication, it is a perfect candidate for the asynchronous update task in the poadSGD algorithm. Therefore, the matrix multiplication is also implemented based on GASPI. In order to efficiently handle upcoming synchronizations within the process groups and achieve a good workload distribution, a two-dimensional block-cyclic data distribution is applied to the matrices. Based on this distribution, the multiplication algorithm is computed by diagonally iterating over the sub-blocks of the resulting matrix and computing the sub-blocks in subgroups of the processes. The sub-blocks are computed by sharing the workload between the process groups and communicating mostly in pairs or in subgroups. The communication in pairs is set up to be overlapped by other ongoing computations. The implementation poses a special challenge, since the asynchronous communication routines must be handled with care as to which processor is working at what point in time with which data, in order to prevent an unintentional dual use of data. The theoretical analysis shows the matrix multiplication to be superior to a naive implementation when the dimension of the sub-blocks of the matrices exceeds 382. The performance achieved in the test runs did not meet the expectations raised by the theoretical analysis. The algorithm was executed on up to 512 cores and for matrices up to a size of 131,072 x 131,072. The implementation using the GASPI API was found not to be straightforward, but to provide good potential for overlapping communication with computation whenever the data dependencies of an application allow for it. The matrix multiplication was successfully implemented and can be used within an implementation of the poadSGD method that is yet to come. The poadSGD method seems very promising, especially as nowadays, with larger amounts of data and increasingly complex applications, approaches to the parallelization of neural networks are of growing interest.
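
    The two-dimensional block-cyclic distribution mentioned above can be made concrete in a few lines. The sketch below shows the usual ScaLAPACK-style index mapping, assuming square blocks and a fixed process grid; it only illustrates the data layout, not the poadSGD implementation itself, and the function names are hypothetical.

```python
def block_cyclic_owner(gi, gj, block, p_rows, p_cols):
    """Process (row, col) owning global matrix entry (gi, gj) under a 2-D
    block-cyclic distribution with square blocks on a p_rows x p_cols grid."""
    return (gi // block) % p_rows, (gj // block) % p_cols

def local_index(g, block, p):
    """Local index of global index g on its owning process along one dimension
    of the grid (p processes, block size `block` in that dimension)."""
    return (g // (block * p)) * block + g % block
```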

    Fast and accurate finite-element multigrid solvers for PDE simulations on GPU clusters

    The main contribution of this thesis is to demonstrate that graphics processors (GPUs), as representatives of emerging many-core architectures, are very well suited for the fast and accurate solution of large sparse linear systems of equations, using parallel multigrid methods on heterogeneous compute clusters. Such systems arise, for instance, in the discretisation of (elliptic) partial differential equations with finite elements. We report on at least one order of magnitude speedup over highly tuned conventional CPU implementations, without sacrificing either accuracy or functionality. In more detail, this thesis includes the following contributions: Single precision floating point computations may be insufficient for the class of problems considered in this thesis. We revisit mixed precision iterative refinement techniques to not only increase the accuracy of computed results, but also to increase the efficiency of the solution process. Both on CPUs and on GPUs, we demonstrate a significant performance improvement without loss of accuracy compared to computing in high precision only. We present efficient parallelisation techniques for multigrid solvers on graphics hardware, in particular for numerically strong smoothers and preconditioners that are suitable for highly anisotropic grids and operators. For instance, an efficient formulation of the cyclic reduction algorithm to solve tridiagonal systems is developed. In view of hardware-oriented numerics, we carefully analyse the trade-off between numerical and runtime performance for inexact parallelisation techniques that decouple some of the inherently sequential characteristics of strong smoothing operators in favour of better parallelisation properties. For large-scale established software frameworks, re-implementation tailored to novel hardware platforms is often prohibitively expensive. We develop a 'minimally invasive' approach to integrate support for co-processor hardware like GPUs into FEAST, a finite element discretisation and solver toolbox. Our technique has the major advantage that applications built on top of the toolbox do not have to be changed at all to benefit from co-processor acceleration. The approach is evaluated for benchmark problems in linearised elasticity and stationary laminar flow computed on large-scale GPU-enhanced clusters. Good speedup factors and near-ideal weak scalability are observed. The achievable speedup is analysed and a theoretical speedup model is presented. Finally, we provide a historical overview of scientific computing on graphics hardware since the early beginnings in 2001/2002, when GPGPU was an obscure research topic pursued by few, to the widespread adoption nowadays. We discuss the evolution of the hardware and the programming model, and provide a comprehensive bibliography of publications related to PDE simulations on GPUs.
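
    The mixed precision iterative refinement scheme discussed in the abstract follows a simple loop: solve in low precision, compute the residual in high precision, and apply the low-precision correction until the result is accurate enough. The sketch below illustrates the idea with dense NumPy linear algebra on the CPU; it is not FEAST's solver, and for brevity it re-solves the low-precision system each time instead of reusing a factorisation.

```python
import numpy as np

def mixed_precision_refinement(A, b, iters=5):
    """Toy mixed-precision iterative refinement: corrections are computed in
    float32 while residuals and the solution are accumulated in float64."""
    A32 = A.astype(np.float32)
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                                   # residual in double precision
        d = np.linalg.solve(A32, r.astype(np.float32))  # correction in single precision
        x += d.astype(np.float64)
    return x
```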

    Linear-time encoding and decoding of low-density parity-check codes

    Low-density parity-check (LDPC) codes had a renaissance when they were rediscovered in the 1990s. Since then, LDPC codes have been an important part of the field of error-correcting codes and have been shown to approach the Shannon capacity, the limit at which we can reliably transmit information over noisy channels. Following this, many modern communications standards have adopted LDPC codes. Error correction is as important in protecting data from corruption on a hard drive as it is in deep-space communications; it is most commonly used, for example, for reliable wireless transmission of data to mobile devices. For practical purposes, both encoding and decoding need to be of low complexity to achieve high throughput and low power consumption. This thesis provides a literature review of the current state of the art in encoding and decoding of LDPC codes. Message-passing decoders are still capable of achieving the best error-correcting performance, while more recently considered bit-flipping decoders provide a low-complexity alternative, albeit with some loss in error-correcting performance. An implementation of a low-complexity stochastic bit-flipping decoder is also presented. It is implemented for Graphics Processing Units (GPUs) in a parallel fashion, providing a peak throughput of 1.2 Gb/s, which is significantly higher than previous decoder implementations on GPUs. The error-correcting performance of a range of decoders has also been tested, showing that the stochastic bit-flipping decoder provides relatively good error-correcting performance with low complexity. Finally, a brief comparison of encoding complexities for two code ensembles is also presented.
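
    For context, a plain hard-decision bit-flipping decoder can be written in a few lines: in every iteration each bit counts how many of its parity checks fail, and the bits involved in the most failing checks are flipped. The sketch below is this textbook Gallager-style variant, not the stochastic GPU decoder implemented in the thesis; H is assumed to be a dense 0/1 parity-check matrix and y the received hard-decision vector.

```python
import numpy as np

def bit_flip_decode(H, y, max_iters=50):
    """Gallager-style hard-decision bit-flipping decoding of a binary LDPC code."""
    x = np.array(y, dtype=np.uint8, copy=True)
    for _ in range(max_iters):
        syndrome = (H @ x) % 2              # 1 for every unsatisfied parity check
        if not syndrome.any():
            return x, True                  # all checks satisfied: valid codeword
        fails = H.T @ syndrome              # per-bit count of unsatisfied checks
        x[fails == fails.max()] ^= 1        # flip the bits with the most failures
    return x, False
```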