    As the base of the software stack, system-level software is expected to provide efficient and scalable storage, communication, security and resource management functionalities. However, there are many computationally expensive functionalities at the system level, such as encryption, packet inspection, and error correction, all of which require substantial computing power. Moreover, today's application workloads have reached gigabyte and terabyte scales, which demand even more computing power. To meet the rapidly increasing demand for computing power at the system level, this dissertation proposes using parallel graphics processing units (GPUs) in system software. GPUs excel at parallel computing, and their parallel performance is improving much faster than that of central processing units (CPUs). However, system-level software was originally designed to be latency-oriented, whereas GPUs are designed for long-running computation and large-scale data processing, which are throughput-oriented. This mismatch makes it difficult to fit system-level software to GPUs. This dissertation presents generic principles of system-level GPU computing, developed while creating our two general frameworks for integrating GPU computing into storage and network packet processing. The principles are generic design techniques and abstractions that deal with common system-level GPU computing challenges. They have been evaluated in concrete cases, including storage and network packet processing applications that have been augmented with GPU computing. The significant performance improvement found in the evaluation shows the effectiveness and efficiency of the proposed techniques and abstractions. This dissertation also presents a literature survey of the relatively young system-level GPU computing area, introducing the state of the art in both applications and techniques, as well as their future potential.

    Accelerated volumetric reconstruction from uncalibrated camera views

    While both work with images, computer graphics and computer vision are inverse problems. Computer graphics traditionally starts with input geometric models and produces image sequences. Computer vision starts with input image sequences and produces geometric models. In the last few years, research has converged to bridge the gap between the two fields. This convergence has produced a new field called Image-Based Modeling and Rendering (IBMR). IBMR represents the effort to use the geometric information recovered from real images to generate new images, with the hope that the synthesized images appear photorealistic and that the time spent on model creation is reduced. In this dissertation, the capture, geometric, and photometric aspects of an IBMR system are studied. A versatile framework was developed that enables the reconstruction of scenes from images acquired with a handheld digital camera. The proposed system targets applications in areas such as Computer Gaming and Virtual Reality, from a low-cost perspective. In the spirit of IBMR, the human operator is allowed to provide the high-level information, while underlying algorithms perform the low-level computational work. Conforming to the latest architecture trends, we propose a streaming voxel carving method that allows fast GPU-based processing on commodity hardware.
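
The carving idea mentioned above can be illustrated with a minimal sketch. It is not the streaming GPU method of the dissertation: it assumes three orthographic views aligned with the grid axes (real uncalibrated views would need estimated camera projections), and the names `carve` and `silhouettes` are hypothetical. A voxel survives only if its projection falls inside the silhouette of every view.

```python
def carve(size, silhouettes):
    """Silhouette-based voxel carving on a size^3 grid.

    `silhouettes` maps a viewing axis ('x', 'y' or 'z') to the set of 2D
    pixels covered by the object when looking along that axis.
    Assumption: orthographic, axis-aligned views (illustrative only).
    """
    # Orthographic projection of a voxel onto the image plane of each view.
    project = {
        "x": lambda x, y, z: (y, z),
        "y": lambda x, y, z: (x, z),
        "z": lambda x, y, z: (x, y),
    }
    kept = set()
    for x in range(size):
        for y in range(size):
            for z in range(size):
                # Keep the voxel only if every view sees it inside the silhouette.
                if all(project[axis](x, y, z) in mask
                       for axis, mask in silhouettes.items()):
                    kept.add((x, y, z))
    return kept


# Silhouettes of a 2x2x2 cube sitting at the origin of a 4x4x4 grid.
square = {(a, b) for a in range(2) for b in range(2)}
voxels = carve(4, {"x": square, "y": square, "z": square})
one_view = carve(4, {"z": square})  # a single view leaves a full column
```

Each voxel is tested independently of the others, which is what makes this family of methods a natural fit for data-parallel hardware.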

    GPU Accelerated protocol analysis for large and long-term traffic traces

    This thesis describes the design and implementation of GPF+, a complete general packet classification system developed using Nvidia CUDA for Compute Capability 3.5+ GPUs. This system was developed with the aim of accelerating the analysis of arbitrary network protocols within network traffic traces using inexpensive, massively parallel commodity hardware. GPF+ and its supporting components are specifically intended to support the processing of large, long-term network packet traces such as those produced by network telescopes, which are currently difficult and time-consuming to analyse. The GPF+ classifier is based on prior research in the field, which produced a prototype classifier called GPF, targeted at Compute Capability 1.3 GPUs. GPF+ greatly extends the GPF model, improving runtime flexibility and scalability, whilst maintaining high execution efficiency. GPF+ incorporates a compact, lightweight register-based state machine that supports massively parallel, multi-match filter predicate evaluation, as well as efficient arbitrary field extraction. GPF+ tracks packet composition during execution, and uses warp voting to adjust processing at runtime so as to avoid redundant memory transactions and unnecessary computation. GPF+ additionally incorporates a 128-bit in-thread cache, accelerated through register shuffling, to speed up access to packet data in slow GPU global memory. GPF+ uses a high-level DSL to simplify protocol and filter creation, whilst better facilitating protocol reuse. The system is supported by a pipeline of multi-threaded high-performance host components, which communicate asynchronously through 0MQ messaging middleware to buffer, index, and dispatch packet data on the host system. The system was evaluated using high-end Kepler (Nvidia GTX Titan) and entry-level Maxwell (Nvidia GTX 750) GPUs. The results of this evaluation showed high system performance, limited only by device-side IO (600 MBps) in all tests. 
GPF+ maintained high occupancy and device utilisation in all tests, without significant serialisation, and showed improved scaling to more complex filter sets. Results were used to visualise captures of up to 160 GB in seconds, and to extract and pre-filter captures small enough to be easily analysed in applications such as Wireshark
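
To make "multi-match filter predicate evaluation" and "field extraction" concrete, here is a minimal sequential sketch. It is not GPF+ itself, which runs a register-based state machine on the GPU and is driven by its own DSL. The filter table `FILTERS`, the helper names, and the restriction to raw IPv4 packets carrying TCP or UDP are all assumptions made for illustration. Every packet is tested against every filter, and the matches are returned as a bitmask.

```python
import struct

# Hypothetical filter set: (name, predicate over protocol and ports).
FILTERS = [
    ("tcp", lambda proto, sport, dport: proto == 6),
    ("udp", lambda proto, sport, dport: proto == 17),
    ("dns", lambda proto, sport, dport: proto == 17 and 53 in (sport, dport)),
]


def extract_fields(packet):
    """Extract protocol and ports from a raw IPv4 packet (no link-layer header).

    Assumption: the payload is TCP or UDP, so the first four bytes after the
    IPv4 header are the source and destination ports.
    """
    ihl = (packet[0] & 0x0F) * 4   # IPv4 header length in bytes
    proto = packet[9]              # protocol field sits at byte offset 9
    sport, dport = struct.unpack_from("!HH", packet, ihl)
    return proto, sport, dport


def classify(packet):
    """Return a bitmask with bit i set when FILTERS[i] matches the packet."""
    fields = extract_fields(packet)
    mask = 0
    for bit, (_name, predicate) in enumerate(FILTERS):
        if predicate(*fields):
            mask |= 1 << bit
    return mask


def make_packet(proto, sport, dport):
    """Build a minimal 20-byte IPv4 header followed by two port fields."""
    header = bytes([0x45]) + bytes(8) + bytes([proto]) + bytes(10)
    return header + struct.pack("!HH", sport, dport)


dns_mask = classify(make_packet(17, 40000, 53))   # matches "udp" and "dns"
web_mask = classify(make_packet(6, 40000, 80))    # matches "tcp" only
```

Because one packet can satisfy several filters at once, a bitmask per packet is a compact way to record multi-match results. On a GPU, each thread would typically handle one packet.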


    As the number of transistors integrated onto a silicon die continues to increase, compute power is becoming a commodity. This has enabled a whole host of new applications that rely on high-throughput computations. Recently, the need for faster and more cost-effective applications in form-factor-constrained environments has driven an interest in on-chip acceleration of algorithms based on Monte Carlo simula- Processor. Furthermore, we have created a framework for further increasing parallelism by scaling our architecture across multiple compute devices. By extending our original design to a multi-FPGA system, a nearly linear increase in acceleration with logic resources was achieved.

    Hardware Acceleration Using Functional Languages

    The aim of this thesis is to investigate how the functional paradigm can be used for hardware acceleration, with an emphasis on data-parallel tasks. The level of abstraction of traditional hardware description languages, such as VHDL and Verilog, is becoming too low. High-level languages originally designed for software development and modeling, such as C/C++, SystemC and MATLAB, are increasingly used for hardware description at the algorithmic or behavioral level. Functional languages cannot compete with imperative ones in adoption and popularity among programmers, but they outperform them in many respects, such as verifiability, the ability to capture inherent parallelism, and the compactness of code. Data-parallel tasks are often accelerated on FPGAs, GPUs and multicore processors. In this thesis, we extend Accelerate, an existing library for general-purpose GPU programs, to produce VHDL. Accelerate is a domain-specific language embedded in Haskell with a backend for the NVIDIA CUDA platform. We reuse the language and its frontend, and create a new backend for high-level synthesis of circuits in VHDL.

    A Real-Time Predictive Vehicular Collision Avoidance System on an Embedded General-Purpose GPU

    Collision avoidance is an essential capability for autonomous and assisted-driving ground vehicles. In this work, we developed a novel intelligent collision avoidance (CA) algorithm, based on model predictive control, for a multi-trailer industrial ground vehicle and implemented it on a General Purpose Graphical Processing Unit (GPGPU). The CA problem is formulated as a multi-objective optimal control problem and solved in real time using a limited look-ahead control scheme. Through hardware-in-the-loop simulations and experimental results obtained in this work, we have demonstrated that the proposed algorithm, using NVIDIA's CUDA framework and the NVIDIA Jetson TX2 development platform, is capable of dynamically assisting drivers and keeping the vehicle at a safe distance from detected obstacles on the fly. We have demonstrated that a GPGPU, paired with an appropriate algorithm, can be the key enabler in relieving the computational burden commonly associated with model-based control problems, thus making them suitable for real-time applications.
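
A limited look-ahead scheme can be sketched in a few lines. The example below is a toy, not the authors' algorithm: it assumes a point vehicle moving along one axis toward a static obstacle, a small discrete set of candidate accelerations, and a hand-picked weighting of two objectives (safety and speed tracking). All names and constants are hypothetical. At each control step, every candidate is simulated over a short horizon and the cheapest one is applied.

```python
def lookahead_cost(gap, speed, accel, steps, dt, safe_gap, target_speed):
    """Simulate `steps` of constant acceleration and accumulate the cost."""
    cost = 0.0
    for _ in range(steps):
        speed = max(0.0, speed + accel * dt)   # the vehicle cannot reverse
        gap -= speed * dt                      # distance left to the obstacle
        if gap < safe_gap:
            cost += 1000.0 * (safe_gap - gap)  # safety objective (assumed weight)
        cost += (target_speed - speed) ** 2    # speed-tracking objective
    return cost


def choose_accel(gap, speed, candidates=(-3.0, -1.0, 0.0, 1.0),
                 steps=10, dt=0.2, safe_gap=5.0, target_speed=5.0):
    """Pick the candidate acceleration with the lowest look-ahead cost."""
    return min(
        candidates,
        key=lambda a: lookahead_cost(gap, speed, a, steps, dt,
                                     safe_gap, target_speed),
    )


far = choose_accel(gap=100.0, speed=5.0)   # obstacle far away
near = choose_accel(gap=8.0, speed=5.0)    # obstacle close ahead
```

The candidate evaluations are independent of one another, which is the kind of workload a GPGPU can evaluate in parallel when the candidate set and horizon grow.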

    Development of a GPGPU accelerated tool to simulate advection-reaction-diffusion phenomena in 2D

    Computational models are powerful tools for the study of environmental systems, playing a fundamental role in several fields of research (hydrological sciences, biomathematics, atmospheric sciences, geosciences, among others). Most of these models require high computational capacity, especially when one considers high spatial resolution and application to large areas. In this context, the exponential increase in computational power brought by General Purpose Graphics Processing Units (GPGPU) has drawn the attention of scientists and engineers to the development of low-cost, high-performance parallel implementations of environmental models. In this research, we apply GPGPU computing to the development of a model that describes the physical processes of advection, reaction and diffusion. This dissertation is presented in the form of three self-contained articles. In the first one, we present a GPGPU implementation for the solution of the 2D groundwater flow equation in unconfined aquifers for heterogeneous and anisotropic media. We implement a finite difference solution scheme based on the Crank-Nicolson method and show that the GPGPU-accelerated solution implemented using CUDA C/C++ (Compute Unified Device Architecture) greatly outperforms the corresponding serial solution implemented in C/C++. The results show that the GPGPU-accelerated implementation is capable of delivering a speedup of up to 56 times in the solution process using an ordinary office computer. In the second article, we study the application of a diffusive-logistic growth (DLG) model to the problem of forest growth and regeneration. The study focuses on vegetation belonging to preservation areas, such as riparian buffer zones. 
The study was developed in two stages: (i) a methodology based on Artificial Neural Network Ensembles (ANNE) was applied to evaluate the width of riparian buffer required to filter 90% of the residual nitrogen; (ii) the DLG model was calibrated and validated to generate a forecast of forest regeneration in riparian protection bands considering the minimum widths indicated by the ANNE. The solution was implemented on GPGPU and applied to simulate the forest regeneration process over forty years on the riparian protection bands along the Ligeiro river, in Brazil. The results from calibration and validation showed that the DLG model provides fairly accurate results for the modelling of forest regeneration. In the third manuscript, we present a GPGPU implementation of the solution of the advection-reaction-diffusion equation in 2D. The implementation is designed to be general and flexible to allow the modeling of a wide range of processes, including those with heterogeneity and anisotropy. We show that simulations performed on GPGPU allow the use of mesh grids containing more than 20 million points, corresponding to an area of 18,000 km² at the standard 30 m resolution of Landsat images.
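
The Crank-Nicolson scheme named in the first article can be shown on a much simpler problem. The sketch below is not the thesis implementation: it assumes 1D diffusion with a constant coefficient and zero Dirichlet boundaries, it runs serially, and the function names are hypothetical. With r = D·Δt/Δx², each step solves the tridiagonal system (I − (r/2)A)uⁿ⁺¹ = (I + (r/2)A)uⁿ, where A is the standard second-difference matrix.

```python
def thomas(a, b, c, d):
    """Solve a tridiagonal system (a: sub-, b: main, c: super-diagonal, d: rhs)."""
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):                      # forward elimination
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):             # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x


def crank_nicolson_step(u, r):
    """Advance the interior values `u` by one time step.

    Assumptions: 1D diffusion, constant r = D*dt/dx**2, boundary values
    fixed at zero just outside the list `u`.
    """
    n = len(u)
    rhs = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        # Explicit half of the scheme: (I + r/2 * A) u
        rhs.append(u[i] + 0.5 * r * (left - 2.0 * u[i] + right))
    # Implicit half of the scheme: (I - r/2 * A) u_new = rhs
    return thomas([-0.5 * r] * n, [1.0 + r] * n, [-0.5 * r] * n, rhs)


# A unit spike in the middle of the domain spreads out symmetrically.
profile = [0.0, 0.0, 1.0, 0.0, 0.0]
for _ in range(10):
    profile = crank_nicolson_step(profile, 0.5)
```

In 2D with heterogeneous, anisotropic media, the linear system is no longer a single tridiagonal solve, which is where most of the computational cost arises.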