440 research outputs found

    GPU High-Performance Framework for PIC-like Simulation Methods Using the Vulkan® Explicit API

    Get PDF
    Within computational continuum mechanics there exists a large category of simulation methods which operate by tracking Lagrangian particles over an Eulerian background grid. These Lagrangian/Eulerian hybrid methods, descendants of the Particle-In-Cell method (PIC), have proven highly effective at simulating a broad range of materials and mechanics including fluids, solids, granular materials, and plasma. These methods remain an area of active research after several decades, and their applications can be found across scientific, engineering, and entertainment disciplines. This thesis presents a GPU driven PIC-like simulation framework created using the Vulkan® API. Vulkan is a cross-platform and open-standard explicit API for graphics and GPU compute programming. Compared to its predecessors, Vulkan offers lower overhead, support for host parallelism, and finer grain control over both device resources and scheduling. This thesis harnesses those advantages to create a programmable GPU compute pipeline backed by a Vulkan adaptation of the SPgrid data-structure and multi-buffered particle arrays. The CPU host system works asynchronously with the GPU to maximize utilization of both the host and device. The framework is demonstrated to be capable of supporting Particle-in-Cell like simulation methods, making it viable for GPU acceleration of many Lagrangian particle on Eulerian grid hybrid methods. This novel framework is the first of its kind to be created using Vulkan® and to take advantage of GPU sparse memory features for grid sparsity

    Effect of program structure on program behaviour in virtual memory systems

    Get PDF

    Virtualizing Data Parallel Systems for Portability, Productivity, and Performance.

    Full text link
    Computer systems equipped with graphics processing units (GPUs) have become increasingly common over the last decade. In order to utilize the highly data parallel architecture of GPUs for general purpose applications, new programming models such as OpenCL and CUDA were introduced, showing that data parallel kernels on GPUs can achieve speedups by several orders of magnitude. With this success, applications from a variety of domains have been converted to use several complicated OpenCL/CUDA data parallel kernels to benefit from data parallel systems. Simultaneously, the software industry has experienced a massive growth in the amount of data to process, demanding more powerful workhorses for data parallel computation. Consequently, additional parallel computing devices such as extra GPUs and co-processors are attached to the system, expecting more performance and capability to process larger data. However, these programming models expose hardware details to programmers, such as the number of computing devices, interconnects, and physical memory size of each device. This degrades productivity in the software development process as programmers must manually split the workload with regard to hardware characteristics. This process is tedious and prone to errors, and most importantly, it is hard to maximize the performance at compile time as programmers do not know the runtime behaviors that can affect the performance such as input size and device availability. Therefore, applications lack portability as they may fail to run due to limited physical memory or experience suboptimal performance across different systems. To cope with these challenges, this thesis proposes a dynamic compiler framework that provides the OpenCL applications with an abstraction layer for physical devices. This abstraction layer virtualizes physical devices and memory sub-systems, and transparently orchestrates the execution of multiple data parallel kernels on multiple computing devices. The framework significantly improves productivity as it provides hardware portability, allowing programmers to write an OpenCL program without being concerned of the target devices. Our framework also maximizes performance by balancing the data parallel workload considering factors like kernel dependencies, device performance variation on workloads of different sizes, the data transfer cost over the interconnect between devices, and physical memory limits on each device.PhDComputer Science and EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/113361/1/jhaeng_1.pd

    Optimisation of a parallel ocean general circulation model

    Get PDF
    Abstract. This paper presents the development of a general-purpose parallel ocean circulation model, for use on a wide range of computer platforms, from traditional scalar machines to workstation clusters and massively parallel processors. Parallelism is provided, as a modular option, via high-level message-passing rou- tines, thus hiding the technical intricacies from the user. An initial implementation highlights that the parallel e?ciency of the model is adversely a?ected by a number of factors, for which optimisations are discussed and implemented. The resulting ocean code is portable and, in particular, allows science to be achieved on local workstations that could otherwise only be undertaken on state-of-the-art supercomputers

    Efficient And Scalable Evaluation Of Continuous, Spatio-temporal Queries In Mobile Computing Environments

    Get PDF
    A variety of research exists for the processing of continuous queries in large, mobile environments. Each method tries, in its own way, to address the computational bottleneck of constantly processing so many queries. For this research, we present a two-pronged approach at addressing this problem. Firstly, we introduce an efficient and scalable system for monitoring traditional, continuous queries by leveraging the parallel processing capability of the Graphics Processing Unit. We examine a naive CPU-based solution for continuous range-monitoring queries, and we then extend this system using the GPU. Additionally, with mobile communication devices becoming commodity, location-based services will become ubiquitous. To cope with the very high intensity of location-based queries, we propose a view oriented approach of the location database, thereby reducing computation costs by exploiting computation sharing amongst queries requiring the same view. Our studies show that by exploiting the parallel processing power of the GPU, we are able to significantly scale the number of mobile objects, while maintaining an acceptable level of performance. Our second approach was to view this research problem as one belonging to the domain of data streams. Several works have convincingly argued that the two research fields of spatiotemporal data streams and the management of moving objects can naturally come together. [IlMI10, ChFr03, MoXA04] For example, the output of a GPS receiver, monitoring the position of a mobile object, is viewed as a data stream of location updates. This data stream of location updates, along with those from the plausibly many other mobile objects, is received at a centralized server, which processes the streams upon arrival, effectively updating the answers to the currently active queries in real time. iv For this second approach, we present GEDS, a scalable, Graphics Processing Unit (GPU)-based framework for the evaluation of continuous spatio-temporal queries over spatiotemporal data streams. Specifically, GEDS employs the computation sharing and parallel processing paradigms to deliver scalability in the evaluation of continuous, spatio-temporal range queries and continuous, spatio-temporal kNN queries. The GEDS framework utilizes the parallel processing capability of the GPU, a stream processor by trade, to handle the computation required in this application. Experimental evaluation shows promising performance and shows the scalability and efficacy of GEDS in spatio-temporal data streaming environments. Additional performance studies demonstrate that, even in light of the costs associated with memory transfers, the parallel processing power provided by GEDS clearly counters and outweighs any associated costs. Finally, in an effort to move beyond the analysis of specific algorithms over the GEDS framework, we take a broader approach in our analysis of GPU computing. What algorithms are appropriate for the GPU? What types of applications can benefit from the parallel and stream processing power of the GPU? And can we identify a class of algorithms that are best suited for GPU computing? To answer these questions, we develop an abstract performance model, detailing the relationship between the CPU and the GPU. From this model, we are able to extrapolate a list of attributes common to successful GPU-based applications, thereby providing insight into which algorithms and applications are best suited for the GPU and also providing an estimated theoretical speedup for said GPU-based application

    Real-time high-performance computing for embedded control systems

    Get PDF
    The real-time control systems industry is moving towards the consolidation of multiple computing systems into fewer and more powerful ones, aiming for a reduction in size, weight, and power. The increasing demand for higher performance in other critical domains like autonomous driving has led the industry to recently include embedded GPUs for the implementation of advanced functionalities. The highly parallel architecture of GPUs could also be leveraged in the control systems industry to develop more advanced, energy-efficient, and scalable control systems. However, the closed-source and non-deterministic nature of GPUs complicates the resource provisioning analysis required for the implementation of critical real-time systems. On the other hand, there is no indication of the integration of GPUs in the traditional development cycle of control systems, which is oriented to the use of a model-based design approach. Recently, some model-based design tools vendors have extended their development frameworks with GPU code generation capabilities targeting hybrid computing platforms, so that the model-based design environment now enables the concurrent analysis of more complex and diverse functions by simulation and automating the deployment to the final target. However, there is no indication whether these tools are well-suited for the design and development of time-sensitive systems. Motivated by these challenges, in this thesis, we contribute to the state of the art of real-time control systems towards the adoption of embedded GPUs by providing tools to facilitate the resource provisioning analysis and the integration in the model-based design development cycle. First, we present a methodology and an automated tool to extract the properties of GPU memory allocators. This tool allows the computation of the real amount of memory used by GPU applications, facilitating a correct resource provisioning analysis. Then, we present a library which allows the characterization of the use of dynamic memory in GPU applications. We use this library to characterize GPU benchmarks and we identify memory allocation patterns that could be modified to improve performance and memory consumption when targeting embedded GPUs. Based on these results, we present a tool to optimize the use of dynamic memory in legacy GPU applications executed on embedded platforms. This tool allows us to minimize the memory consumption and memory management overhead of GPU applications without rewriting them. Afterwards, we analyze the timing of control algorithms executed in embedded GPUs and we identify techniques to achieve an acceptable real-time behavior. Finally, we evaluate model-based design tools in terms of integration with GPU hardware and GPU code generation, and we propose improvements for the model-based generated GPU code. Then, we present a source-to-source transformation tool to automatically apply the proposed improvements.La industria de los sistemas de control en tiempo real avanza hacia la consolidación de múltiples sistemas informáticos en menos y más potentes sistemas, con el objetivo de reducir el tamaño, el peso y el consumo. La creciente demanda de un mayor rendimiento en otros dominios críticos, como la conducción autónoma, ha llevado a la industria a incluir recientemente GPU embebidas para la implementación de funcionalidades avanzadas. La arquitectura altamente paralela de las GPU también podría aprovecharse en la industria de los sistemas de control para desarrollar sistemas de control más avanzados, eficientes energéticamente y escalables. Sin embargo, la naturaleza privativa y no determinista de las GPUs complica el análisis de aprovisionamiento de recursos requerido para la implementación de sistemas críticos en tiempo real. Por otro lado, no hay indicios de la integración de las GPU en el ciclo de desarrollo tradicional de los sistemas de control, que está orientado al uso de un enfoque de diseño basado en modelos. Recientemente, algunos proveedores de herramientas de diseño basado en modelos han ampliado sus entornos de desarrollo con capacidades de generación de código de GPU dirigidas a plataformas informáticas híbridas, de modo que el entorno de diseño basado en modelos ahora permite el análisis simultáneo de funciones más complejas y diversas mediante la simulación y la automatización de la implementación para el objetivo final. Sin embargo, no hay indicación de si estas herramientas son adecuadas para el diseño y desarrollo de sistemas sensibles al tiempo. Motivados por estos desafíos, en esta tesis contribuimos al estado del arte de los sistemas de control en tiempo real hacia la adopción de GPUs integradas al proporcionar herramientas para facilitar el análisis de aprovisionamiento de recursos y la integración en el ciclo de desarrollo de diseño basado en modelos. Primero, presentamos una metodología y una herramienta automatizada para extraer las propiedades de los asignadores de memoria en GPUs. Esta herramienta permite el cómputo de la cantidad real de memoria utilizada por las aplicaciones GPU, facilitando un correcto análisis del aprovisionamiento de recursos. Luego, presentamos una librería que permite la caracterización del uso de memoria dinámica en aplicaciones de GPU. Usamos esta librería para caracterizar una serie de benchmarks GPU e identificamos patrones de asignación de memoria que podrían modificarse para mejorar el rendimiento y el consumo de memoria al utilizar GPUs embebidas. Con base en estos resultados, presentamos también una herramienta para optimizar el uso de la memoria dinámica en aplicaciones de GPU heredadas al ser ejecutadas en plataformas embebidas. Esta herramienta nos permite minimizar el consumo de memoria y la sobrecarga de administración de memoria de las aplicaciones GPU sin necesidad de reescribirlas. Posteriormente, analizamos el tiempo de los algoritmos de control ejecutados en GPUs embebidas e identificamos técnicas para lograr un comportamiento de tiempo real aceptable. Finalmente, evaluamos las herramientas de diseño basadas en modelos en términos de integración con hardware GPU y generación de código GPU, y proponemos mejoras para el código GPU generado por las herramientas basadas en modelos. Luego, presentamos una herramienta de transformación de código fuente para aplicar automáticamente al código generado las mejoras propuestas.Postprint (published version

    Data cache organization for accurate timing analysis

    Get PDF
    corecore