
    Improving GPGPU Energy-Efficiency through Concurrent Kernel Execution and DVFS


    Energy-efficient mobile GPU systems

    The design of mobile GPUs is all about saving energy. Smartphones and tablets are battery-operated and thus any type of rendering needs to use as little energy as possible. Furthermore, smartphones do not include sophisticated cooling systems due to their small size, making heat dissipation a primary concern. Improving the energy efficiency of mobile GPUs will be absolutely necessary to achieve the performance required to satisfy consumer expectations, while maintaining operating time per battery charge and keeping the GPU within its thermal limits. The first step in optimizing energy consumption is to identify the sources of energy drain. Previous studies have demonstrated that the register file is one of the main sources of energy consumption in a GPU. As graphics workloads are highly data- and memory-parallel, GPUs rely on massive multithreading to hide the memory latency and keep the functional units busy. However, aggressive multithreading requires a huge register file to hold the registers of thousands of simultaneous threads. Such a big register file exceeds the power budget typically available for embedded graphics processors and, hence, more energy-efficient memory latency tolerance techniques are necessary. On the other hand, prior research showed that off-chip accesses to system memory are among the most expensive operations in terms of energy in a mobile GPU. Therefore, optimizing memory bandwidth usage is a primary concern in mobile GPU design. Many bandwidth-saving techniques, such as texture compression or ARM's transaction elimination, have been proposed in both industry and academia. The purpose of this thesis is to study the characteristics of mobile graphics processors and mobile workloads in order to propose different energy-saving techniques specifically tailored for the low-power segment. Firstly, we focus on energy-efficient memory latency tolerance.
We analyze several techniques such as multithreading and prefetching and conclude that they are effective but not energy-efficient. Next, we propose an architecture for the fragment processors of a mobile GPU that is based on the decoupled access/execute paradigm. The results obtained by using a cycle-accurate mobile GPU simulator and several commercial Android games show that the decoupled architecture combined with a small degree of multithreading provides the most energy-efficient solution for hiding memory latency. More specifically, the decoupled access/execute-like design with just 4 SIMD threads/processor is able to achieve 97% of the performance of a larger GPU with 16 SIMD threads/processor, while providing 20.5% energy savings on average. Secondly, we focus on optimizing memory bandwidth in a mobile GPU. We analyze the bandwidth usage in a set of commercial Android games and find that most of the bandwidth is employed for fetching textures, and also that consecutive frames share most of the texture dataset as they tend to be very similar. However, the GPU cannot capture inter-frame texture re-use due to the large size of the texture dataset for one frame. Based on this analysis, we propose Parallel Frame Rendering (PFR), a technique that overlaps the processing of multiple frames in order to exploit inter-frame texture re-use and save bandwidth. By processing multiple frames in parallel, textures are fetched once every two frames instead of being fetched on a per-frame basis as in conventional GPUs. PFR provides 23.8% memory bandwidth savings on average in our set of Android games, which results in 12% speedup and 20.1% energy savings. Finally, we improve PFR by introducing a hardware memoization system on top. We analyze the redundancy in mobile games and find that more than 38% of the Fragment Program executions are redundant on average.
We thus propose a task-level hardware-based memoization system that provides 15% speedup and 12% energy savings on average over a PFR-enabled GPU.
The design of mobile GPUs (Graphics Processing Units) is fundamentally centered on saving energy. Smartphones and tablets are battery-powered devices and, therefore, any type of rendering must use as little energy as possible. Improving the energy efficiency of mobile GPUs will be absolutely necessary to reach the performance required to satisfy user expectations without reducing battery life. The first step in optimizing energy consumption is to identify which components are the main consumers of battery energy. Previous studies have identified the register file and accesses to main memory as the largest sources of energy consumption in a GPU. The purpose of this thesis is to study the characteristics of mobile graphics processors and mobile applications in order to propose different energy-saving techniques. First, the research focuses on developing energy-efficient methods for hiding main memory latency. The result of this research is a decoupled architecture for the Fragment Processors of the GPU. Experimental results using a cycle-accurate simulator and several Android games show that a decoupled architecture, combined with a moderate degree of multithreading, provides the most energy-efficient solution for hiding main memory latency. More specifically, the decoupled architecture with just 4 SIMD threads/processor is able to achieve 97% of the performance of a larger GPU with 16 SIMD threads/processor, while reducing energy consumption by 20.5%.
Second, the research focused on optimizing memory bandwidth in a mobile GPU. A study of bandwidth usage in several Android games showed that most of the bandwidth is used to read textures. It also showed that consecutive frames share a large part of their textures. However, the GPU cannot capture texture reuse between frames because the size of the textures used by one frame is much larger than the second-level cache. Based on this analysis, Parallel Frame Rendering (PFR) was developed, a technique that overlaps the processing of multiple consecutive frames in order to exploit inter-frame texture reuse and thereby save bandwidth. By processing multiple frames in parallel, textures are read from main memory once every two frames instead of in every frame, as happens in a conventional GPU. PFR provides average bandwidth savings of 23.8% across several Android games; these bandwidth savings translate into a 12% performance increase and 20.1% energy savings. Finally, PFR was improved by introducing a hardware system capable of avoiding redundant computations. An analysis of several Android games revealed that more than 38% of Fragment Program executions were redundant on average. A hardware system capable of identifying and eliminating part of the redundant computations and memory accesses was therefore proposed; it provides a 15% performance increase and 12% energy savings on average with respect to a PFR-based mobile GPU.
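
The inter-frame texture reuse argument behind Parallel Frame Rendering can be illustrated with a toy cache model. The sketch below is not the thesis's simulator. It assumes an LRU texture cache, two consecutive frames that touch exactly the same texture tiles in the same order, and hypothetical sizes (`NUM_TILES`, `CACHE_TILES`) chosen only so that one frame's working set exceeds the cache.

```python
from collections import OrderedDict


def count_misses(accesses, capacity):
    """Count misses of an LRU cache holding `capacity` texture tiles."""
    cache = OrderedDict()
    misses = 0
    for tile in accesses:
        if tile in cache:
            cache.move_to_end(tile)        # hit: mark as most recently used
        else:
            misses += 1                    # miss: tile is fetched from memory
            cache[tile] = True
            if len(cache) > capacity:
                cache.popitem(last=False)  # evict the least recently used tile
    return misses


def conventional_order(num_tiles, num_frames=2):
    """Render frame after frame: each frame walks all tiles before the next starts."""
    return [t for _ in range(num_frames) for t in range(num_tiles)]


def parallel_frame_order(num_tiles, num_frames=2):
    """Interleave frames: each tile is used by all frames before moving on."""
    return [t for t in range(num_tiles) for _ in range(num_frames)]


# Hypothetical sizes: one frame's texture set is 4x larger than the cache.
NUM_TILES = 64
CACHE_TILES = 16

conventional = count_misses(conventional_order(NUM_TILES), CACHE_TILES)
interleaved = count_misses(parallel_frame_order(NUM_TILES), CACHE_TILES)
print("conventional fetches:", conventional)
print("interleaved fetches: ", interleaved)
```

In this idealized setup the interleaved order fetches each tile once for both frames, while the conventional order fetches it once per frame. Real frames are only similar, not identical, and textures are only part of the memory traffic, so the toy halving of fetches should not be read as a prediction of the savings reported in the abstract.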

    Research on real-time physics-based deformation for haptic-enabled medical simulation

    This study developed an effective visuo-haptic surgical engine that handles a variety of surgical manipulations in real time. Soft tissue models are based on biomechanical experiments and continuum mechanics for greater accuracy. Such models will increase the realism of future training systems and of VR/AR/MR implementations for the operating room.

    Techniques of design optimisation for algorithms implemented in software

    The overarching objective of this thesis was to develop tools for parallelising, optimising, and implementing algorithms on parallel architectures, in particular General Purpose Graphics Processors (GPGPUs). Two projects were chosen from different application areas in which GPGPUs are used: a defence application involving image compression, and a modelling application in bioinformatics (computational immunology). Each project had its own specific objectives, as well as supporting the overall research goal. The defence / image compression project was carried out in collaboration with the Jet Propulsion Laboratory. The specific questions were: to what extent an algorithm designed for bit-serial hardware implementation, for the lossless compression of hyperspectral images on board unmanned aerial vehicles (UAVs), could be parallelised; whether GPGPUs could be used to implement that algorithm; and whether a software implementation with or without GPGPU acceleration could match the throughput of a dedicated hardware (FPGA) implementation. The dependencies within the algorithm were analysed, and the algorithm parallelised. The algorithm was implemented in software for GPGPU, and optimised. During the optimisation process, profiling revealed less-than-optimal device utilisation, but no further optimisations resulted in an improvement in speed. The design had hit a local maximum of performance. Analysis of the arithmetic intensity and data flow exposed flaws in the standard optimisation metric of kernel occupancy used for GPU optimisation. Redesigning the implementation with revised criteria (fused kernels, lower occupancy, and greater data locality) led to a new implementation with 10x higher throughput. GPGPUs were shown to be viable for on-board implementation of the CCSDS lossless hyperspectral image compression algorithm, exceeding the performance of the hardware reference implementation, and providing sufficient throughput for the next generation of image sensors as well.
The second project was carried out in collaboration with biologists at the University of Arizona and involved modelling a complex biological system: VDJ recombination, which is involved in the formation of T-cell receptors (TCRs). Generation of immune receptors (T-cell receptors and antibodies) by VDJ recombination is an enormously complex process, which can theoretically synthesize more than 10^18 variants. Originally thought to be a random process, the underlying mechanisms clearly have a non-random nature that preferentially creates a small subset of immune receptors in many individuals. Understanding this bias is a longstanding problem in the field of immunology. Modelling the process of VDJ recombination to determine the number of ways each immune receptor can be synthesized, previously thought to be untenable, is a key first step in determining how this special population is made. The computational tools developed in this thesis have allowed immunologists for the first time to comprehensively test and invalidate a longstanding theory (convergent recombination) for how this special population is created, while generating the data needed to develop novel hypotheses.
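
The idea of counting the number of ways each immune receptor can be synthesized can be sketched at toy scale. The example below is not the thesis's model. It assumes made-up V and J segments (`V_SEGMENTS`, `J_SEGMENTS` are hypothetical, not germline sequences), ignores D segments and nucleotide insertions, and only allows a few nucleotides (`MAX_TRIM`) to be deleted at the junction.

```python
from collections import Counter
from itertools import product

# Hypothetical toy gene segments, invented for illustration only.
V_SEGMENTS = {"V1": "TGTGCC", "V2": "TGTGCA"}
J_SEGMENTS = {"J1": "CCTTC", "J2": "ACTTC"}
MAX_TRIM = 2  # assumed maximum number of nucleotides deleted at each junction end


def count_generation_paths(v_segments, j_segments, max_trim):
    """Count how many (V, J, V-trim, J-trim) scenarios yield each joined sequence."""
    paths = Counter()
    for v_seq, j_seq in product(v_segments.values(), j_segments.values()):
        for v_trim, j_trim in product(range(max_trim + 1), repeat=2):
            v_part = v_seq[:len(v_seq) - v_trim]  # delete from the 3' end of V
            j_part = j_seq[j_trim:]               # delete from the 5' end of J
            paths[v_part + j_part] += 1
    return paths


paths = count_generation_paths(V_SEGMENTS, J_SEGMENTS, MAX_TRIM)
for sequence, ways in paths.most_common(3):
    print(sequence, ways)
```

Even in this tiny model, different segment choices and deletions converge on the same sequence, so some sequences have many generation paths and others only one. Exhaustive enumeration of this kind grows quickly with realistic segment sets, deletions and insertions, which is the scale problem the abstract refers to.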

    NASA Technology Applications Team: Commercial applications of aerospace technology

    The Research Triangle Institute (RTI) Team has maintained its focus on helping NASA establish partnerships with U.S. industry for dual use development and technology commercialization. Our emphasis has been on outcomes, such as licenses, industry partnerships and commercialization of technologies, that are important to NASA in its mission of contributing to the improved competitive position of U.S. industry. The RTI Team has been successful in the development of NASA/industry partnerships and commercialization of NASA technologies. RTI's ongoing commitment to quality and customer responsiveness has driven our staff to continuously improve our technology transfer methodologies to meet NASA's requirements. For example, RTI has emphasized the following areas: (1) Methodology For Technology Assessment and Marketing: RTI has developed and implemented effective processes for assessing the commercial potential of NASA technologies. These processes resulted from an RTI study of best practices, hands-on experience, and extensive interaction with the NASA Field Centers to adapt to their specific needs. (2) Effective Marketing Strategies: RTI surveyed industry technology managers to determine effective marketing tools and strategies. The Technology Opportunity Announcement format and content were developed as a result of this industry input. For technologies with a dynamic visual impact, RTI has developed a stand-alone demonstration diskette that was successful in developing industry interest in licensing the technology. And (3) Responsiveness to NASA Requirements: RTI listened to our customer (NASA) and designed our processes to conform with the internal procedures and resources at each NASA Field Center and the direction provided by NASA's Agenda for Change. This report covers the activities of the Research Triangle Institute Technology Applications Team for the period 1 October 1993 through 31 December 1994.

    NASA Technology Applications Team: Commercial applications of aerospace technology

    The Research Triangle Institute (RTI) is pleased to report the results of NASA contract NASW-4367, 'Operation of a Technology Applications Team'. Through a period of significant change within NASA, the RTI Team has maintained its focus on helping NASA establish partnerships with U.S. industry for dual use development and technology commercialization. Our emphasis has been on outcomes, such as licenses, industry partnerships and commercialization of technologies, that are important to NASA in its mission of contributing to the improved competitive position of U.S. industry. RTI's ongoing commitment to quality and customer responsiveness has driven our staff to continuously improve our technology transfer methodologies to meet NASA's requirements. For example, RTI has emphasized the following areas: (1) Methodology For Technology Assessment and Marketing: RTI has developed and implemented effective processes for assessing the commercial potential of NASA technologies. These processes resulted from an RTI study of best practices, hands-on experience, and extensive interaction with the NASA Field Centers to adapt to their specific needs; (2) Effective Marketing Strategies: RTI surveyed industry technology managers to determine effective marketing tools and strategies. The Technology Opportunity Announcement format and content were developed as a result of this industry input. For technologies with a dynamic visual impact, RTI has developed a stand-alone demonstration diskette that was successful in developing industry interest in licensing the technology; and (3) Responsiveness to NASA Requirements: RTI listened to our customer (NASA) and designed our processes to conform with the internal procedures and resources at each NASA Field Center and the direction provided by NASA's Agenda for Change. This report covers the activities of the Research Triangle Institute Technology Applications Team for the period 1 October 1993 through 31 December 1994.

    High-fidelity graphics using unconventional distributed rendering approaches

    High-fidelity rendering requires a substantial amount of computational resources to accurately simulate lighting in virtual environments. While desktop computing, with the aid of modern graphics hardware, has shown promise in delivering realistic rendering at interactive rates, real-time rendering of moderately complex scenes is still unachievable on the majority of desktop machines and the vast plethora of mobile computing devices that have recently become commonplace. This work provides a wide range of computing devices with high-fidelity rendering capabilities via oft-unused distributed computing paradigms. It speeds up the rendering process on formerly capable devices and provides full functionality to incapable devices. Novel scheduling and rendering algorithms have been designed to best take advantage of the characteristics of these systems and demonstrate the efficacy of such distributed methods. The first is a novel system that provides multiple clients with parallel resources for rendering a single task, and adapts in real-time to the number of concurrent requests. The second is a distributed algorithm for the remote asynchronous computation of the indirect diffuse component, which is merged with locally-computed direct lighting for a full global illumination solution. The third is a method for precomputing indirect lighting information for dynamically-generated multi-user environments by using the aggregated resources of the clients themselves. The fourth is a novel peer-to-peer system for improving the rendering performance in multi-user environments through the sharing of computation results, propagated via a mechanism based on epidemiology. The results demonstrate that the boundaries of the distributed computing typically used for computer graphics can be significantly and successfully expanded by adapting alternative distributed methods
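
The abstract does not describe the epidemiology-based propagation mechanism in detail. As a rough illustration of the general idea, the sketch below simulates a generic push-style gossip protocol in which every peer that already holds a computation result forwards it to one randomly chosen peer per round. The function name `gossip_rounds` and all parameters are assumptions for this example, not part of the described system.

```python
import random


def gossip_rounds(num_peers, seed=0, max_rounds=1000):
    """Simulate push gossip; return (rounds used, set of informed peers)."""
    rng = random.Random(seed)
    informed = {0}          # peer 0 starts with the computed result
    rounds = 0
    while len(informed) < num_peers and rounds < max_rounds:
        newly_informed = set()
        for _ in informed:
            # Each informed peer pushes the result to one random peer.
            newly_informed.add(rng.randrange(num_peers))
        informed |= newly_informed
        rounds += 1
    return rounds, informed


rounds, informed = gossip_rounds(num_peers=64)
print(f"{len(informed)} peers informed after {rounds} rounds")
```

Because the number of informed peers can at most double per round, reaching 64 peers takes at least 6 rounds in this model. The appeal of such schemes is that no central server has to track which peer holds which result.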

    Power System Simulation, Control and Optimization

    This Special Issue “Power System Simulation, Control and Optimization” offers valuable insights into the most recent research developments in these topics. The analysis, operation, and control of power systems are increasingly complex tasks that require advanced simulation models to analyze and control the effects of the transformations affecting today's electricity grids: massive integration of renewable energy, progressive adoption of electric vehicles, development of intelligent networks, and the progressive evolution of artificial intelligence applications.