15 research outputs found

    Towards general purpose computations on low-end mobile GPUs

    GPUs traditionally offer high computational capabilities, frequently higher than those of their CPU counterparts. While vendors of high-end mobile GPUs have recently introduced general-purpose APIs, such as OpenCL, to leverage their computational power, the vast majority of mobile devices lack such support. Although their graphics APIs are similar to desktop graphics APIs, they differ in significant ways that prevent the use of well-known techniques for performing general-purpose computations over such interfaces. In this paper we show how these obstacles can be overcome in order to achieve general-purpose programmability of these devices. As a proof of concept, we implemented our proposal on a real embedded platform (Raspberry Pi), based on Broadcom's VideoCore IV GPU, obtaining a speedup of 7.2× over the CPU. This work has been partially supported by the Spanish Ministry of Science and Innovation under grant TIN2015-65316-P and the HiPEAC Network of Excellence. Leonidas Kosmidis is also funded by the Spanish Ministry of Education under the FPU grant AP2010-4208. Postprint (author's final draft)
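
    One typical obstacle on such low-end GPUs is that the graphics API offers no floating-point textures or render targets, so general-purpose data must be packed into the 8-bit RGBA formats the API does support. The sketch below illustrates only that packing idea on the CPU side with NumPy; it is an assumption-laden stand-in, not the paper's actual OpenGL ES shader code.

        import numpy as np

        def pack_rgba8(values):
            # View each float32 as 4 bytes, i.e., one RGBA8 texel per value.
            return np.ascontiguousarray(values, dtype=np.float32).view(np.uint8).reshape(-1, 4)

        def unpack_rgba8(texels):
            # Reverse the packing after the texels are read back from the GPU.
            return np.ascontiguousarray(texels, dtype=np.uint8).reshape(-1).view(np.float32)

        data = np.array([3.14, -2.5, 1e6], dtype=np.float32)
        assert np.array_equal(unpack_rgba8(pack_rgba8(data)), data)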

    A Machine Learning and Computer Vision Application to Robustly Extract Winnings from Multiple Lottery Tickets in One Shot

    Mega Millions and Powerball are among the most popular American lottery games. This article presents a practical software application that can conveniently examine and evaluate several lottery tickets for prizes using just their images. The application accepts as input a directory containing images of lottery tickets and uses machine learning and computer vision to extract the ticket data: the lottery name, the draw date, the five lottery numbers, the two-digit lottery "ball" number, and the lottery multiplier. The application also retrieves the winning lottery data corresponding to the draw date using a public database API. These are compared with the data collected from each ticket image to establish matches, and the corresponding prize amount is computed. The current version of the application supports GPU usage, and image orientation has no impact on its functionality. It is believed that a considerable portion of the U.S. public participating in the Powerball and Mega Millions lotteries will find such an application beneficial and handy.
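
    As a rough illustration of the extraction stage described above, the sketch below reads a ticket image and OCRs its text with off-the-shelf tools; OpenCV and Tesseract are assumed to be available, and the pattern for a line of five two-digit numbers is a guess, not the authors' actual pipeline.

        import re
        import cv2
        import pytesseract

        def extract_ticket_text(image_path):
            img = cv2.imread(image_path)
            gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
            # Binarize the print before OCR to suppress background noise.
            _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
            text = pytesseract.image_to_string(binary)
            # Hypothetical pattern for a Powerball-style line of five two-digit numbers.
            number_lines = re.findall(r"(?:\d{2}\s+){4}\d{2}", text)
            return text, number_lines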

    Dynamic sampling rate: harnessing frame coherence in graphics applications for energy-efficient GPUs

    In real-time rendering, a 3D scene is modelled with meshes of triangles that the GPU projects to the screen. The triangles are discretized by sampling them at regular spatial intervals to generate fragments, to which a shader program then adds texture and lighting effects. Realistic scenes require detailed geometric models, complex shaders, high-resolution displays and high screen refresh rates, all of which come at a great cost in compute time and energy. This cost is often dominated by the fragment shader, which runs for each sampled fragment. Conventional GPUs sample the triangles once per pixel; however, many screen regions contain low variation, produce identical fragments, and could be sampled at lower-than-pixel rates with no loss in quality. Additionally, since temporal frame coherence makes consecutive frames very similar, such regions usually persist from frame to frame. This work proposes Dynamic Sampling Rate (DSR), a novel hardware mechanism to reduce redundancy and improve energy efficiency in graphics applications. DSR analyzes the spatial frequencies of the scene once it has been rendered. It then leverages the temporal coherence of consecutive frames to decide, for each region of the screen, the lowest sampling rate to employ in the next frame that maintains image quality. We evaluate the performance of a state-of-the-art mobile GPU architecture extended with DSR for a wide variety of applications. Experimental results show that DSR is able to remove most of the redundancy inherent in the color computations at fragment granularity, which brings average speedups of 1.68x and energy savings of 40%. This work has been supported by the CoCoUnit ERC Advanced Grant of the EU's Horizon 2020 program (Grant No. 833057), the Spanish State Research Agency (MCIN/AEI) under Grant PID2020-113172RB-I00, the ICREA Academia program, and the Generalitat de Catalunya under Grant FI-DGR 2016. Funding was provided by Ministerio de Economía, Industria y Competitividad, Gobierno de España (Grant No. TIN2016-75344-R). Peer Reviewed. Postprint (published version)
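
    The core of DSR is a per-region decision: measure how much spatial variation a rendered region contains, then pick the cheapest sampling rate that preserves quality for the same region in the next frame. The sketch below mimics that decision with per-tile variance standing in for the paper's frequency analysis; the tile size, thresholds and rate set are invented for illustration.

        import numpy as np

        TILE = 16

        def next_frame_sampling_rates(frame):
            # frame: (H, W) luminance of the frame just rendered,
            # with H and W assumed to be multiples of TILE.
            h, w = frame.shape
            tiles = frame.reshape(h // TILE, TILE, w // TILE, TILE)
            variance = tiles.var(axis=(1, 3))
            rates = np.full(variance.shape, 1.0)   # default: 1 sample per pixel
            rates[variance < 20.0] = 0.5           # moderate detail: 1 sample per 1x2 block
            rates[variance < 5.0] = 0.25           # flat region: 1 sample per 2x2 block
            # Temporal coherence: these rates are applied when rendering frame N+1.
            return rates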

    Triangle Dropping: An occluded-geometry predictor for energy-efficient mobile GPUs

    This article proposes a novel micro-architectural approach for mobile GPUs aimed at removing occluded geometry from a scene early by leveraging frame-to-frame coherence, thus reducing overall energy consumption. Mobile GPUs commonly implement a Tile-Based Rendering (TBR) architecture with two main phases: the Geometry Pipeline, where all the geometry of a scene is processed, and the Raster Pipeline, where primitives are rendered into a framebuffer. After the Geometry Pipeline, only non-culled primitives inside the camera's frustum are stored in the Parameter Buffer, a data structure kept in DRAM. However, a significant fraction of these non-culled primitives are rendered despite not being visible at all, resulting in useless computation: on average, 60% of them are completely occluded in our benchmarks. Although TBR architectures use on-chip caches for the Parameter Buffer, about 46% of the DRAM traffic still comes from accesses to that buffer. The proposed Triangle Dropping technique leverages the visibility information computed along the Raster Pipeline to predict the primitives' visibility in the next frame and discard early those that will be totally occluded, drastically reducing Parameter Buffer accesses. On average, our approach achieves 14.5% overall energy savings, 28.2% energy-delay product savings, and a speedup of 20.2%. This work has been supported by the CoCoUnit ERC Advanced Grant of the EU's Horizon 2020 program (grant no. 833057), the Spanish State Research Agency (MCIN/AEI) under grant PID2020-113172RB-I00 (AEI/FEDER, EU), and the ICREA Academia program. D. Corbalán-Navarro has also been supported by a PhD research fellowship from the University of Murcia's "Plan Propio de Investigación". Peer Reviewed. Postprint (author's final draft)
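
    The predictor itself reduces to a small amount of state: one visibility bit per primitive, written while rendering frame N and consulted in frame N+1. A minimal software sketch of that logic follows; the real mechanism is a hardware unit, and defaulting unknown primitives to visible is an assumption made here to avoid dropping newly submitted geometry.

        def drop_occluded(primitive_ids, visible_last_frame):
            # visible_last_frame: dict id -> bool recorded during rasterization.
            # Primitives with no record (new this frame) are kept conservatively.
            return [p for p in primitive_ids if visible_last_frame.get(p, True)]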

    Visibility rendering order: Improving energy efficiency on mobile GPUs through frame coherence

    During real-time graphics rendering, objects are processed by the GPU in the order they are submitted by the CPU, and occluded surfaces are often processed even though they will not end up being part of the final image, thus wasting precious time and energy. To help discard occluded surfaces, most current GPUs include an Early-Depth test before the fragment processing stage. However, to be effective it requires that opaque objects be processed in front-to-back order. Depth sorting and other occlusion-culling techniques at the object level incur overheads that are only offset for applications with substantial depth and/or fragment-shading complexity, which is often not the case in mobile workloads. We propose a novel architectural technique for mobile GPUs, Visibility Rendering Order (VRO), which reorders objects front-to-back entirely in hardware by exploiting the fact that objects in animated graphics applications tend to keep their relative depth order across consecutive frames (temporal coherence). Since order relationships are already tested by the Depth Test, VRO incurs minimal energy overhead: it only requires a small hardware unit to capture that information and use it later to guide the rendering of the following frame. Moreover, unlike other approaches, this unit works in parallel with the graphics pipeline without any performance overhead. We illustrate the benefits of VRO using various unmodified commercial 3D applications, for which VRO achieves a 27% speed-up and 14.8% energy reduction on average over a state-of-the-art mobile GPU. Peer Reviewed. Postprint (author's final draft)
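
    A simplified way to picture VRO: depth information observed while rendering frame N determines the submission order of frame N+1. The stand-in below sorts objects by the nearest depth each one produced last frame; the actual technique derives the order in hardware from Depth Test outcomes, so this is only the intuition, not the mechanism.

        def visibility_render_order(object_ids, nearest_depth_last_frame):
            # Objects never seen before carry no depth information and go last.
            return sorted(object_ids,
                          key=lambda o: nearest_depth_last_frame.get(o, float("inf")))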

    Evaluation of real-time LBP computing in multiple architectures

    Local Binary Pattern (LBP) is a texture operator used in several different computer vision applications, in many cases requiring real-time operation on multiple computing platforms. The arrival of new video standards has increased typical resolutions and frame rates, which demand considerable computational performance. Since LBP is essentially a pixel operator that scales with image size, typical straightforward implementations are usually insufficient to meet these requirements. To identify the solutions that maximize the performance of real-time LBP extraction, we compare a series of different implementations in terms of computational performance and energy efficiency, while analyzing the different optimizations that can be made to reach real-time performance on multiple platforms and their different available computing resources. Our contribution extends the existing surveys of LBP implementations on different platforms that can be found in the literature. To provide a more complete evaluation, we have implemented the LBP algorithms on several platforms, such as Graphics Processing Units, mobile processors and a hybrid-programming-model image coprocessor. We have extended the evaluation of some of the solutions found in previous work. In addition, we publish the source code of our implementations.
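
    For reference, the basic 3x3 LBP operator the article benchmarks is compact enough to state directly: each pixel is compared against its eight neighbours and the comparison results are packed into one byte. A plain NumPy version (with an arbitrarily chosen bit order) looks like this:

        import numpy as np

        def lbp_3x3(img):
            img = img.astype(np.int16)
            center = img[1:-1, 1:-1]
            # Eight neighbour offsets, clockwise from the top-left corner.
            shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
                      (1, 1), (1, 0), (1, -1), (0, -1)]
            h, w = img.shape
            code = np.zeros(center.shape, dtype=np.uint8)
            for bit, (dy, dx) in enumerate(shifts):
                neighbour = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
                code |= (neighbour >= center).astype(np.uint8) << bit
            return code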

    Reducing redundancy of real time computer graphics in mobile systems

    The goal of this thesis is to propose novel and effective techniques to eliminate the redundant computations that waste energy in real-time computer graphics applications, with special focus on mobile GPU micro-architecture. Improving the energy efficiency of CPU/GPU systems is not only key to extending their battery life, but also allows performance to increase because, to avoid overheating beyond thermal limits, SoCs tend to be throttled when the load is high for a long period of time. Prior studies have pointed out that the CPU and especially the GPU are the principal energy consumers, with off-chip main-memory accesses and the processors inside the GPU dominating the energy consumption of the graphics subsystem. First, we focus on reducing redundant fragment-processing computations by improving the culling of hidden surfaces. During real-time graphics rendering, objects are processed by the GPU in the order they are submitted by the CPU, and occluded surfaces are often processed even though they will not end up being part of the final image. By the time the GPU realizes that an object, or part of it, is not going to be visible, all the activity required to compute its color and store it has already been performed. We propose a novel architectural technique for mobile GPUs, Visibility Rendering Order (VRO), which reorders objects front-to-back entirely in hardware to maximize the culling effectiveness of the GPU and minimize overshading, hence reducing execution time and energy consumption. VRO exploits the fact that objects in animated graphics applications tend to keep their relative depth order across consecutive frames (temporal coherence) to provide the feeling of smooth transitions. VRO keeps the visibility information of a frame and uses it to reorder the objects of the following frame. It only requires adding a small hardware unit to capture the visibility information and use it later to guide the rendering of the following frame. Moreover, VRO works in parallel with the graphics pipeline, so negligible performance overheads are incurred. We illustrate the benefits of VRO using various unmodified commercial 3D applications, for which VRO achieves a 27% speed-up and 14.8% energy reduction on average. Then, we focus on avoiding redundant computations related to Collision Detection (CD) on the CPU. Graphics applications such as 3D games represent a large percentage of downloaded applications for mobile devices, and the trend is towards more complex and realistic scenes with accurate 3D physics simulations. CD is one of the most important algorithms in any physics kernel, since it identifies the contact points between the objects of a scene and determines when they collide. However, real-time accurate CD is very expensive in terms of energy consumption. We propose Render-Based Collision Detection (RBCD), a novel energy-efficient high-fidelity CD scheme that leverages some intermediate results of the rendering pipeline to perform CD, so that redundant tasks are done just once. Comparing RBCD with a conventional CD executed entirely on the CPU, we show that its execution time is reduced by almost three orders of magnitude (a 600x speedup), because most of the CD task in our model comes for free by reusing the intermediate results of image rendering. Although not guaranteed, such a dramatic time improvement may translate into a higher frame rate if the physics simulation is on the critical path.
    However, the most important advantage of our technique is the enormous energy savings that result from eliminating a long and costly CPU computation and converting it into a few simple operations executed by specialized hardware within the GPU. Our results show that the energy consumed by CD is reduced on average by a factor of 448x (i.e., by 99.8%). These dramatic benefits are accompanied by a higher-fidelity CD analysis (i.e., with finer granularity), which improves the quality and realism of the application. Postprint (published version)
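
    The RBCD idea lends itself to a toy illustration: rasterization already produces per-pixel depth information for each object, and two objects collide wherever their depth extents overlap at the same pixel. The sketch below assumes per-object nearest/farthest depth maps, with +inf/-inf where an object does not cover a pixel; the thesis performs this with specialized hardware inside the GPU.

        import numpy as np

        def rbcd_overlap(min_a, max_a, min_b, max_b):
            # min_*/max_*: (H, W) nearest/farthest per-pixel depth of each object.
            # Depth intervals intersect at a pixel where both orderings hold.
            contact = (min_a <= max_b) & (min_b <= max_a)
            return bool(contact.any()), contact  # collision flag and contact mask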

    Enhancing graphics quality and optimizing power consumption considering the human visual system in mobile devices

    Doctoral dissertation, Graduate School of Seoul National University, Department of Electrical and Computer Engineering, February 2016, Hyunsik Shin. Although GPU hardware has improved remarkably in recent years, it remains difficult to satisfy high-quality graphics requirements while sustaining 60 fps. The recent demand for higher resolutions also poses a serious challenge in terms of power consumption and temperature. Since GPU power consumption is directly proportional to GPU workload, the high workload imposed by a fixed high resolution and frame rate is wasted whenever it brings no benefit from the perspective of human perception. This dissertation proposes new methods for reducing GPU computation in consideration of the human visual system. As a starting point, the main sources of power consumption are analyzed on a commercial LG G3 mobile device, identifying three major factors in mobile GPU power consumption: resolution, frame rate and data redundancy. Based on these factors, new rendering techniques that take human perception into account are proposed to reduce computation effectively. First, workload-reduction techniques based on resolution scaling on the GPU are proposed. Previous approaches do not reflect human perception or content characteristics, so graphical artifacts remain visible. Unlike prior work, the proposed Dynamic Rendering Quality Scaling (DRQS) uses the inter-frame change computed from the transformation matrix to drive resolution scaling and quality-improving upscaling, improving performance by up to 38% with minimal overhead; for low-demand graphics applications it reduces GPU computation by up to 24% without perceptible quality loss. Second, techniques for enhancing graphics quality through frame interpolation are proposed. Recent frame-interpolation methods generate intermediate frames with motion-compensation algorithms whose cost is too high for mobile devices. To address this, a new frame-interpolation scheme that forwards intermediate frames using the tile-based rendering of the GPU generates intermediate frames without latency or significant additional cost, achieving perceptually equivalent graphics quality at about half the system-level computation cost of previous approaches. Finally, a reuse-oriented optimization is proposed for multiple render targets (MRT), a feature introduced in OpenGL ES 3.0. MRT is widely used to process complex lighting efficiently through deferred shading, but because all render targets must be rendered at once it requires large memory bandwidth, a major obstacle in constrained mobile environments. By exploiting temporal redundancy, data already written to render targets is selectively reused, reducing GPU computation and memory usage. Experiments show an 18% reduction in system-level power consumption while maintaining perceptually equivalent graphics quality.
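
    The DRQS decision described above can be pictured as a tiny controller: the magnitude of change between consecutive transformation matrices selects the rendering scale for the next frame. The sketch below is a loose interpretation; the change metric and thresholds are invented for illustration, not taken from the dissertation.

        import numpy as np

        def drqs_render_scale(prev_transform, cur_transform):
            # Frobenius norm of the 4x4 matrix delta as a cheap change metric.
            motion = np.linalg.norm(cur_transform - prev_transform)
            if motion < 0.01:
                return 0.5    # nearly static scene: render at half resolution
            if motion < 0.1:
                return 0.75
            return 1.0        # fast motion: render at full resolution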

    Hardware Accelerators for Animated Ray Tracing

    Future graphics processors are likely to incorporate hardware accelerators for real-time ray tracing, in order to render increasingly complex lighting effects in interactive applications. However, ray tracing poses difficulties when drawing scenes with dynamic content, such as animated characters and objects. In dynamic scenes, the spatial data structures used to accelerate ray tracing are invalidated on each animation frame and need to be rapidly updated. Tree update is a complex subtask in its own right, and becomes highly expensive in complex scenes. Both ray tracing and tree update are highly memory-intensive tasks, and rendering systems are increasingly bandwidth-limited, so research on accelerator hardware has focused on architectural techniques to optimize away off-chip memory traffic. Dynamic scene support is further complicated by the recent introduction of compressed trees, which use low-precision numbers for storage and computation. Such compression reduces both the arithmetic and memory bandwidth cost of ray tracing, but adds to the complexity of tree update. This thesis proposes methods to cope with dynamic scenes in hardware-accelerated ray tracing, with a focus on reducing traffic to external memory. First, a hardware architecture is designed for linear bounding volume hierarchy construction, an algorithm that is a basic building block in most state-of-the-art software tree builders. The algorithm is rearranged into a streaming form that reduces traffic to one-third of that of software implementations of the same algorithm. Second, an algorithm is proposed for compressing bounding volume hierarchies in a streaming manner as they are output from a hardware builder, instead of performing compression as a postprocessing pass. As a result, with the proposed method, compression reduces the overall cost of tree update rather than increasing it. The last main contribution of this thesis is an evaluation of shallow bounding volume hierarchies, common in software ray tracing, for use in hardware pipelines. These are found to be more energy-efficient than binary hierarchies. The results in this thesis both confirm that dynamic scene support may become a bottleneck in real-time ray tracing, and add to the state of the art on tree update in terms of energy efficiency, as well as the complexity of scenes that can be handled in real time on resource-constrained platforms.
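
    The linear BVH construction mentioned above hinges on one building block: quantizing primitive centroids onto a grid, interleaving the coordinate bits into Morton codes, and sorting, which yields the leaf order of the tree in a single streaming pass. A NumPy sketch of that step (10 bits per axis, the usual choice) follows; the thesis implements it as a hardware pipeline, so this only shows the underlying computation.

        import numpy as np

        def expand_bits(v):
            # Spread the low 10 bits of v with two zero bits between each.
            v = (v | (v << 16)) & np.uint32(0xFF0000FF)
            v = (v | (v << 8)) & np.uint32(0x0F00F00F)
            v = (v | (v << 4)) & np.uint32(0xC30C30C3)
            v = (v | (v << 2)) & np.uint32(0x49249249)
            return v

        def morton_order(centroids):
            # centroids: (N, 3) array of triangle centroids, normalized to [0, 1).
            q = np.minimum((centroids * 1024).astype(np.uint32), 1023)
            codes = (expand_bits(q[:, 0]) << 2) | (expand_bits(q[:, 1]) << 1) \
                    | expand_bits(q[:, 2])
            return np.argsort(codes)  # sorted order = BVH leaf order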