
    The LPGPU2 Project: Low-Power Parallel Computing on GPUs: Extended Abstract

    The LPGPU2 project is a 30-month project (Innovation Action) funded by the European Union. Its overall goal is to develop an analysis and visualization framework that enables GPU application developers to improve the performance and power consumption of their applications. Achieving this goal requires several key objectives to be met. First, several applications (use cases) need to be developed for, or ported to, low-power GPUs. These applications then need to be optimized using the tooling framework. In addition, power measurement devices and power models need to be developed that are 10x more accurate than the state of the art. The project consortium actively promotes open, vendor-neutral standards via the Khronos Group. This paper briefly reports on the achievements made in the first half of the project, focusing on the progress made in applications; in power measurement, estimation, and modelling; and in the analysis and visualization tool suite.
    EC/H2020/688759/EU/Low-Power Parallel Computing on GPUs 2/LPGPU

    PoCL-R: An Open Standard Based Offloading Layer for Heterogeneous Multi-Access Edge Computing with Server Side Scalability

    We propose a novel computing runtime that exposes remote compute devices via the cross-vendor open heterogeneous computing standard OpenCL and can execute compute tasks on the multi-access edge computing (MEC) cluster side across multiple servers in a scalable manner. Intermittent connection loss from the user equipment (UE) is handled gracefully even if the device's IP address changes along the way. Network-induced latency is minimized by transferring data and signaling command completions between remote devices in a peer-to-peer fashion directly to the target server with a streamlined TCP-based protocol, which yields a command latency of only 60 microseconds on top of the network round-trip latency in synthetic benchmarks. The runtime can utilize RDMA to speed up inter-server data transfers by an additional 60% compared to the TCP-based solution. The benefits of the proposed runtime in MEC applications are demonstrated with a smartphone-based augmented reality rendering case study. Measurements show up to a 19x improvement in frame rate and a 17x improvement in local energy consumption when using the proposed runtime to offload AR rendering from a smartphone. Scalability to multiple GPU servers in real-world applications is shown with a computational fluid dynamics simulation, which scales with the number of servers at roughly 80% efficiency, comparable to an MPI port of the same simulation.
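    A key design point of the abstract is that remote devices appear as ordinary OpenCL devices, so unmodified host code can offload to an edge cluster. Below is a minimal, self-contained host-side sketch of that pattern; it is plain standard OpenCL, not PoCL-R-specific code, and how the runtime is pointed at a remote server is driver configuration outside the standard API. Error checking and resource release are omitted for brevity.

```cpp
#include <CL/cl.h>
#include <cstdio>
#include <vector>

// Unmodified OpenCL host code: with a runtime like PoCL-R installed as the
// OpenCL implementation, the device enumerated here can be physically remote,
// but the API usage is identical to the local case.
static const char* kSrc =
    "__kernel void vadd(__global const float* a, __global const float* b,\n"
    "                   __global float* c) {\n"
    "    size_t i = get_global_id(0);\n"
    "    c[i] = a[i] + b[i];\n"
    "}\n";

int main() {
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 1, &device, nullptr);

    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, nullptr);
    cl_command_queue q = clCreateCommandQueueWithProperties(ctx, device, nullptr, nullptr);

    const size_t n = 1024;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);
    cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               n * sizeof(float), a.data(), nullptr);
    cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                               n * sizeof(float), b.data(), nullptr);
    cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float), nullptr, nullptr);

    cl_program prog = clCreateProgramWithSource(ctx, 1, &kSrc, nullptr, nullptr);
    clBuildProgram(prog, 1, &device, "", nullptr, nullptr);
    cl_kernel k = clCreateKernel(prog, "vadd", nullptr);
    clSetKernelArg(k, 0, sizeof(cl_mem), &da);
    clSetKernelArg(k, 1, sizeof(cl_mem), &db);
    clSetKernelArg(k, 2, sizeof(cl_mem), &dc);

    // The enqueued command and its completion signal are the traffic that a
    // remoting runtime streams over its low-latency transport (TCP or RDMA).
    clEnqueueNDRangeKernel(q, k, 1, nullptr, &n, nullptr, 0, nullptr, nullptr);
    clEnqueueReadBuffer(q, dc, CL_TRUE, 0, n * sizeof(float), c.data(), 0, nullptr, nullptr);
    std::printf("c[0] = %f\n", c[0]);
    return 0;
}
```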

    A Performance-Portable SYCL Implementation of CRK-HACC for Exascale

    The first generation of exascale systems will include a variety of machine architectures, featuring GPUs from multiple vendors. As a result, many developers are interested in adopting portable programming models to avoid maintaining multiple versions of their code. It is necessary to document experiences with such programming models to assist developers in understanding the advantages and disadvantages of different approaches. To this end, this paper evaluates the performance portability of a SYCL implementation of a large-scale cosmology application (CRK-HACC) running on GPUs from three different vendors: AMD, Intel, and NVIDIA. We detail the process of migrating the original code from CUDA to SYCL and show that specializing kernels for specific targets can greatly improve performance portability without significantly impacting programmer productivity. The SYCL version of CRK-HACC achieves a performance portability of 0.96 with a code divergence of almost 0, demonstrating that SYCL is a viable programming model for performance-portable applications.
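    The abstract does not spell out how the 0.96 score is computed; assuming the widely used definition of performance portability by Pennycook, Sewall, and Lee, it is the harmonic mean of the application's performance efficiency over the platform set:

```latex
\text{PP}(a, p, H) =
\begin{cases}
  \dfrac{|H|}{\sum_{i \in H} \frac{1}{e_i(a, p)}}
    & \text{if } a \text{ is supported on every platform } i \in H, \\
  0 & \text{otherwise,}
\end{cases}
```

    where \(H\) is the set of platforms (here the AMD, Intel, and NVIDIA GPUs) and \(e_i(a, p)\) is the performance efficiency of application \(a\) solving problem \(p\) on platform \(i\). Under this reading, a score of 0.96 means the harmonic-mean efficiency across the three GPUs is 96% of the chosen baseline.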

    Figurations of Timely Extraction


    Design and implementation of a framework for FPGA-based embedded systems applied to an Optimum-Path Forest classifier

    Advisors: Eurípedes Guilherme de Oliveira Nóbrega, Isabelle Fantoni-Coichot, Vincent Frémont. Doctoral thesis, Universidade Estadual de Campinas, Faculdade de Engenharia Mecânica, and Université de Technologie de Compiègne.
    Many modern applications rely on Artificial Intelligence methods such as automatic classification. However, the computational cost associated with these techniques limits their use on resource-constrained embedded platforms. Large amounts of data can exceed the computational power available in such environments, making their design a challenging task. Common processing pipelines use many computationally expensive functions, which makes it necessary to combine high computational capacity with energy efficiency. One strategy to overcome this limitation and provide sufficient computational power together with low energy consumption is the use of specialized hardware such as FPGAs. This class of devices is widely known for its performance-to-consumption ratio, making it an interesting alternative for building capable and efficient embedded systems. This thesis proposes an FPGA-based framework for accelerating a classification algorithm to be implemented in an embedded system. Acceleration is achieved with a SIMD parallelization scheme that exploits the fine-grained parallelism of FPGAs. The proposed system was implemented and tested on real FPGA hardware. To validate the architecture, a graph-based classifier, the Optimum-Path Forest (OPF), was evaluated in a proposed application and then implemented on the proposed architecture. The study of the OPF led to a new learning algorithm for it, based on evolutionary computation concepts and aimed at reducing classification time, which, combined with the hardware implementation, offers performance acceleration sufficient for a variety of embedded systems.
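    As a rough illustration of the classifier being accelerated, and not the thesis code itself, the sketch below implements the standard OPF classification rule: a test sample receives the label of the training node that minimizes the path cost, where extending a path to the sample costs the maximum of the node's own training cost and its distance to the sample. Types and names are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// One node of a trained Optimum-Path Forest: its feature vector, the
// optimum-path cost assigned during training, and its propagated label.
struct OpfNode {
    std::vector<float> features;
    float cost;   // optimum-path cost computed by the training phase
    int   label;  // label inherited from the node's prototype
};

static float euclidean(const std::vector<float>& a, const std::vector<float>& b) {
    float s = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float d = a[i] - b[i];
        s += d * d;
    }
    return std::sqrt(s);
}

// Standard OPF classification rule: pick the training node t minimizing
// max(cost(t), dist(t, sample)).  This inner loop over all training nodes
// is the part a SIMD/FPGA design can evaluate in parallel.
int classify(const std::vector<OpfNode>& forest, const std::vector<float>& sample) {
    float best = std::numeric_limits<float>::max();
    int label = -1;
    for (const OpfNode& t : forest) {
        const float pathCost = std::max(t.cost, euclidean(t.features, sample));
        if (pathCost < best) {
            best = pathCost;
            label = t.label;
        }
    }
    return label;
}
```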

    Sparse Matrix Algorithms Using GPGPU

    The goal of this Bachelor's thesis was to solve computationally demanding tasks as efficiently as possible using GPGPU, i.e. general-purpose computing on graphics processors, with sparse matrix-vector multiplication as the concrete case study. Matrix-vector multiplication underlies many algorithms, for example in image processing and machine learning. A sparse matrix is one in which most elements are zero; since zeros do not affect the product, the aim is to avoid multiplying by them, which is achieved by adapting both the algorithm and the matrix storage format. Five storage formats were benchmarked for the performance and memory footprint of multiplication algorithms tailored to each format: full matrix storage, coordinate storage (COO), ELLPACK (ELL), compressed sparse row (CSR), and compressed diagonal storage (CDS). A further goal was to compare the performance of the algorithms on a CPU versus a GPU. The implementation used OpenCL, a framework for writing programs that can execute on both CPUs and GPUs; the main difficulty is partitioning the task into smaller parts that can be solved in parallel so that the available computing power is used effectively. Tests were run on the author's desktop computer and on worknodes of EENet (the Estonian Education and Research Network), which also made it possible to divide the workload between two GPUs.
    With the full storage format, OpenCL gave a speedup of 3.6x over plain C++ code; with the more complex formats the gain was even more pronounced. The results show that performance depends heavily on the structure of the matrices and the hardware used: for example, the COO format performed roughly 30x worse on the nVidia GPU than on the ATI one. For ELLPACK, representing the vector as a texture gave the nVidia card extra performance, while the ATI card performed twice as badly as with the plain vector representation; overall, converting vectors into images did not give the expected speedup in most cases, so one option is to create both kinds of kernels, one with image support and one with normal memory access, and keep whichever performs better, since the conversion requires only slight modifications to the kernel and host code. As an all-round choice, CSR appeared best, being the fastest in every test, although this may change with a different selection of matrices and further kernel optimization. Dividing the work between two GPUs yielded speedups of up to 60% for matrices with more elements, but a second device adds more calls and extra overhead, so for smaller matrices, where computation takes less time, the two-GPU result was actually slower than with one.
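    For reference, CSR, the format that won out in these tests, stores only the nonzero values plus two index arrays. The sketch below, independent of the thesis code, shows the format and the multiplication y = A·x in plain C++; in the OpenCL versions of such a kernel, each iteration of the outer row loop typically becomes one parallel work-item.

```cpp
#include <cstddef>
#include <vector>

// Compressed sparse row (CSR): the values and column indices of the
// nonzeros, plus rowPtr, where rowPtr[i]..rowPtr[i+1] delimits row i.
struct CsrMatrix {
    std::size_t rows;
    std::vector<float> values;  // nonzero values, stored row by row
    std::vector<int>   colIdx;  // column index of each nonzero
    std::vector<int>   rowPtr;  // rows + 1 entries
};

// y = A * x.  Only the nonzeros are touched, which is the whole point of
// the sparse formats compared in the thesis.
std::vector<float> spmv(const CsrMatrix& A, const std::vector<float>& x) {
    std::vector<float> y(A.rows, 0.0f);
    for (std::size_t i = 0; i < A.rows; ++i) {
        float sum = 0.0f;
        for (int k = A.rowPtr[i]; k < A.rowPtr[i + 1]; ++k) {
            sum += A.values[k] * x[A.colIdx[k]];
        }
        y[i] = sum;
    }
    return y;
}
```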