Search CORE

145 research outputs found

Streaming Model Based Volume Ray Casting Implementation for Cell Broadband Engine

Author
Publication venue: 'Hindawi Limited'
Publication date: 01/01/2009
Field of study

GPU Cost Estimation for Load Balancing in Parallel Ray Tracing

Author: Biagio Cosenza
Carsten Dachsbacher
ERRA UGO
Publication venue: 'Scitepress'
Publication date: 01/01/2013
Field of study

Interactive ray tracing has seen enormous progress in recent years. However, advanced rendering techniques requiring many million rays per second are still not feasible at interactive speed, and are only possible by means of highly parallel ray tracing. When using compute clusters, good load balancing is crucial in order to fully exploit the available computational power, and to not suffer from the overhead involved by synchronization barriers. In this paper, we present a novel GPU method to compute a costmap: a per-pixel cost estimate of the ray tracing rendering process. We show that the cost map is a powerful tool to improve load balancing in parallel ray tracing, and it can be used for adaptive task partitioning and enhanced dynamic load balancing. Its effectiveness has been proven in a parallel ray tracer implementation tailored for a cluster of workstations

Archivio della Ricerca - Università della Basilicata

Experiences with Mesh-like computations using Prediction Binary Trees

Author: CORDASCO G
COSENZA B
DE CHIARA R
ERRA UGO
SCARANO V.
Publication venue: Warszawa : Szkoła Wyższa Psychologii Społecznej.
Publication date: 01/01/2009
Field of study

In this paper we aim at exploiting the temporal coherence among successive phases of a computation, in order to implement a load-balancing technique in mesh-like computations to be mapped on a cluster of processors. A key concept, on which the load balancing schema is built on, is the use of a Predictor component that is in charge of providing an estimation of the unbalancing between successive phases. By using this information, our method partitions the computation in balanced tasks through the Prediction Binary Tree (PBT). At each new phase, current PBT is updated by using previous phase computing time for each task as next phase's cost estimate. The PBT is designed so that it balances the load across the tasks as well as reduces {\em dependency} among processors for higher performances. Reducing dependency is obtained by using rectangular tiles of the mesh, of almost-square shape (i. e. one dimension is at most twice the other). By reducing dependency, one can reduce inter-processors communication or exploit local dependencies among tasks (such as data locality). Furthermore, we also provide two heuristics which take advantage of data-locality. Our strategy has been assessed on a significant problem, Parallel Ray Tracing. Our implementation shows a good scalability, and improves performance in both cheaper commodity cluster and high performance clusters with low latency networks. We report different measurements showing that tasks granularity is a key point for the performances of our decomposition/mapping strategy

Archivio della Ricerca - Università della Basilicata

Archivio della Ricerca - Università di Salerno

Archivio Istituzionale della Ricerca - Università degli Studi della Campania "Luigi Vanvitelli"

eXCloud: Transparent runtime support for scaling mobile applications in cloud

Author: Lam KT
Ma RKK
Wang CL
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2011
Field of study

Cloud computing augments applications with ease-of-access to the enormous resources on the Internet. Combined with mobile computing technologies, mobile applications can exploit the Cloud everywhere by statically distributing code segments or dynamically migrating running processes onto cloud services. Existing migration techniques are however too coarse-grained for mobile devices, so the overheads often offset the benefits of migration. To build a truly elastic mobile cloud computing infrastructure, we introduce eXCloud (eXtensible Cloud) - a middleware system with multi-level mobility support, ranging from as coarse as a VM instance to as fine as a runtime stack frame, and allows resources to be integrated and used dynamically. In eXCloud, a stack-on-demand (SOD) approach is used to support computation mobility throughout the mobile cloud environment. The approach is fully adaptive, goal-driven and transparent. By downward task migration, applications running on the cloud nodes can exploit or take control of special resources in mobile devices such as GPS and cameras. With a restorable MPI layer, task migrations of MPI parallel programs can happen between cloud nodes or be initiated from a mobile device. Our evaluation shows that SOD outperforms several existing migration mechanisms in terms of migration overhead and latency. All our techniques result in better resource utilization through task migrations among cloud nodes and mobile nodes.published_or_final_versionThe 2011 International Conference on Cloud and Service Computing (CSC), Hong Kong, China, 12-14 December 2011. In Proceedings of CSC, 2011, p. 103-11

HKU Scholars Hub

MemPool: A Scalable Manycore Architecture with a Low-Latency Shared L1 Memory

Author: Andri Renzo
Benini Luca
Cavalcante Matheus
Riedel Samuel
Publication venue
Publication date: 30/03/2023
Field of study

Shared L1 memory clusters are a common architectural pattern (e.g., in GPGPUs) for building efficient and flexible multi-processing-element (PE) engines. However, it is a common belief that these tightly-coupled clusters would not scale beyond a few tens of PEs. In this work, we tackle scaling shared L1 clusters to hundreds of PEs while supporting a flexible and productive programming model and maintaining high efficiency. We present MemPool, a manycore system with 256 RV32IMAXpulpimg "Snitch" cores featuring application-tunable functional units. We designed and implemented an efficient low-latency PE to L1-memory interconnect, an optimized instruction path to ensure each PE's independent execution, and a powerful DMA engine and system interconnect to stream data in and out. MemPool is easy to program, with all the cores sharing a global view of a large, multi-banked, L1 scratchpad memory, accessible within at most five cycles in the absence of conflicts. We provide multiple runtimes to program MemPool at different abstraction levels and illustrate its versatility with a wide set of applications. MemPool runs at 600 MHz (60 gate delays) in typical conditions (TT/0.80V/25{\deg}C) in 22 nm FDX technology and achieves a performance of up to 229 GOPS or 192 GOPS/W with less than 2% of execution stalls.Comment: 14 pages, 17 figures, 2 table

arXiv.org e-Print Archive

Repository for Publications and Research Data

Efficient rendering of large 3-D and 4-D scalar fields

Author: Kim Jusub
Publication venue
Publication date: 09/05/2008
Field of study

Rendering volumetric data, as a compute/communication intensive and highly parallel application, represents the characteristics of future workloads for desktop computers. Interactively rendering volumetric data has been a challenging problem due to its high computational and communication requirements. With the consistent trend toward high resolution data, it has remained a difficult problem despite the continuous increase in processing power, because of the increasing performance gap between computation and communication. On the other hand, the new multi-core architecture trend in computational units in PC, which can be characterized by parallelism and heterogeneity, provides both opportunities and challenges. While the new on-chip parallel architectures offer opportunities for extremely high performance, widespread use of those parallel processors requires extensive changes in previous algorithms to take advantage of the new architectures. In this dissertation, we develop new methods and techniques to support interactive rendering of large volumetric data. In particular, we present a novel method to layout data on disk for efficiently performing an out-of-core axis-aligned slicing of large multidimensional scalar fields. We also present a new method to efficiently build an out-of-core indexing structure for n-dimensional volumetric data. Then, we describe a streaming model for efficiently implementing volume ray casting on a heterogeneous compute resource environment. We describe how we implement the model on SONY/TOSHIBA/IBM Cell Broadband Engine and on NVIDIA CUDA architecture. Our results show that our out-of-core techniques significantly reduce the communication bandwidth requirements and that our streaming model very effectively makes use of the strengths of those heterogeneous parallel compute resource environment for volume rendering. In all cases, we achieve scalability and load balancing, while hiding memory latency

Digital Repository at the University of Maryland

Efficient distributed load balancing for parallel algorithms

Author: Cosenza Biagio
Publication venue: Universita degli studi di Salerno
Publication date: 29/04/2011
Field of study

2009 - 2010With the advent of massive parallel processing technology, exploiting the power offered by hundreds, or even thousands of processors is all but a trivial task. Computing by using multi-processor, multi-core or many-core adds a number of additional challenges related to the cooperation and communication of multiple processing units. The uneven distribution of data among the various processors, i.e. the load imbalance, represents one of the major problems in data parallel applications. Without good load distribution strategies, we cannot reach good speedup, thus good efficiency. Load balancing strategies can be classified in several ways, according to the methods used to balance workload. For instance, dynamic load balancing algorithms make scheduling decisions during the execution and commonly results in better performance compared to static approaches, where task assignment is done before the execution. Even more important is the difference between centralized and distributed load balancing approaches. In fact, despite that centralized algorithms have a wider vision of the computation, hence may exploit smarter balancing techniques, they expose global synchronization and communication bottlenecks involving the master node. This definitely does not assure scalability with the number of processors. This dissertation studies the impact of different load balancing strategies. In particular, one of the key observations driving our work is that distributed algorithms work better than centralized ones in the context of load balancing for multi-processors (alike for multi-cores and many-cores as well). We first show a centralized approach for load balancing, then we propose several distributed approaches for problems having different parallelization, workload distribution and communication pattern. We try to efficiently combine several approaches to improve performance, in particular using predictive metrics to obtain a per task compute-time estimation, using adaptive subdivision, improving dynamic load balancing and addressing distributed balancing schemas. The main challenge tackled on this thesis has been to combine all these approaches together in new and efficient load balancing schemas. We assess the proposed balancing techniques, starting from centralized approaches to distributed ones, in distinctive real case scenarios: Mesh-like computation, Parallel Ray Tracing, and Agent-based Simulations. Moreover, we test our algorithms with parallel hardware such has cluster of workstations, multi-core processors and exploiting SIMD vectorial instruction set. Finally, we conclude the thesis with several remarks, about the impact of distributed techniques, the effect of the communication pattern and workload distribution, the use of cost estimation for adaptive partitioning, the trade-off fast versus accuracy in prediction-based approaches, the effectiveness of work stealing combined with sorting, and a non-trivial way to exploit hybrid CPUGPU computations. [edited by author]IX n.s

EleA@UniSA - Università degli Studi di Salerno

Realtime ray tracing and interactive global illumination

Author: Wald Ingo
Publication venue: Fakultät 6 - Naturwissenschaftlich-Technische Fakultät I. Fachrichtung 6.2 - Informatik
Publication date: 01/01/2004
Field of study

One of the most sought-for goals in computer graphics is to generate "realism in real time". i.e. the generation of realistically looking images at realtime frame rates. Today, virtually all approaches towards realtime rendering use graphics hardware, which is based almost exclusively on triangle rasterization. Unfortunately, though this technology has seen tremendous progress over the last few years, for many applications it is currently reaching its limits in both model complexity, supported features, and achievable realism. An alternative to triangle rasterizations is the ray tracing algorithm, which is well-known for its higher flexibility, its generally higher achievable realism, and its superior scalability in both model size and compute power. However, ray tracing is also computationally demanding and thus so far is used almost exclusively for high-quality offline rendering tasks. This dissertation focuses on the question why ray tracing is likely to soon play a larger role for interactive applications, and how this scenario can be reached. To this end, we discuss the RTRT/OpenRT realtime ray tracing system, a software based ray tracing system that achieves interactive to realtime frame rates on todays commodity CPUs. In particular, we discuss the overall system design, the efficient implementation of the core ray tracing algorithms, techniques for handling dynamic scenes, an efficient parallelization framework, and an OpenGL-like low-level API. Taken together, these techniques form a complete realtime rendering engine that supports massively complex scenes, highley realistic and physically correct shading, and even physically based lighting simulation at interactive rates. In the last part of this thesis we then discuss the implications and potential of realtime ray tracing on global illumination, and how the availability of this new technology can be leveraged to finally achieve interactive global illumination - the physically correct simulation of light transport at interactive rates.Eines der wichtigsten Ziele der Computer-Graphik ist die Generierung von "Realismus in Echtzeit\u27; — die Erzeugung von realistisch wirkenden, computer- generierten Bildern in Echtzeit. Heutige Echtzeit-Graphikanwendungen werden derzeit zum überwiegenden Teil mit schneller Graphik-Hardware realisiert, welche zum aktuellen Stand der Technik fast ausschliesslich auf dem Dreiecksrasterisierungsalgorithmus basiert. Obwohl diese Rasterisierungstechnologie in den letzten Jahren zunehmend beeindruckende Fortschritte gemacht hat, stößt sie heutzutage zusehends an ihre Grenzen, speziell im Hinblick auf Modellkomplexität, unterstützte Beleuchtungseffekte, und erreichbaren Realismus. Eine Alternative zur Dreiecksrasterisierung ist das "Ray-Tracing\u27; (Stahl-Rückverfolgung), welches weithin bekannt ist für seine höhere Flexibilität, seinen im Großen und Ganzen höheren erreichbaren Realismus, und seine bessere Skalierbarkeit sowohl in Szenengröße als auch in Rechner-Kapazitäten. Allerdings ist Ray-Tracing ebenso bekannt für seinen hohen Rechenbedarf, und wird daher heutzutage fast ausschließlich für die hochqualitative, nichtinteraktive Bildsynthese benutzt. Diese Dissertation behandelt die Gründe warum Ray-Tracing in näherer Zukunft voraussichtlich eine größere Rolle für interaktive Graphikanwendungen spielen wird, und untersucht, wie dieses Szenario des Echtzeit Ray-Tracing erreicht werden kann. Hierfür stellen wir das RTRT/OpenRT Echtzeit Ray-Tracing System vor, ein software-basiertes Ray-Tracing System, welches es erlaubt, interaktive Performanz auf heutigen Standard-PC-Prozessoren zu erreichen. Speziell diskutieren wir das grundlegende System-Design, die effiziente Implementierung der Kern-Algorithmen, Techniken zur Unterstützung von dynamischen Szenen, ein effizientes Parallelisierungs-Framework, und eine OpenGL-ähnliche Anwendungsschnittstelle. In ihrer Gesamtheit formen diese Techniken ein komplettes Echtzeit-Rendering-System, welches es erlaubt, extrem komplexe Szenen, hochgradig realistische und physikalisch korrekte Effekte, und sogar physikalisch-basierte Beleuchtungssimulation interaktiv zu berechnen. Im letzten Teil der Dissertation behandeln wir dann die Implikationen und das Potential, welches Echtzeit Ray-Tracing für die Globale Beleuchtungssimulation bietet, und wie die Verfügbarkeit dieser neuen Technologie benutzt werden kann, um letztendlich auch Globale Belechtung — die physikalisch korrekte Simulation des Lichttransports — interaktiv zu berechnen

Universaar

MPG.PuRe

Acronym

Masivně paralelní implementace algoritmů počítačové grafiky

Author: Jaroš Milan
Publication venue: Vysoká škola báňská – Technická univerzita Ostrava
Publication date: 01/01/2021
Field of study

Computer graphics, since its inception in the 1960s, has made great progress. It has become part of everyday life. We can see it all around us, from smartwatches and smartphones, where graphic accelerators are already part of the chips and can render not only interactive menus but also demanding graphic applications, to laptops and personal computers as well as to high-performance visualization servers and supercomputers that can display demanding simulations in real time. In this dissertation we focus on one of the most computationally demanding area of computer graphics and that is the computation of global illumination. One of the most widely used methods for simulating global illumination is the path tracing method. Using this method, we can visualize, for example, scientific or medical data. The path tracing method can be accelerated using multiple graphical accelerators, which we will focus on in this work. We will present a solution for path tracing of massive scenes on multiple GPUs. Our approach analyzes the memory access pattern of the path tracer and defines how the scene data should be distributed across up to 16 GPUs with minimal performance impact. The key concept is that the parts of the scene that have the highest number of memory accesses are replicated across all GPUs. We present two methods for maximizing the performance of path tracing when dealing with partially distributed scene data. Both methods operate at the memory management level, and therefore the path tracing data structures do not need to be redesigned. We implemented this new out-of-core mechanism in the open-source Blender Cycles path tracer, which we also extended with technologies that support running on supercomputers and can take advantage of all accelerators allocated on multiple nodes. In this work, we also introduce a new service that uses our extended version of the Blender Cycles renderer to simplify sending and running jobs directly from Blender.Počítačová grafika od svého vzniku v 60. letech 20. století udělala velký pokrok. Stala se součástí každodenního života. Můžeme ji vidět všude kolem nás, od chytrých hodinek a smartphonů, kde jsou grafické akcelerátory již součástí čipů a dokáží vykreslovat nejen interaktivní menu, ale i náročné grafické aplikace, přes notebooky a osobní počítače až po výkonné vizualizační servery nebo superpočítače, které dokáží zobrazovat náročné simulace v reálném čase. V této disertační práci se zaměříme na jednu z výpočetně nejnáročnějších oblastí počítačové grafiky, a tou je výpočet globálního osvětlení. Jednou z nejpoužívanějších metod pro simulaci globálního osvětlení je metoda sledování cesty. Pomocí této metody můžeme vizualizovat např. vědecká nebo lékařská data. Metodu sledování cest lze urychlit pomocí několika grafických akcelerátorů, na které se v této práci zaměříme. Představíme řešení pro vykreslování masivních scén na více GPU. Náš přístup analyzuje vzory přístupů k paměti a definuje, jak by měla být data scény rozdělena mezi grafickými akcelerátory s minimální ztrátou výkonu. Klíčovým konceptem je, že části scény, které mají nejvyšší počet přístupů do paměti, jsou replikovány na všech grafických akcelerátorech. Představíme dvě metody pro maximalizaci výkonu vykreslování při práci s částečně distribuovanými daty scény. Obě metody pracují na úrovni správy paměti, a proto není třeba datové struktury přepracovávat. Tento nový out-of-core mechanismus jsme implementovali do open-source path traceru Blender Cycles, který jsme také rozšířili o technologie podporující běh na superpočítačích a schopné využít všechny akcelerátory alokované na více uzlech. V této práci také představíme novou službu, která využívá naši rozšířenou verzi Blender Cycles a zjednodušuje odesílání a spouštění úloh přímo z programu Blender.96220 - Laboratoř pro výzkum infrastrukturyvyhově

DSpace at VSB Technical University of Ostrava

A domain-specific high-level programming model

Author: Balarin
Bhattacharya
Blumofe
Board
Boulos
Cameron
Chandra
Grotker
Hiram
Houzet
Kirk
Lee
Munshi
Pacheco
Parhi
Polukhin
Reinders
Sanders
Skillicorn
Valiant
Publication venue: 'Wiley'
Publication date: 22/09/2015
Field of study

International audienceNowadays, computing hardware continues to move toward more parallelism and more heterogeneity, to obtain more computing power. From personal computers to supercomputers, we can find several levels of parallelism expressed by the interconnections of multi-core and many-core accelerators. On the other hand, computing software needs to adapt to this trend, and programmers can use parallel programming models (PPM) to fulfil this difficult task. There are different PPMs available that are based on tasks, directives, or low level languages or library. These offer higher or lower abstraction levels from the architecture by handling their own syntax. However, to offer an efficient PPM with a greater (additional) high-levelabstraction level while saving on performance, one idea is to restrict this to a specific domain and to adapt it to a family of applications. In the present study, we propose a high-level PPM specific to digital signal processing applications. It is based on data-flow graph models of computation, and a dynamic runtime model of execution (StarPU). We show how the user can easily express this digital signal processing application, and can take advantage of task, data and graph parallelism in the implementation, to enhance the performances of targeted heterogeneous clusters composed of CPUs and different accelerators (e.g., GPU, Xeon Phi

Crossref

Hal - Université Grenoble Alpes