GPU Cost Estimation for Load Balancing in Parallel Ray Tracing
Interactive ray tracing has seen enormous progress in recent years. However, advanced rendering techniques requiring many millions of rays per second are still not feasible at interactive speed, and are only possible by means of highly parallel ray tracing. When using compute clusters, good load balancing is crucial in order to fully exploit the available computational power and to avoid the overhead incurred by synchronization barriers. In this paper, we present a novel GPU method to compute a cost map: a per-pixel cost estimate of the ray tracing rendering process. We show that the cost map is a powerful tool to improve load balancing in parallel ray tracing, and that it can be used for adaptive task partitioning and enhanced dynamic load balancing. Its effectiveness has been proven in a parallel ray tracer implementation tailored for a cluster of workstations.
Parallel interactive ray tracing and exploiting spatial coherence
Master's dissertation in Informatics Engineering. Ray tracing is a rendering technique that allows simulating a wide range of light transport phenomena, resulting in highly realistic computer-generated imaging. Ray tracing is, however, computationally very demanding compared to other techniques such as rasterization, which achieves shorter rendering times by greatly simplifying the physics of light propagation, at the cost of less realistic images.
The complexity of the ray tracing algorithm makes it unusable for interactive applications on machines without dedicated hardware, such as GPUs. The highly task-independent nature of the algorithm, however, offers great potential for parallel processing, increasing the available computational power through additional resources. This thesis studies different approaches to, and enhancements of, workload decomposition and load balancing in a distributed shared memory cluster in order to achieve interactive frame rates.
This thesis also studies approaches to enhance the ray tracing algorithm by reducing its computational demand without decreasing the quality of the results. To achieve this goal, optimizations that depend on the rays' processing order were implemented. In particular, an alternative to the traditional scan-line traversal order of the image plane is studied, using space-filling curves.
Results have shown linear speed-ups of the ray tracer used in a distributed shared memory cluster. They have also shown that spatial coherence can be used to increase the performance of the ray tracing algorithm, and that the improvement depends on the traversal order of the image plane.
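The space-filling-curve traversal studied in the thesis can be sketched in a few lines. Morton (Z-order) is one common space-filling curve; the code below is an illustrative assumption, not the thesis's implementation, and the function names are hypothetical.

```python
# Hypothetical sketch: visit image-plane pixels in Morton (Z-order)
# rather than scan-line order, so that consecutively traced primary
# rays stay spatially close and cache/coherence benefits improve.

def morton_encode(x, y, bits=16):
    """Interleave the bits of x and y into a single Morton index."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)       # x bits at even positions
        code |= ((y >> i) & 1) << (2 * i + 1)   # y bits at odd positions
    return code

def morton_order(width, height):
    """Return (x, y) pixel coordinates sorted by their Morton index."""
    pixels = [(x, y) for y in range(height) for x in range(width)]
    return sorted(pixels, key=lambda p: morton_encode(p[0], p[1]))
```

For a 2x2 image this yields the familiar Z pattern (0,0), (1,0), (0,1), (1,1); larger images recurse the same pattern within each quadrant.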
A parallel progressive radiosity algorithm based on patch data circulation
Current research on radiosity has concentrated on increasing the accuracy and the speed of the solution. Although algorithmic and meshing techniques decrease the execution time, excessive computational power is still required for complex scenes. Hence, parallelism can be exploited to speed up the method further. This paper aims at providing a thorough examination of parallelism in basic progressive refinement radiosity, and investigates its parallelization on distributed-memory parallel architectures. A synchronous scheme, based on static task assignment, is proposed to achieve better coherence in shooting patch selections. An efficient global circulation scheme is proposed for the parallel light distribution computations, which reduces the total volume of concurrent communication by an asymptotic factor. The proposed parallel algorithm is implemented on an Intel iPSC/2 hypercube multicomputer. The load balance qualities of the proposed static assignment schemes are evaluated experimentally. The effect of coherence in the parallel light distribution computations on the shooting patch selection sequence is also investigated. Theoretical and experimental evaluation is also presented to verify that the proposed parallelization scheme yields equally good performance on multicomputers implementing the simplest (e.g. ring) as well as the richest (e.g. hypercube) interconnection topologies. This paper also proposes and presents a parallel load re-balancing scheme which extends our basic parallel radiosity algorithm to the parallelization of radiosity methods that adopt adaptive subdivision and meshing techniques. (C) 1996 Elsevier Science Ltd.
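The global circulation idea can be illustrated with a small simulation. The sketch below is a hypothetical single-process model of ring circulation, in which each processor's patch block visits every other processor in P - 1 communication steps; it is not the paper's iPSC/2 implementation, and all names are illustrative.

```python
# Illustrative sketch (not the paper's code): circulate patch-data
# blocks around a logical ring so that every processor sees every
# block exactly once, using P - 1 neighbour-to-neighbour steps
# instead of an all-to-all exchange.

def ring_circulate(blocks):
    """Simulate ring circulation: blocks[i] starts on processor i.
    Returns, per processor, the list of blocks it has seen, in order."""
    p = len(blocks)
    seen = [[b] for b in blocks]   # each processor holds its own block first
    current = list(blocks)
    for _ in range(p - 1):
        # every processor forwards its current block to its right neighbour
        current = [current[(i - 1) % p] for i in range(p)]
        for i in range(p):
            seen[i].append(current[i])
    return seen
```

Each step moves only one block per processor, so the concurrent communication volume per step stays constant regardless of P, which is the property the circulation scheme exploits.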
Data distributed, parallel algorithm for ray-traced volume rendering
This paper presents a divide-and-conquer ray-traced volume rendering algorithm and a parallel image compositing method, along with their implementation and performance on the Connection Machine CM-5 and networked workstations. This algorithm distributes both the data and the computations to individual processing units to achieve fast, high-quality rendering of high-resolution data. The volume data, once distributed, is left intact. The processing nodes perform local ray tracing of their subvolumes concurrently; no communication between processing units is needed during this local ray tracing. A subimage is generated by each processing unit, and the final image is obtained by compositing the subimages in the proper order, which can be determined a priori. Test results on the CM-5 and a group of networked workstations demonstrate the practicality of our rendering algorithm and compositing method.
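The a-priori-ordered compositing step can be sketched with the standard "over" operator on premultiplied colors. This is a hedged illustration of the general technique, not the CM-5 implementation; the variable names are assumptions.

```python
# Hypothetical sketch of depth-ordered compositing: each subimage
# pixel carries premultiplied (r, g, b, a), and pixels are folded
# front-to-back with the standard "over" operator.

def composite_over(front, back):
    """Composite two premultiplied (r, g, b, a) pixels, front over back."""
    fr, fg, fb, fa = front
    br, bg, bb, ba = back
    t = 1.0 - fa                      # how much of the back shows through
    return (fr + t * br, fg + t * bg, fb + t * bb, fa + t * ba)

def composite_subimages(pixels_front_to_back):
    """Fold a depth-sorted list of premultiplied pixels into one pixel."""
    result = pixels_front_to_back[0]
    for p in pixels_front_to_back[1:]:
        result = composite_over(result, p)
    return result
```

Because "over" is associative, the subimages can be combined pairwise in any grouping as long as the front-to-back order is respected, which is what makes parallel tree-structured compositing possible.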
Application level runtime load management: a Bayesian approach
Affordable parallel computing on distributed shared systems requires novel approaches to manage the runtime load distribution, since current algorithms fall short of expectations. The efficient execution of irregular parallel applications on dynamically shared computing clusters exhibits unpredictable dynamic behaviour, due both to the application requirements and to the available system resources. This thesis addresses the explicit inclusion of the uncertainty an application level scheduling agent has about the environment in its internal model of the world and in its decision making mechanism. Bayesian decision networks are introduced and a generic framework is proposed for application level scheduling, where a probabilistic inference algorithm helps the scheduler make decisions efficiently, with improved predictions based on the available incomplete and aged measured data. An application level performance model and associated metrics (performance, environment and overheads) are proposed to obtain application and system behaviour estimates, both to include in the scheduling agent's model and to support the evaluation.
To verify that this novel approach improves overall application execution time and scheduling efficiency, a parallel ray tracer was developed as a message passing, irregular, data parallel application, and an execution model prototype was built to run on a computing cluster of seven time-shared nodes with dynamically variable synthetic workloads. To assess the effectiveness of the load management, the stochastic scheduler was evaluated by rendering several complex scenes and compared with three reference scheduling strategies: a uniform work distribution, a demand driven work allocation, and a sensor based deterministic scheduling strategy. The evaluation results show considerable performance improvements over blind strategies, and stress the decision network based scheduler's improvements over the sensor based deterministic approach of identical complexity. Fundação para a Ciência e Tecnologia - PRAXIS XXI 2/2.1/TTT/1557/95.
Parallel rendering algorithms for distributed-memory multicomputers
Ankara: Department of Computer Engineering and Information Science and the Institute of Engineering and Science of Bilkent University, 1997. Thesis (Ph.D.), Bilkent University, 1997. Includes bibliographical references (leaves 166-176). Author: Tahsin Mertefe Kurç.
Efficient distributed load balancing for parallel algorithms
2009 - 2010. With the advent of massive parallel processing technology, exploiting the power offered by hundreds, or even thousands, of processors is anything but a trivial task. Computing with multi-processor, multi-core or many-core systems adds a number of challenges related to the cooperation and communication of multiple processing units.
The uneven distribution of data among the various processors, i.e. the load imbalance, represents one of the major problems in data parallel applications. Without good load distribution strategies, we cannot reach good speedup, and thus good efficiency.
Load balancing strategies can be classified in several ways, according to the methods used to balance the workload. For instance, dynamic load balancing algorithms make scheduling decisions during execution and commonly result in better performance than static approaches, where task assignment is done before execution.
Even more important is the difference between centralized and distributed load balancing approaches. Although centralized algorithms have a wider view of the computation, and hence may exploit smarter balancing techniques, they suffer from global synchronization and communication bottlenecks involving the master node, which do not assure scalability with the number of processors.
This dissertation studies the impact of different load balancing strategies. In particular, one of the key observations driving our work is that distributed algorithms work better than centralized ones in the context of load balancing for multi-processors (and likewise for multi-cores and many-cores).
We first show a centralized approach to load balancing, then we propose several distributed approaches for problems with different parallelization, workload distribution and communication patterns. We try to combine several approaches efficiently to improve performance, in particular using predictive metrics to obtain a per-task compute-time estimate, using adaptive subdivision, improving dynamic load balancing, and addressing distributed balancing schemes. The main challenge tackled in this thesis has been to combine all these approaches into new and efficient load balancing schemes.
We assess the proposed balancing techniques, from centralized approaches to distributed ones, in distinct real-case scenarios: mesh-like computation, parallel ray tracing, and agent-based simulations. Moreover, we test our algorithms on parallel hardware such as clusters of workstations and multi-core processors, also exploiting SIMD vector instruction sets.
Finally, we conclude the thesis with several remarks about the impact of distributed techniques, the effect of the communication pattern and workload distribution, the use of cost estimation for adaptive partitioning, the trade-off between speed and accuracy in prediction-based approaches, the effectiveness of work stealing combined with sorting, and a non-trivial way to exploit hybrid CPU-GPU computations.
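One of the remarks above, work stealing combined with sorting, can be illustrated with a small simulation. The sketch below is a hypothetical single-threaded model, not the dissertation's implementation: each worker keeps its tasks sorted by estimated cost, executes cheap tasks from one end of its own queue, and idle workers steal the most expensive task from the other end of a victim's queue.

```python
# Illustrative sketch (names are hypothetical): work stealing over
# cost-sorted per-worker deques, simulated round-robin in one thread.

from collections import deque

def run_with_stealing(task_costs, num_workers):
    """Simulate execution; returns total executed cost per worker."""
    # deal tasks round-robin, then keep each deque sorted (cheap first)
    queues = [deque(sorted(task_costs[w::num_workers]))
              for w in range(num_workers)]
    done = [0.0] * num_workers
    active = True
    while active:
        active = False
        for w in range(num_workers):
            if queues[w]:
                done[w] += queues[w].popleft()       # own cheap end
                active = True
            else:
                # idle: steal the most expensive task from the fullest victim
                victim = max(range(num_workers), key=lambda v: len(queues[v]))
                if queues[victim]:
                    done[w] += queues[victim].pop()  # victim's expensive end
                    active = True
    return done
```

Stealing from the expensive end moves the largest remaining chunk of work to the idle worker, which tends to reduce the number of steals needed, while owners keep cheap, cache-friendly tasks for themselves.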
Parallel hierarchical global illumination
Solving the global illumination problem is equivalent to determining the intensity of every wavelength of light in all directions at every point in a given scene. The complexity of the problem has led researchers to use approximation methods for solving it on serial computers. Rather than using an approximation method, such as backward ray tracing or radiosity, we have chosen to solve the Rendering Equation by direct simulation of light transport from the light sources. This paper presents an algorithm that solves the Rendering Equation to any desired accuracy, and can be run in parallel on distributed memory or shared memory computer systems with excellent scaling properties. It appears superior in both speed and physical correctness to recently published methods involving bidirectional ray tracing or hybrid treatments of diffuse and specular surfaces. Like progressive radiosity methods, it dynamically refines the geometry decomposition where required, but does so without the excessive storage requirements of ray histories. The algorithm, called Photon, produces a scene that converges to the global illumination solution. This amounts to a huge task for a 1997-vintage serial computer, but using the power of a parallel supercomputer significantly reduces the time required to generate a solution. Currently, Photon can be run in most parallel environments, from a shared memory multiprocessor to a parallel supercomputer, as well as on clusters of heterogeneous workstations.
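The direct-simulation idea can be illustrated at a toy scale. The following sketch is a hedged, hypothetical illustration of forward Monte Carlo light transport, where photons emitted from a source scatter until absorbed; it is unrelated to the actual Photon code, and the parameters and names are assumptions.

```python
# Toy sketch (not the Photon algorithm): forward Monte Carlo transport
# where each photon scatters until absorbed, depositing energy at the
# bounce depth where absorption occurs. More photons -> less noise,
# mirroring the convergence property the paper describes.

import random

def trace_photons(num_photons, absorb_prob=0.5, max_bounces=8, seed=0):
    """Count how many photons are absorbed at each bounce depth."""
    rng = random.Random(seed)            # fixed seed for reproducibility
    deposits = [0] * max_bounces
    for _ in range(num_photons):
        for bounce in range(max_bounces):
            if rng.random() < absorb_prob:   # absorbed: deposit energy here
                deposits[bounce] += 1
                break
            # otherwise the photon scatters and continues to the next bounce
    return deposits
```

With a 0.5 absorption probability, roughly half the surviving photons are absorbed at each successive depth, so the deposit histogram decays geometrically.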
The distributed ASCI supercomputer project
The Distributed ASCI Supercomputer (DAS) is a homogeneous wide-area distributed system consisting of four cluster computers at different locations. DAS has been used for research on communication software, parallel languages and programming systems, schedulers, parallel applications, and distributed applications. The paper gives a preview of the most interesting research results obtained so far in the DAS project.