6,269 research outputs found
GPU Cost Estimation for Load Balancing in Parallel Ray Tracing
Interactive ray tracing has seen enormous progress in recent years. However, advanced rendering techniques requiring many millions of rays per second are still not feasible at interactive speed, and are only possible by means of highly parallel ray tracing. When using compute clusters, good load balancing is crucial to fully exploit the available computational power and to avoid the overhead incurred at synchronization barriers. In this paper, we present a novel GPU method to compute a cost map: a per-pixel cost estimate of the ray tracing rendering process. We show that the cost map is a powerful tool to improve load balancing in parallel ray tracing, and that it can be used for adaptive task partitioning and enhanced dynamic load balancing. Its effectiveness has been demonstrated in a parallel ray tracer implementation tailored for a cluster of workstations.
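To make the idea of cost-driven task partitioning concrete, here is a minimal sketch that assumes a per-pixel cost estimate is already available as a plain 2D list. It is not the paper's GPU method: the function names `partition_columns` and `strip_costs`, and the choice of vertical strips, are assumptions made purely for illustration.

```python
from bisect import bisect_left
from itertools import accumulate


def partition_columns(cost_map, num_workers):
    """Split image columns into contiguous strips of roughly equal estimated cost.

    cost_map is a list of rows; each entry is a hypothetical per-pixel cost.
    Returns one half-open (start_col, end_col) range per worker.
    """
    width = len(cost_map[0])
    # Estimated cost of each column, then the running total across columns.
    column_cost = [sum(row[x] for row in cost_map) for x in range(width)]
    prefix = list(accumulate(column_cost))
    total = prefix[-1]

    bounds = [0]
    for w in range(1, num_workers):
        target = total * w / num_workers
        # Cut after the first column whose running total reaches the target.
        cut = min(bisect_left(prefix, target) + 1, width)
        bounds.append(max(cut, bounds[-1]))
    bounds.append(width)
    return list(zip(bounds[:-1], bounds[1:]))


def strip_costs(cost_map, strips):
    """Total estimated cost of each strip, useful for checking the balance."""
    return [sum(sum(row[a:b]) for row in cost_map) for a, b in strips]


if __name__ == "__main__":
    # Toy cost map: the right-hand columns are much more expensive.
    cost_map = [[1, 1, 1, 1, 8, 8], [1, 1, 1, 1, 8, 8]]
    strips = partition_columns(cost_map, 2)
    print(strips, strip_costs(cost_map, strips))
```

On this toy input, splitting the six columns evenly by width would give strips costing 6 and 34, whereas the cost-aware split gives 24 and 16. A single very expensive column can leave a later strip empty, which a real partitioner would need to handle.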
Memory-savvy distributed interactive ray tracing
Journal Article. Interactive ray tracing in a cluster environment requires paying close attention to the constraints of a loosely coupled distributed system. To render large scenes interactively, memory limits and network latency must be addressed efficiently. In this paper, we improve previous systems by moving to a page-based distributed shared memory layer, resulting in faster and easier access to a shared memory space. The technique is designed to take advantage of the large virtual memory space provided by 64-bit machines. We also examine task reuse through decentralized load balancing and primitive reorganization to complement the shared memory system. These techniques improve memory coherence and are valuable when physical memory is limited. C-SAF
EasyFJP: Providing Hybrid Parallelism as a Concern for Divide and Conquer Java Applications
Because of the increasing availability of multi-core machines, clusters, Grids, and combinations of these, there is now plenty of computational power, but today's programmers are not fully prepared to exploit parallelism. In particular, Java has helped in handling the heterogeneity of such environments. However, there is a lot of ground to cover regarding facilities to easily and elegantly parallelize applications. One path to this end seems to be the synthesis of semi-automatic parallelism and Parallelism as a Concern (PaaC). The former allows users to be mostly unaware of parallel exploitation problems and at the same time manually optimize parallelized applications whenever necessary, while the latter allows applications to be separated from parallel-related code. In this paper, we present EasyFJP, an approach that implicitly exploits parallelism in Java applications based on the fork-join synchronization pattern, a simple but effective abstraction for creating and coordinating parallel tasks. In addition, EasyFJP lets users explicitly optimize applications through policies, that is, user-provided rules to dynamically regulate task granularity. Finally, EasyFJP relies on PaaC by means of source code generation techniques to wire applications and parallel-specific code together. Experiments with real-world applications on an emulated Grid and a cluster show that EasyFJP delivers competitive performance compared to state-of-the-art Java parallel programming tools.
Affiliation: Mateos Diaz, Cristian Maximiliano. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico - CONICET - Tandil. Instituto Superior de Ingenieria del Software; Argentina
Affiliation: Zunino Suarez, Alejandro Octavio. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico - CONICET - Tandil. Instituto Superior de Ingenieria del Software; Argentina
Affiliation: Hirsch Jofré, Matías Eberardo. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico - CONICET - Tandil. Instituto Superior de Ingenieria del Software; Argentina
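The fork-join pattern with a user-supplied granularity policy, as described in the abstract above, can be sketched roughly as follows. This is not EasyFJP's API and it is written in Python rather than Java; `fork`, `parallel_sum`, and the size-threshold policy are hypothetical names chosen for illustration. Leaves are collected first and then run on a pool, which avoids blocking pool threads on nested joins.

```python
from concurrent.futures import ThreadPoolExecutor


def fork(lo, hi, should_split):
    """Recursively split [lo, hi) until the policy declines; return leaf ranges."""
    if not should_split(lo, hi):
        return [(lo, hi)]
    mid = (lo + hi) // 2
    return fork(lo, mid, should_split) + fork(mid, hi, should_split)


def parallel_sum(data, threshold=1000, workers=4):
    """Sum data by forking leaf tasks and joining their partial results."""
    threshold = max(threshold, 1)  # a threshold below 1 would never stop splitting

    # Hypothetical granularity policy: keep splitting while a range is too large.
    def policy(lo, hi):
        return hi - lo > threshold

    leaves = fork(0, len(data), policy)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Fork: one task per leaf. Join: combine the partial sums.
        partials = pool.map(lambda r: sum(data[r[0]:r[1]]), leaves)
        return sum(partials)


if __name__ == "__main__":
    print(parallel_sum(list(range(10000)), threshold=1000))
```

Swapping in a different `policy` function changes task granularity without touching the summation logic, which is the separation the abstract attributes to policies. In CPython, threads will not speed up this CPU-bound sum because of the global interpreter lock, so the sketch shows the structure only.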
Memory sharing for interactive ray tracing on clusters
Manuscript. We present recent results in the application of distributed shared memory to image parallel ray tracing on clusters. Image parallel rendering is traditionally limited to scenes that are small enough to be replicated in the memory of each node, because any processor may require access to any piece of the scene. We solve this problem by making all of a cluster's memory available through software distributed shared memory layers. With gigabit ethernet connections, this mechanism is sufficiently fast for interactive rendering of multi-gigabyte datasets. Object- and page-based distributed shared memories are compared, and optimizations for efficient memory use are discussed.
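As a rough illustration of what a page-based shared memory layer does on each node, the sketch below keeps a bounded local cache of scene pages and fetches missing pages on demand. The abstract does not specify an eviction policy, so the least-recently-used eviction here is an assumption, as are the `PageCache` class and the `fetch_page` callback, which stands in for a network request to the node that owns the page.

```python
from collections import OrderedDict


class PageCache:
    """Bounded local cache of fixed-size pages with LRU eviction (an assumption)."""

    def __init__(self, fetch_page, page_size, capacity):
        self.fetch_page = fetch_page  # hypothetical callback: page_id -> list of items
        self.page_size = page_size
        self.capacity = capacity      # maximum resident pages, assumed >= 1
        self.pages = OrderedDict()    # page_id -> page data, oldest first
        self.misses = 0

    def read(self, address):
        """Return the item at a global address, fetching its page if needed."""
        page_id, offset = divmod(address, self.page_size)
        if page_id in self.pages:
            self.pages.move_to_end(page_id)    # mark as most recently used
        else:
            self.misses += 1
            self.pages[page_id] = self.fetch_page(page_id)
            if len(self.pages) > self.capacity:
                self.pages.popitem(last=False)  # evict the least recently used page
        return self.pages[page_id][offset]


if __name__ == "__main__":
    scene = list(range(100))  # stand-in for scene data spread across the cluster
    cache = PageCache(lambda pid: scene[pid * 10:(pid + 1) * 10], 10, 2)
    print([cache.read(a) for a in (5, 7, 15, 25, 5)], cache.misses)
```

With room for only two pages, the final read of address 5 misses again because its page was evicted, which is the behaviour that makes memory coherence matter when physical memory is limited.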
Redundant movements in autonomous mobility: experimental and theoretical analysis
Distributed load balancers exhibit thrashing where tasks are repeatedly moved between locations due to incomplete global load information. This paper shows that systems of autonomous mobile programs (AMPs) exhibit the same behaviour, and identifies two types of redundant movement (greedy effects). AMPs are unusual in that, in place of some external load management system, each AMP periodically recalculates network and program parameters and may independently move to a better execution environment. Load management emerges from the behaviour of collections of AMPs.
The paper explores the extent of greedy effects by simulating collections of AMPs and proposes negotiating AMPs (NAMPs) to ameliorate the problem. We present the design of AMPs with a competitive negotiation scheme (cNAMPs), and compare their performance with AMPs by simulation. We establish new properties of balanced networks of AMPs, and use these to provide a theoretical analysis of greedy effects.
Efficient distributed load balancing for parallel algorithms
2009 - 2010. With the advent of massively parallel processing technology, exploiting the power
offered by hundreds, or even thousands, of processors is anything but a trivial task.
Computing with multi-processor, multi-core, or many-core systems adds a number of
additional challenges related to the cooperation and communication of multiple
processing units.
The uneven distribution of data among the various processors, i.e. the load
imbalance, represents one of the major problems in data parallel applications.
Without good load distribution strategies, we cannot reach good speedup, and thus
good efficiency.
Load balancing strategies can be classified in several ways, according to the
methods used to balance workload. For instance, dynamic load balancing algorithms
make scheduling decisions during the execution and commonly result
in better performance compared to static approaches, where task assignment is
done before the execution.
Even more important is the difference between centralized and distributed
load balancing approaches. In fact, although centralized algorithms have
a wider view of the computation, and hence may exploit smarter balancing techniques,
they expose global synchronization and communication bottlenecks involving
the master node. This does not ensure scalability with the
number of processors.
This dissertation studies the impact of different load balancing strategies.
In particular, one of the key observations driving our work is that distributed
algorithms work better than centralized ones in the context of load balancing
for multi-processors (and likewise for multi-cores and many-cores).
We first show a centralized approach for load balancing, then we propose several
distributed approaches for problems having different parallelization, workload
distribution and communication patterns. We try to efficiently combine several
approaches to improve performance, in particular using predictive metrics
to obtain per-task compute-time estimates, using adaptive subdivision, improving
dynamic load balancing and addressing distributed balancing schemas.
The main challenge tackled in this thesis has been to combine all these approaches
into new and efficient load balancing schemas.
We assess the proposed balancing techniques, from centralized approaches
to distributed ones, in distinct real-world scenarios: Mesh-like computation,
Parallel Ray Tracing, and Agent-based Simulations. Moreover, we
test our algorithms on parallel hardware such as clusters of workstations and
multi-core processors, and by exploiting SIMD vector instruction sets.
Finally, we conclude the thesis with several remarks about the impact of
distributed techniques, the effect of the communication pattern and workload
distribution, the use of cost estimation for adaptive partitioning, the trade-off
between speed and accuracy in prediction-based approaches, the effectiveness of work
stealing combined with sorting, and a non-trivial way to exploit hybrid CPU-GPU
computations. [edited by author] IX n.s
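The contrast the abstract draws between static and dynamic load balancing, and the benefit of sorting tasks by estimated cost, can be illustrated with a small makespan simulation. This is a simplified model under stated assumptions: the dynamic scheduler is a centralized shared queue rather than the distributed work stealing studied in the thesis, communication costs are ignored, and the function names and the sample task costs are hypothetical.

```python
import heapq


def static_makespan(costs, workers):
    """Assign contiguous, equal-count blocks of tasks before execution."""
    n = len(costs)
    bounds = [n * w // workers for w in range(workers + 1)]
    return max(sum(costs[a:b]) for a, b in zip(bounds[:-1], bounds[1:]))


def dynamic_makespan(costs, workers, sort_first=False):
    """Hand the next task to whichever worker becomes idle first.

    With sort_first=True, tasks are taken in decreasing cost order
    (the longest-processing-time-first heuristic).
    """
    queue = sorted(costs, reverse=True) if sort_first else list(costs)
    finish = [0] * workers  # min-heap of worker finish times
    heapq.heapify(finish)
    for cost in queue:
        earliest = heapq.heappop(finish)
        heapq.heappush(finish, earliest + cost)
    return max(finish)


if __name__ == "__main__":
    costs = [1] * 9 + [9, 8, 7]  # the expensive tasks happen to come last
    print(static_makespan(costs, 3))
    print(dynamic_makespan(costs, 3))
    print(dynamic_makespan(costs, 3, sort_first=True))
```

For this input the static split finishes at time 25, the dynamic queue at 12, and the dynamic queue with sorting at 11, which equals the ideal of 33 units of work spread over 3 workers.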
Doctor of Philosophy
Dissertation. Radiation is the dominant mode of heat transfer in high temperature combustion environments. Radiative heat transfer affects the gas and particle phases, including all the associated combustion chemistry. The radiative properties are in turn affected by the turbulent flow field. This bi-directional coupling of radiation-turbulence interactions poses a major challenge in creating parallel-capable, high-fidelity combustion simulations. In this work, a new model was developed in which reciprocal Monte Carlo radiation was coupled with a turbulent, large-eddy simulation combustion model. A technique wherein domain patches are stitched together was implemented to allow for scalable parallelism. The combustion model runs in parallel on a decomposed domain. The radiation model runs in parallel on a recomposed domain. The recomposed domain is stored on each processor after information sharing of the decomposed domain is handled via the Message Passing Interface (MPI). Verification and validation testing of the new radiation model were favorable. Strong scaling analyses were performed on the Ember cluster and the Titan cluster for the CPU-radiation model and GPU-radiation model, respectively. The model demonstrated strong scaling to over 1,700 and 16,000 processing cores on Ember and Titan, respectively.
High-fidelity rendering on shared computational resources
The generation of high-fidelity imagery is a computationally expensive process
and parallel computing has been traditionally employed to alleviate this cost.
However, traditional parallel rendering has been restricted to expensive shared
memory or dedicated distributed processors. In contrast, parallel computing on
shared resources, such as a computational or desktop grid, offers a low-cost alternative. However, prevalent rendering systems are currently incapable of seamlessly handling such shared resources, because these resources suffer from high latencies, restricted
bandwidth and volatility. A conventional approach of rescheduling failed jobs in
a volatile environment inhibits performance by using redundant computations.
Instead, clever task subdivision along with image reconstruction techniques provides an unrestrictive fault-tolerance mechanism, which is highly suitable for
high-fidelity rendering. This thesis presents novel fault-tolerant parallel rendering algorithms for effectively tapping the enormous inexpensive computational
power provided by shared resources.
A first-of-its-kind system for fully dynamic, high-fidelity interactive rendering
on idle resources is presented, which is key to providing immediate feedback
on the changes made by a user. The system achieves interactivity by monitoring
and adapting computations according to run-time variations in the computational
power, and employs a spatio-temporal image reconstruction technique to enhance visual fidelity. Furthermore, the algorithms described for time-constrained offline rendering of still images and animation sequences make it possible to deliver
the results within a user-defined time limit. These novel methods enable the employment
of variable resources in deadline-driven environments.