Integrating Algorithmic and Systemic Load Balancing Strategies in Parallel Scientific Applications
Load imbalance is a major source of performance degradation in parallel scientific applications. Load balancing increases the efficient use of existing resources and improves the performance of parallel applications running in distributed environments. At a coarse level of granularity, advances in runtime systems for parallel programs have been proposed to control available resources as efficiently as possible by exploiting idle resources and migrating tasks. At a finer level of granularity, advances in algorithmic strategies for dynamically balancing computational load by data redistribution have been proposed to respond to variations in processor performance during the execution of a given parallel application. Algorithmic and systemic load balancing strategies have complementary sets of advantages. Integrating the two techniques is possible and should yield a system that delivers advantages over either technique used in isolation. This thesis presents the design and implementation of a system that combines an algorithmic, fine-grained, data-parallel load balancing strategy called Fractiling with a systemic, coarse-grained, task-parallel load balancing system called Hector. It also reports experimental results from running N-body simulations under this integrated system. The experimental results indicate that a distributed runtime environment combining both algorithmic and systemic load balancing strategies can provide performance advantages with little overhead, underscoring the importance of this approach for large, complex scientific applications.
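The fine-grained algorithmic side of this combination belongs to the family of self-scheduling techniques with decreasing chunk sizes. The sketch below illustrates that family, not the thesis's actual Fractiling implementation; the function name and the halving rule are illustrative assumptions.

```python
# A minimal sketch of factoring-style self-scheduling, the family of
# dynamic loop-scheduling techniques Fractiling builds on. The halving
# rule and names here are illustrative assumptions, not the thesis code.

def factoring_chunks(n_iterations, n_workers):
    """Yield successively smaller chunk sizes: in each round, roughly
    half of the remaining iterations are split evenly among workers,
    so early chunks are large (low overhead) and late chunks are small
    (good load balance near the end)."""
    remaining = n_iterations
    while remaining > 0:
        chunk = max(1, remaining // (2 * n_workers))
        for _ in range(n_workers):
            if remaining == 0:
                break
            size = min(chunk, remaining)
            yield size
            remaining -= size
```

Idle workers would grab the next chunk from this schedule, so a slow processor naturally ends up with fewer iterations than a fast one.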
Efficient resource utilization in shared-everything environments
Efficient resource usage is key to achieving better performance in parallel database systems. Up to now, most research has focused on balancing the load on several resources of the same type, i.e. balancing either CPU load or I/O load. In this paper, we present floating probe, a strategy for the parallel evaluation of pipelining segments in a shared-everything environment that provides dynamic load balancing between CPU and I/O resources. The key idea of floating probe is to overlap, as far as data dependencies allow, the I/O-bound build phase and the CPU-bound probe phase of pipelining segments to improve resource utilization. Simulation results show that floating probe achieves shorter execution times while consuming less memory than conventional pipelining strategies.
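The core overlap idea can be illustrated with a producer/consumer pair: an I/O-bound build thread feeds a bounded queue while a CPU-bound probe consumes it, so neither resource sits idle. Everything below (names, timings, the queue bound) is an illustrative toy, not the paper's simulator.

```python
# A toy illustration of the floating-probe overlap: an I/O-bound build
# phase and a CPU-bound probe phase run concurrently, coupled by a
# bounded queue. All names and timings are illustrative assumptions.
import threading
import queue
import time

def build_phase(out_q, n_pages):
    for page in range(n_pages):
        time.sleep(0.001)            # simulate an I/O-bound page read
        out_q.put(page)
    out_q.put(None)                  # end-of-stream marker

def probe_phase(in_q, results):
    while (page := in_q.get()) is not None:
        results.append(page * page)  # simulate CPU-bound probe work

def floating_probe(n_pages=50):
    q, results = queue.Queue(maxsize=8), []
    t = threading.Thread(target=build_phase, args=(q, n_pages))
    t.start()
    probe_phase(q, results)          # probe runs while build is still reading
    t.join()
    return results
```

The bounded queue also models the paper's memory concern: the probe never lags more than a fixed number of pages behind the build.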
Immunological Approaches to Load Balancing in MIMD Systems
Effective utilization of Multiple-Instruction-Multiple-Data (MIMD) parallel
computers requires the application of good load balancing techniques. In this
paper we show that heuristics derived from observation of complex natural
systems, such as the mammalian immune system, can lead to effective load
balancing strategies. In particular, the immune system processes of regulation,
suppression, tolerance, and memory are seen to be powerful load balancing
mechanisms.
We provide a detailed example of our approach applied to parallelization of
an image processing task, that of extracting the circuit design from the images
of the layers of a CMOS integrated circuit. The results of this experiment show
that good speedup characteristics can be obtained when using immune system
derived load balancing strategies.
Comment: The work described in this paper was done between 1990-2001, and was
not published at that time.
Design and implementation of parallel video encoding strategies using divisible load analysis
The processing time needed for motion estimation usually accounts for a significant part of the overall processing time of the video encoder. To improve video encoding speed, reducing the execution time of the motion estimation process is essential. Parallel implementation of video encoding systems using either software or hardware approaches has attracted much attention in the area of real-time video coding. In this paper, we implement a video encoder on a bus network. For such a parallel system, the key concern is the partitioning and balancing of the computational load among the processors so that the overall processing time of the video encoder is minimized. Using the divisible load theory (DLT) paradigm, a strip-wise load partitioning/balancing scheme, a load distribution strategy, and two implementation strategies are developed to exploit the data parallelism inherent in the video encoding process. The striking feature of our design is that both the granularity of the load partitions and all the associated overheads incurred during the parallel video encoding process can be explicitly considered. This significantly contributes to the minimization of the overall processing time of the video encoder. Extensive experimental studies are carried out to test the effectiveness of the proposed strategies. The performance of the parallel video encoder is quantified using the metrics of speedup and performance gain, respectively. The experimental results show that our strategies are effective in exploiting the available parallelism inherent in the video encoding process and provide theoretical insight into how to analytically quantify and minimize the overall processing time of a parallel system. The proposed strategies can easily be extended and applied to improve other existing parallel systems.
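The DLT paradigm's central calculation is choosing load fractions so that all processors finish at the same instant. The sketch below shows the textbook recursion for a bus network with sequential load distribution; the cost model (per-unit compute times `w`, per-unit communication time `z`) is a generic DLT illustration, not the paper's video-encoding model with its explicit overheads.

```python
# A minimal divisible-load-theory (DLT) sketch for a bus network:
# compute load fractions alpha_i that equalize finish times when the
# load is sent to the processors one after another over a shared bus.
# w[i] = compute time per unit load on processor i; z = communication
# time per unit load. These parameters are illustrative assumptions.

def dlt_fractions(w, z):
    alpha = [1.0]
    for i in range(1, len(w)):
        # Equal-finish-time condition between consecutive processors:
        # alpha[i-1] * w[i-1] = alpha[i] * (z + w[i])
        alpha.append(alpha[i - 1] * w[i - 1] / (z + w[i]))
    total = sum(alpha)
    return [a / total for a in alpha]       # normalize so fractions sum to 1
```

On a homogeneous bus the fractions decrease geometrically: processors served later must receive less load, because their communication delay eats into their compute window.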
Designing a scalable dynamic load-balancing algorithm for pipelined single program multiple data applications on a non-dedicated heterogeneous network of workstations
Dynamic load balancing strategies have been shown to be the most critical part of an efficient implementation of various applications on large distributed computing systems. The need for dynamic load balancing strategies increases when the underlying hardware is a non-dedicated heterogeneous network of workstations (HNOW). This research focuses on the single program multiple data (SPMD) programming model, as it has been extensively used in parallel programming for its simplicity and scalability in terms of computational power and memory size.
This dissertation formally defines and addresses the problem of designing a scalable dynamic load-balancing algorithm for pipelined SPMD applications on a non-dedicated HNOW. During this process, the HNOW parameters, SPMD application characteristics, and load-balancing performance parameters are identified.
The dissertation presents a taxonomy that categorizes general load balancing algorithms and a methodology that facilitates creating new algorithms that can harness the HNOW computing power while preserving the scalability of the SPMD application.
The dissertation devises a new algorithm, DLAH (Dynamic Load-balancing Algorithm for HNOW). DLAH is based on a modified diffusion technique that incorporates the HNOW parameters. An analytical performance bound for the worst-case scenario of the diffusion technique has been derived.
The dissertation develops and utilizes an HNOW simulation model to conduct extensive simulations. These simulations were used to validate DLAH and compare its performance to related dynamic algorithms. The simulation results show that the DLAH algorithm is scalable and performs well for both homogeneous and heterogeneous networks. A detailed sensitivity analysis was conducted to study the effects of key parameters on performance.
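The classic diffusion technique that DLAH modifies is simple to state: each node repeatedly exchanges a fixed fraction of its load difference with each neighbour, so load spreads out like heat. The sketch below uses a ring topology and a diffusion parameter of 0.25; both are illustrative assumptions, and the HNOW-specific modifications of DLAH are not shown.

```python
# A minimal sketch of the classic diffusion load-balancing scheme that
# DLAH modifies. Ring topology and the diffusion parameter alpha are
# illustrative assumptions, not the dissertation's actual algorithm.

def diffusion_step(load, alpha=0.25):
    """One synchronous diffusion sweep: every node moves a fraction
    alpha of each load difference toward each ring neighbour."""
    n = len(load)
    new = load[:]
    for i in range(n):
        for j in ((i - 1) % n, (i + 1) % n):   # ring neighbours
            new[i] += alpha * (load[j] - load[i])
    return new

def balance(load, steps=100):
    for _ in range(steps):
        load = diffusion_step(load)
    return load
```

Total load is conserved at every step (what one node gives, its neighbour receives), and for small enough alpha the imbalance decays geometrically, which is why worst-case bounds like the one derived in the dissertation are tractable.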
Feed-forward volume rendering algorithm for moderately parallel MIMD machines
Algorithms for direct volume rendering on parallel and vector processors are investigated. Volumes are transformed efficiently on parallel processors by dividing the data into slices and beams of voxels. Equal-sized sets of slices along one axis are distributed to processors. Parallelism is achieved at two levels. Because each slice can be transformed independently of the others, processors transform their assigned slices with no communication, providing the maximum possible parallelism at the first level. Within each slice, consecutive beams are incrementally transformed using coherency in the transformation computation. Coherency across slices can also be exploited to further enhance performance. This coherency yields the second level of parallelism through the use of vector processing or pipelining. Other ongoing efforts include investigations into image reconstruction techniques, load balancing strategies, and improving performance.
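The first level of parallelism described above amounts to a block partition of slices along one axis. A small sketch of that assignment, with illustrative names (the abstract does not give the authors' actual data structures):

```python
# A small sketch of the slice distribution described above: split the
# volume's slices along one axis into contiguous, (near-)equal-sized
# sets, one per processor. Names are illustrative assumptions.

def assign_slices(n_slices, n_procs):
    """Return one contiguous range of slice indices per processor,
    sizes differing by at most one slice."""
    base, extra = divmod(n_slices, n_procs)
    assignment, start = [], 0
    for p in range(n_procs):
        count = base + (1 if p < extra else 0)
        assignment.append(range(start, start + count))
        start += count
    return assignment
```

Because each slice transforms independently, each processor can then work through its range with no communication, exactly as the abstract describes.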
Efficient distributed load balancing for parallel algorithms
2009 - 2010
With the advent of massively parallel processing technology, exploiting the power
offered by hundreds, or even thousands, of processors is far from a trivial task.
Computing with multi-processor, multi-core, or many-core systems adds a number of
additional challenges related to the cooperation and communication of multiple
processing units.
The uneven distribution of data among the various processors, i.e. the load
imbalance, represents one of the major problems in data parallel applications.
Without good load distribution strategies, we cannot achieve good speedup and,
consequently, good efficiency.
Load balancing strategies can be classified in several ways, according to the
methods used to balance the workload. For instance, dynamic load balancing
algorithms make scheduling decisions during execution and commonly result
in better performance than static approaches, where task assignment is
done before execution.
Even more important is the difference between centralized and distributed
load balancing approaches. Although centralized algorithms have a wider view
of the computation, and hence may exploit smarter balancing techniques,
they introduce global synchronization and communication bottlenecks at the
master node. This clearly does not assure scalability with the number of
processors.
This dissertation studies the impact of different load balancing strategies.
In particular, one of the key observations driving our work is that distributed
algorithms work better than centralized ones in the context of load balancing
for multi-processors (and likewise for multi-cores and many-cores).
We first show a centralized approach to load balancing, then propose several
distributed approaches for problems with different parallelization, workload
distribution, and communication patterns. We try to efficiently combine several
approaches to improve performance, in particular using predictive metrics
to obtain per-task compute-time estimates, using adaptive subdivision, improving
dynamic load balancing, and addressing distributed balancing schemas.
The main challenge tackled in this thesis has been to combine all these approaches
into new and efficient load balancing schemas.
We assess the proposed balancing techniques, from centralized approaches
to distributed ones, in distinctive real-case scenarios: mesh-like computation,
parallel ray tracing, and agent-based simulations. Moreover, we
test our algorithms on parallel hardware such as clusters of workstations and
multi-core processors, also exploiting SIMD vector instruction sets.
Finally, we conclude the thesis with several remarks about the impact of
distributed techniques, the effect of the communication pattern and workload
distribution, the use of cost estimation for adaptive partitioning, the trade-off
between speed and accuracy in prediction-based approaches, the effectiveness of work
stealing combined with sorting, and a non-trivial way to exploit hybrid CPU-GPU
computations.
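One of the conclusions above, work stealing combined with sorting, can be sketched in a few lines: each worker keeps its tasks sorted by estimated cost, and an idle worker steals the costliest remaining task from the busiest peer. The single-threaded round-robin model below is an illustrative assumption, not the thesis's distributed implementation.

```python
# A minimal sketch of work stealing combined with sorting: queues are
# sorted by estimated task cost, owners and thieves both take the
# costliest task available. The synchronous round-based model is an
# illustrative simplification of a real multi-threaded scheduler.

def run(workers):
    """workers: list of per-worker task-cost lists.
    Returns the total work each worker ends up executing."""
    queues = [sorted(w, reverse=True) for w in workers]  # costliest first
    done = [0.0] * len(queues)
    while any(queues):
        for i, q in enumerate(queues):
            if q:
                done[i] += q.pop(0)                      # run own costliest task
            else:
                victim = max(range(len(queues)), key=lambda j: len(queues[j]))
                if queues[victim]:
                    done[i] += queues[victim].pop(0)     # steal the costliest task
    return done
```

Stealing large tasks first is the point of the sorting: a thief that grabs a big task amortizes the cost of the steal, whereas stealing tiny tasks yields many steals for little rebalancing.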
Automatic Performance Optimization on Heterogeneous Computer Systems using Manycore Coprocessors
Emerging computer architectures and advanced computing technologies, such as Intel's Many Integrated Core (MIC) architecture and graphics processing units (GPUs), provide a promising way to employ parallelism for achieving high performance, scalability, and low power consumption. As a result, accelerators have become a crucial part of developing supercomputers. Accelerators are usually equipped with different types of cores and memory, which compels application developers to pursue challenging performance goals. The added complexity has led to the development of task-based runtime systems, which allow complex computations to be expressed as task graphs and rely on scheduling algorithms to perform load balancing across all resources of the platform. Developing good scheduling algorithms, even on a single node, and analyzing them can thus have a very high impact on the performance of current HPC systems. Load balancing strategies, at different levels, are critical for using heterogeneous hardware effectively and for reducing the impact of communication on energy and performance. Implementing efficient load balancing algorithms able to manage heterogeneous hardware can be a challenging task, especially when using a parallel programming model for distributed-memory architectures.
In this paper, we present several novel runtime approaches to determine the optimal data and task partitioning on heterogeneous platforms, targeting Intel Xeon Phi accelerated heterogeneous systems.
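A common building block for such runtime partitioning, sketched here under stated assumptions, is to split the data between host and accelerator in proportion to measured throughput. The device names and rates below are illustrative; this is not the paper's actual runtime.

```python
# A hedged sketch of throughput-proportional data partitioning between
# a host CPU and an accelerator. Device names and rates are
# illustrative assumptions, not the paper's measured values.

def partition(n_items, rates):
    """Split n_items across devices in proportion to their measured
    throughput (items per second), e.g. from a profiling round."""
    total = sum(rates.values())
    shares = {d: int(n_items * r / total) for d, r in rates.items()}
    # Give any rounding remainder to the fastest device.
    fastest = max(rates, key=rates.get)
    shares[fastest] += n_items - sum(shares.values())
    return shares
```

A runtime would re-measure the rates every few rounds, so the partition adapts as kernel behaviour or contention changes.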