2,888 research outputs found
Dynamic Load Balancing for Compressible Multiphase Turbulence
CMT-nek is a new scientific application for performing high fidelity
predictive simulations of particle laden explosively dispersed turbulent flows.
CMT-nek involves detailed simulations, is compute intensive and is targeted to
be deployed on exascale platforms. The moving particles are the main source of
load imbalance as the application is executed on parallel processors. In a
demonstration problem, all the particles are initially in a closed container
until a detonation occurs and the particles move apart. If all processors get
an equal share of the fluid domain, then only some of the processors get
sections of the domain that are initially laden with particles, leading to
disparate load on the processors. In order to eliminate load imbalance in
different processors and to speedup the makespan, we present different load
balancing algorithms for CMT-nek on large scale multi-core platforms consisting
of hundred of thousands of cores. The detailed process of the load balancing
algorithms are presented. The performance of the different load balancing
algorithms are compared and the associated overheads are analyzed. Evaluations
on the application with and without load balancing are conducted and these show
that with load balancing, simulation time becomes faster by a factor of up to
.Comment: This paper has been accepted by ACM International Conference on
Supercomputing (ICS) 201
A lightweight, flow-based toolkit for parallel and distributed bioinformatics pipelines
<p>Abstract</p> <p>Background</p> <p>Bioinformatic analyses typically proceed as chains of data-processing tasks. A pipeline, or 'workflow', is a well-defined protocol, with a specific structure defined by the topology of data-flow interdependencies, and a particular functionality arising from the data transformations applied at each step. In computer science, the dataflow programming (DFP) paradigm defines software systems constructed in this manner, as networks of message-passing components. Thus, bioinformatic workflows can be naturally mapped onto DFP concepts.</p> <p>Results</p> <p>To enable the flexible creation and execution of bioinformatics dataflows, we have written a modular framework for parallel pipelines in Python ('PaPy'). A PaPy workflow is created from re-usable components connected by data-pipes into a directed acyclic graph, which together define nested higher-order map functions. The successive functional transformations of input data are evaluated on flexibly pooled compute resources, either local or remote. Input items are processed in batches of adjustable size, all flowing one to tune the trade-off between parallelism and lazy-evaluation (memory consumption). An add-on module ('NuBio') facilitates the creation of bioinformatics workflows by providing domain specific data-containers (<it>e.g</it>., for biomolecular sequences, alignments, structures) and functionality (<it>e.g</it>., to parse/write standard file formats).</p> <p>Conclusions</p> <p>PaPy offers a modular framework for the creation and deployment of parallel and distributed data-processing workflows. Pipelines derive their functionality from user-written, data-coupled components, so PaPy also can be viewed as a lightweight toolkit for extensible, flow-based bioinformatics data-processing. The simplicity and flexibility of distributed PaPy pipelines may help users bridge the gap between traditional desktop/workstation and grid computing. PaPy is freely distributed as open-source Python code at <url>http://muralab.org/PaPy</url>, and includes extensive documentation and annotated usage examples.</p
Mapping applications onto FPGA-centric clusters
High Performance Computing (HPC) is becoming increasingly important throughout science and engineering as ever more complex problems must be solved through computational simulations. In these large computational applications, the latency of communication between processing nodes is often the key factor that limits performance. An emerging alternative computer architecture that addresses the latency problem is the FPGA-centric cluster (FCC); in these systems, the devices (FPGAs) are directly interconnected and thus many layers of hardware and software are avoided. The result can be scalability not currently achievable with other technologies.
In FCCs, FPGAs serve multiple functions: accelerator, network interface card (NIC), and router. Moreover, because FPGAs are configurable, there is substantial opportunity to tailor the router hardware to the application; previous work has demonstrated that such application-aware configuration can effect a substantial improvement in hardware efficiency. One constraint of FCCs is that it is convenient for their interconnect to be static, direct, and have a two or three dimensional mesh topology. Thus, applications that are naturally of a different dimensionality (have a different logical topology) from that of the FCC must be remapped to obtain optimal performance.
In this thesis we study various aspects of the mapping problem for FCCs. There are two major research thrusts. The first is finding the optimal mapping of logical to physical topology. This problem has received substantial attention by both the theory community, where topology mapping is referred to as graph embedding, and by the High Performance Computing (HPC) community, where it is a question of process placement. We explore the implications of the different mapping strategies on communication behavior in FCCs, especially on resulting load imbalance.
The second major research thrust is built around the hypothesis that applications that need to be remapped (due to differing logical and physical topologies) will have different optimal router configurations from those applications that do not. For example, due to remapping, some virtual or physical communication links may have little occupancy; therefore fewer resources should be allocated to them. Critical here is the creation of a new set of parameterized hardware features that can be configured to best handle load imbalances caused by remapping. These two thrusts form a codesign loop: certain mapping algorithms may be differentially optimal due to application-aware router reconfiguration that accounts for this mapping.
This thesis has four parts. The first part introduces the background and previous work related to communication in general and, in particular, how it is implemented in FCCs. We build on previous work on application-aware router configuration. The second part introduces topology mapping mechanisms including those derived from graph embeddings and a greedy algorithm commonly used in HPC. In the third part, topology mappings are evaluated for performance and imbalance; we note that different mapping strategies lead to different imbalances both in the overall network and in each node. The final part introduces reconfigure router design that allocates resources based on different imbalance situations caused by different mapping behaviors
Dynamic load balancing of parallel road traffic simulation
The objective of this research was to investigate, develop and evaluate dynamic
load-balancing strategies for parallel execution of microscopic road traffic simulations. Urban road traffic simulation presents irregular, and dynamically varying
distributed computational load for a parallel processor system. The dynamic
nature of road traffic simulation systems lead to uneven load distribution during simulation, even for a system that starts off with even load distributions. Load balancing is a potential way of achieving improved performance by reallocating
work from highly loaded processors to lightly loaded processors leading to
a reduction in the overall computational time. In dynamic load balancing,
workloads are adjusted continually or periodically throughout the computation.
In this thesis load balancing strategies were evaluated and some load balancing
policies developed. A load index and a profitability determination algorithms
were developed. These were used to enhance two load balancing algorithms. One
of the algorithms exhibits local communications and distributed load evaluation
between the neighbour partitions (diffusion algorithm) and the other algorithm
exhibits both local and global communications while the decision making is
centralized (MaS algorithm). The enhanced algorithms were implemented and
synthesized with a research parallel traffic simulation. The performance of the
research parallel traffic simulator, optimized with the two modified dynamic load balancing strategies were studied
Many-Task Computing and Blue Waters
This report discusses many-task computing (MTC) generically and in the
context of the proposed Blue Waters systems, which is planned to be the largest
NSF-funded supercomputer when it begins production use in 2012. The aim of this
report is to inform the BW project about MTC, including understanding aspects
of MTC applications that can be used to characterize the domain and
understanding the implications of these aspects to middleware and policies.
Many MTC applications do not neatly fit the stereotypes of high-performance
computing (HPC) or high-throughput computing (HTC) applications. Like HTC
applications, by definition MTC applications are structured as graphs of
discrete tasks, with explicit input and output dependencies forming the graph
edges. However, MTC applications have significant features that distinguish
them from typical HTC applications. In particular, different engineering
constraints for hardware and software must be met in order to support these
applications. HTC applications have traditionally run on platforms such as
grids and clusters, through either workflow systems or parallel programming
systems. MTC applications, in contrast, will often demand a short time to
solution, may be communication intensive or data intensive, and may comprise
very short tasks. Therefore, hardware and software for MTC must be engineered
to support the additional communication and I/O and must minimize task dispatch
overheads. The hardware of large-scale HPC systems, with its high degree of
parallelism and support for intensive communication, is well suited for MTC
applications. However, HPC systems often lack a dynamic resource-provisioning
feature, are not ideal for task communication via the file system, and have an
I/O system that is not optimized for MTC-style applications. Hence, additional
software support is likely to be required to gain full benefit from the HPC
hardware
Integrating Algorithmic and Systemic Load Balancing Strategies in Parallel Scientific Applications
Load imbalance is a major source of performance degradation in parallel scientific applications. Load balancing increases the efficient use of existing resources and improves performance of parallel applications running in distributed environments. At a coarse level of granularity, advances in runtime systems for parallel programs have been proposed in order to control available resources as efficiently as possible by utilizing idle resources and using task migration. At a finer granularity level, advances in algorithmic strategies for dynamically balancing computational loads by data redistribution have been proposed in order to respond to variations in processor performance during the execution of a given parallel application. Algorithmic and systemic load balancing strategies have complementary set of advantages. An integration of these two techniques is possible and it should result in a system, which delivers advantages over each technique used in isolation. This thesis presents a design and implementation of a system that combines an algorithmic fine-grained data parallel load balancing strategy called Fractiling with a systemic coarse-grained task-parallel load balancing system called Hector. It also reports on experimental results of running N-body simulations under this integrated system. The experimental results indicate that a distributed runtime environment, which combines both algorithmic and systemic load balancing strategies, can provide performance advantages with little overhead, underscoring the importance of this approach in large complex scientific applications
Simultaneous input and output matrix partitioning for outer-product-parallel sparse matrix-matrix multiplication
Cataloged from PDF version of article.FFor outer-product-parallel sparse matrix-matrix multiplication (SpGEMM) of the form C=A×B, we propose three hypergraph models that achieve simultaneous partitioning of input and output matrices without any replication of input data. All three hypergraph models perform conformable one-dimensional (1D) columnwise and 1D rowwise partitioning of the input matrices A and B, respectively. The first hypergraph model performs two-dimensional (2D) nonzero-based partitioning of the output matrix, whereas the second and third models perform 1D rowwise and 1D columnwise partitioning of the output matrix, respectively. This partitioning scheme induces a two-phase parallel SpGEMM algorithm, where communication-free local SpGEMM computations constitute the first phase and the multiple single-node-accumulation operations on the local SpGEMM results constitute the second phase. In these models, the two partitioning constraints defined on weights of vertices encode balancing computational loads of processors during the two separate phases of the parallel SpGEMM algorithm. The partitioning objective of minimizing the cutsize defined over the cut nets encodes minimizing the total volume of communication that will occur during the second phase of the parallel SpGEMM algorithm. An MPI-based parallel SpGEMM library is developed to verify the validity of our models in practice. Parallel runs of the library for a wide range of realistic SpGEMM instances on two large-scale parallel systems JUQUEEN (an IBM BlueGene/Q system) and SuperMUC (an Intel-based cluster) show that the proposed hypergraph models attain high speedup values. © 2014 Society for Industrial and Applied Mathematics
- …