71 research outputs found
Load Balancing Regular Meshes on SMPs with MPI
Domain decomposition for regular meshes on parallel computers has traditionally been performed by attempting to partition the work exactly among the available processors (now cores). However, these strategies often do not consider the inherent system noise, which can hinder MPI application scalability on emerging petascale machines with 10,000+ nodes. In this work, we suggest a solution that uses a tunable hybrid static/dynamic scheduling strategy that can be incorporated into current MPI implementations of mesh codes. By applying this strategy to a 3D Jacobi algorithm, we achieve performance gains of at least 16% for 64 SMP nodes.
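A minimal sketch of the hybrid static/dynamic idea described in this abstract, written with Python threads rather than MPI: a tunable fraction of the work items is pre-assigned to workers in contiguous blocks, and the remainder is handed out on demand from a shared counter. The function name `hybrid_schedule`, its parameters, and the one-item-at-a-time dynamic phase are assumptions made for illustration, not the authors' implementation.

```python
import threading


def hybrid_schedule(num_items, num_workers, static_fraction, work):
    """Run work(i) once for every i in range(num_items).

    Hypothetical helper: the first static_fraction of the items is split
    into contiguous per-worker blocks (static phase); the rest is claimed
    one item at a time from a shared counter (dynamic phase).
    Returns, per worker, the list of item indices it processed.
    """
    num_static = int(num_items * static_fraction)
    next_dynamic = num_static  # shared cursor into the dynamic region
    lock = threading.Lock()
    processed_by = [[] for _ in range(num_workers)]

    def worker(wid):
        nonlocal next_dynamic
        # Static phase: a fixed contiguous block, no synchronization needed.
        lo = wid * num_static // num_workers
        hi = (wid + 1) * num_static // num_workers
        for i in range(lo, hi):
            work(i)
            processed_by[wid].append(i)
        # Dynamic phase: workers that finish early absorb the leftover items.
        while True:
            with lock:
                i = next_dynamic
                if i >= num_items:
                    return
                next_dynamic += 1
            work(i)
            processed_by[wid].append(i)

    threads = [threading.Thread(target=worker, args=(w,))
               for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return processed_by


if __name__ == "__main__":
    out = [0] * 100

    def square(i):
        out[i] = i * i

    shares = hybrid_schedule(100, 4, 0.75, square)
    print([len(s) for s in shares])  # the split varies from run to run
```

Setting `static_fraction` to 1.0 gives a purely static partition and 0.0 a purely dynamic one. The tunable middle ground trades locality and low queue overhead against tolerance to noise on individual workers.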
Hybrid static/dynamic scheduling for already optimized dense matrix factorization
We present the use of a hybrid static/dynamic scheduling strategy of the task
dependency graph for direct methods used in dense numerical linear algebra.
This strategy provides a balance of data locality, load balance, and low
dequeue overhead. We show that using this scheduling in communication-avoiding
dense factorization leads to significant performance gains. On a 48-core
AMD Opteron NUMA machine, our experiments show that we can achieve up to
64% improvement over a version of CALU that uses fully dynamic scheduling, and
up to 30% improvement over the version of CALU that uses fully static
scheduling. On a 16-core Intel Xeon machine, our hybrid static/dynamic
scheduling approach is up to 8% faster than the versions of CALU that use
fully static or fully dynamic scheduling. Our algorithm leads to
speedups over the corresponding routines for computing LU factorization in
well-known libraries. On the 48-core AMD NUMA machine, our best implementation
is up to 110% faster than MKL, while on the 16-core Intel Xeon machine, it is
up to 82% faster than MKL. Our approach also shows significant speedups
compared with PLASMA on both of these systems.
Low-overhead scheduling for improving performance of scientific applications
Application performance can degrade significantly due to node-local load imbalances during application execution on a large number of SMP nodes. These imbalances can arise from the machine, the operating system, or the application itself. Although dynamic load balancing within a node can mitigate imbalances, such load balancing is challenging because of its impact on data movement and synchronization overhead. We developed a series of scheduling strategies that mitigate imbalances without incurring high overhead. Our strategies provide performance gains for various HPC codes and perform better than widely known scheduling strategies such as OpenMP guided scheduling. Our scheme and methodology allow applications to scale to next-generation clusters of SMPs with minimal application programmer intervention. We expect these techniques to be increasingly useful for future machines approaching exascale.
Performance Analysis of the Lattice Boltzmann Model Beyond Navier-Stokes
The lattice Boltzmann method is increasingly important in facilitating large-scale fluid dynamics simulations. To date, these simulations have been built on discretized velocity models of up to 27 neighbors. Recent work has shown that higher-order approximations of the continuum Boltzmann equation enable not only recovery of the Navier-Stokes hydrodynamics, but also simulations for a wider range of Knudsen numbers, which is especially important in micro- and nanoscale flows. These higher-order models have a significant impact on both the communication and computational complexity of the application. We present a performance study of the higher-order models as compared to the traditional ones, on both the IBM Blue Gene/P and Blue Gene/Q architectures. We study the tradeoffs of many optimization methods, such as the use of deep halo level ghost cells, that, alongside hybrid programming models, reduce the impact of extended models and enable efficient modeling of extreme regimes of computational fluid dynamics.
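The deep-halo optimization mentioned in this abstract can be illustrated with a small sketch. It is an assumption-laden stand-in, not the paper's code: a 1D three-point averaging stencil replaces the lattice Boltzmann update, two Python lists play the role of two subdomains, and the names `stencil_step`, `global_steps`, and `deep_halo_steps` are hypothetical. With `depth` ghost cells copied per exchange, each subdomain can take `depth` local steps before it needs fresh neighbor data, at the cost of redundantly updating the ghost cells.

```python
def stencil_step(u):
    """One sweep of 3-point averaging; the two end values are held fixed."""
    inner = [(u[i - 1] + u[i + 1]) / 2 for i in range(1, len(u) - 1)]
    return [u[0]] + inner + [u[-1]]


def global_steps(u, steps):
    """Reference: apply the stencil to the whole, undivided array."""
    for _ in range(steps):
        u = stencil_step(u)
    return u


def deep_halo_steps(u, depth, rounds):
    """Advance depth * rounds steps using two subdomains.

    Each round copies `depth` ghost cells across the interface once,
    then takes `depth` local steps with no further exchange. The stale
    outermost ghost value contaminates one more cell per step, so after
    `depth` steps exactly the owned cells are still correct.
    """
    mid = len(u) // 2
    assert 1 <= depth <= mid and depth <= len(u) - mid
    left, right = u[:mid], u[mid:]
    for _ in range(rounds):
        # Single exchange per round: append the neighbor's boundary cells.
        left_ext = left + right[:depth]
        right_ext = left[-depth:] + right
        for _ in range(depth):
            left_ext = stencil_step(left_ext)
            right_ext = stencil_step(right_ext)
        # Discard the ghost cells and keep only the owned region.
        left = left_ext[:mid]
        right = right_ext[depth:]
    return left + right


if __name__ == "__main__":
    u0 = [float((7 * i) % 11) for i in range(20)]
    same = deep_halo_steps(u0, 3, 4) == global_steps(u0, 12)
    print("deep halo matches global sweep:", same)
```

A deeper halo means fewer, larger messages but more redundant computation, which is the kind of tradeoff the abstract says the study examines.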
Limb reconstruction system as a primary and definitive mode of fixation in open fractures of long bones
Background: Management of open fractures of long bones by the traditional systems is very complex. The limb reconstruction system (LRS) is considered very effective and offers rigid stabilization of fracture fragments with easy access for soft tissue care. The aim of the study was to determine the efficacy of LRS for treatment of open fractures of long bones. Methods: This prospective study included 30 cases of both sexes aged between 11 and 60 years. Patients with closed fractures of long bones and fractures treated conservatively were excluded from the study. Clinical and radiological evaluation was performed at presentation and at specified intervals, and patients were assessed for signs of bone union and associated complications. Results: The mean age of the patients who participated in the study was 35.6 years, with male predominance (93.3%). All patients (100%) were injured in road traffic accidents. 50% of the cases were Grade 2 fractures. The most common complication encountered was pin tract infection, seen in 8 cases. We had good results in 24 patients, moderate in 5, and poor in 1 patient using the modified Anderson and Hutchinson's criteria. Conclusions: LRS is an alternative to the traditional system of fixation in the primary management of open fractures of long bones. It is less cumbersome for the patient and also reduces the financial burden. It is a definitive single-stage procedure.
MPI + MPI: a new hybrid approach to parallel programming with MPI plus shared memory
Hybrid parallel programming with the message passing interface (MPI) for internode communication in conjunction with a shared-memory programming model to manage intranode parallelism has become a dominant approach to scalable parallel programming. While this model provides a great deal of flexibility and performance potential, it saddles programmers with the complexity of utilizing two parallel programming systems in the same application. We introduce an MPI-integrated shared-memory programming model that is incorporated into MPI through a small extension to the one-sided communication interface. We discuss the integration of this interface with the MPI 3.0 one-sided semantics and describe solutions for providing portable and efficient data sharing, atomic operations, and memory consistency. We describe an implementation of the new interface in the MPICH2 and Open MPI implementations and demonstrate an average performance improvement of 40% to the communication component of a five-point stencil solver.
Cannot make do without you: Outsourcing by knowledge-intensive new firms in supplier networks
How do new firms operating in dynamic environments organize their operations? Building on transaction cost theory and the resource-based view, and using case study data from ten biotechnology start-ups and twenty of their suppliers, this research reveals that new firms outsourcing to highly embedded suppliers are likely to secure access to a wider supplier network, attain best-in-class operational knowledge, and avoid supplier opportunism while facing low levels of relationship-specific investments. New firms outsourcing to suppliers at the network periphery are more likely to realize cost efficiencies, but also to expose themselves to opportunism, uncertainty, and higher levels of relationship-specific investments while gaining only low levels of operational knowledge. We propose that new firms build five outsourcing competencies to realize benefits.
Development and evaluation of introgression lines with yield enhancing genes of the Indian mega-variety of rice, MTU1010
MTU1010 is an early-maturing and high-yielding mega rice variety widely grown over an area of 3 Mha. It is characterised by limited grain number and panicle branching. To improve the grain number in MTU1010, an IRRI breeding line, IR121055-2-10-5, was utilized as a donor to transfer the yield-enhancing genes Gn1a and OsSPL14 (associated with increased grain number and better panicle branching, respectively) into MTU1010 by Marker-Assisted Backcross Breeding (MABB). At each backcross generation, foreground selection was carried out with Gn1a- and OsSPL14-specific molecular markers, whilst background selection was done with a set of SSR markers polymorphic between IR121055-2-10-5 and MTU1010. With the use of a gene-specific marker, homozygous BC2F2 plants carrying the yield-enhancing gene were identified and advanced through the pedigree method of selection to BC2F6, and the ten best-performing lines were selected and evaluated in replicated station trials for yield-contributing traits, where grain number and branching per panicle exhibited a highly significant positive correlation with single-plant yield. Three promising lines, namely RP6353-5-8-13-24, RP6353-26-13-39-5 and RP6353-32-12-8-16, with higher grain number and yield than MTU1010 were identified and nominated for evaluation in the Initial Varietal Trial-Aerobic (IVT-Aerobic) of the All India Crop Improvement Programme on Rice (AICRP), of which RP6353-26-13-39-5 (IET28674) was promoted for further testing.