1,586 research outputs found
Recommended from our members
Fine grain software pipelining of non-vectorizable nested loops
This paper presents a new technique to parallelize nested loops at the statement level. It transforms sequential nested loops, either vectorizable or not, into parallel ones. Previously, the wavefront method was used to parallelize non-vectorizable nested loops. However, in order to reduce the complexity of parallelization, the wavefront method regards an iteration as an unbreakable scheduling unit and draws parallelism through iteration overlapping. Our technique takes a statement rather than an iteration as the scheduling unit and exploits parallelism by overlapping the statements in all dimensions. In this paper, we show how this finer grain parallelization can be achieved with reasonable computational complexity, and the effectiveness of the resulting method in exploiting parallelism
Redesigning OP2 Compiler to Use HPX Runtime Asynchronous Techniques
Maximizing parallelism level in applications can be achieved by minimizing
overheads due to load imbalances and waiting time due to memory latencies.
Compiler optimization is one of the most effective solutions to tackle this
problem. The compiler is able to detect the data dependencies in an application
and is able to analyze the specific sections of code for parallelization
potential. However, all of these techniques provided with a compiler are
usually applied at compile time, so they rely on static analysis, which is
insufficient for achieving maximum parallelism and producing desired
application scalability. One solution to address this challenge is the use of
runtime methods. This strategy can be implemented by delaying certain amount of
code analysis to be done at runtime. In this research, we improve the parallel
application performance generated by the OP2 compiler by leveraging HPX, a C++
runtime system, to provide runtime optimizations. These optimizations include
asynchronous tasking, loop interleaving, dynamic chunk sizing, and data
prefetching. The results of the research were evaluated using an Airfoil
application which showed a 40-50% improvement in parallel performance.Comment: 18th IEEE International Workshop on Parallel and Distributed
Scientific and Engineering Computing (PDSEC 2017
Tackling Exascale Software Challenges in Molecular Dynamics Simulations with GROMACS
GROMACS is a widely used package for biomolecular simulation, and over the
last two decades it has evolved from small-scale efficiency to advanced
heterogeneous acceleration and multi-level parallelism targeting some of the
largest supercomputers in the world. Here, we describe some of the ways we have
been able to realize this through the use of parallelization on all levels,
combined with a constant focus on absolute performance. Release 4.6 of GROMACS
uses SIMD acceleration on a wide range of architectures, GPU offloading
acceleration, and both OpenMP and MPI parallelism within and between nodes,
respectively. The recent work on acceleration made it necessary to revisit the
fundamental algorithms of molecular simulation, including the concept of
neighborsearching, and we discuss the present and future challenges we see for
exascale simulation - in particular a very fine-grained task parallelism. We
also discuss the software management, code peer review and continuous
integration testing required for a project of this complexity.Comment: EASC 2014 conference proceedin
Recommended from our members
N-Dimensional Perfect Pipelining
In this paper, we introduce a technique to parallelize nested loops at the fine grain level. It is a generalization of Perfect Pipelining which was developed to parallelize a single-nested loop at the fine grain level. Previous techniques that can parallelize nested loops, e.g. DOACROSS or Wavefront method, mostly belong to the coarse grain approach. We explain our method, contrast it with the coarse grain techniques, and show the benefits of parallelizing nested loops at the fine grain level
Data locality and parallelism optimization using a constraint-based approach
Cataloged from PDF version of article.Embedded applications are becoming increasingly complex and processing ever-increasing datasets. In
the context of data-intensive embedded applications, there have been two complementary approaches to
enhancing application behavior, namely, data locality optimizations and improving loop-level parallelism.
Data locality needs to be enhanced to maximize the number of data accesses satisfied from the higher
levels of the memory hierarchy. On the other hand, compiler-based code parallelization schemes require
a fresh look for chip multiprocessors as interprocessor communication is much cheaper than off-chip
memory accesses. Therefore, a compiler needs to minimize the number of off-chip memory accesses. This
can be achieved by considering multiple loop nests simultaneously. Although compilers address these two
problems, there is an inherent difficulty in optimizing both data locality and parallelism simultaneously.
Therefore, an integrated approach that combines these two can generate much better results than each
individual approach. Based on these observations, this paper proposes a constraint network (CN)-based
formulation for data locality optimization and code parallelization. The paper also presents experimental
evidence, demonstrating the success of the proposed approach, and compares our results with those
obtained through previously proposed approaches. The experiments from our implementation indicate
that the proposed approach is very effective in enhancing data locality and parallelization.
© 2010 Elsevier Inc. All rights reserved
Nested-Loops Tiling for Parallelization and Locality Optimization
Data locality improvement and nested loops parallelization are two complementary and competing approaches for optimizing loop nests that constitute a large portion of computation times in scientific and engineering programs. While there are effective methods for each one of these, prior studies have paid less attention to address these two simultaneously. This paper proposes a unified approach that integrates these two techniques to obtain an appropriate locality conscious loop transformation to partition the loop iteration space into outer parallel tiled loops. The approach is based on the polyhedral model to achieve a multidimensional affine scheduling as a transformation that result the largest groups of tilable loops with maximum coarse grain parallelism, as far as possible. Furthermore, tiles will be scheduled on processor cores to exploit maximum data reuse through scheduling tiles with high volume of data sharing on the same core consecutively or on different cores with shared cache at around the same time
- …