3,704 research outputs found
An OpenMP based Parallelization Compiler for C Applications
Directive-drive programming models, such as OpenMP, are one solution for exploiting the potential of multi-core architectures, and enable developers to accelerate software applications by adding annotations on for-type loops and other code regions. However, manual parallelization of applications is known to be a non trivial and time consuming process, requiring parallel programming skills. Automatic parallelization approaches can reduce the burden on the application development side. This paper presents an OpenMP based automatic parallelization compiler, named AutoPar-Clava, for automatic identification and annotation of loops in C code. By using static analysis, parallelizable regions are detected, and a compilable OpenMP parallel code from the sequential version is produced. In order to reduce the accesses to shared memory by each thread, each variable is categorized into the proper OpenMP scoping. Also, AutoPar-Clava is able to support reduction on arrays, which is available since OpenMP 4.5. The effectiveness of AutoPar-Clava is evaluated by means of the Polyhedral Benchmark suite, and targeting a N-cores x86-based computing platform. The achieved results are very promising and compare favorably with closely related auto-parallelization compilers such as Intel C/C++ Compiler (i.e., icc), ROSE, TRACO, and Cetus
Recommended from our members
Experiences in porting mini-applications to OpenACC and OpenMP on heterogeneous systems
This article studies mini-applications—Minisweep, GenASiS, GPP, and FF—that use computational methods commonly encountered in HPC. We have ported these applications to develop OpenACC and OpenMP versions, and evaluated their performance on Titan (Cray XK7 with K20x GPUs), Cori (Cray XC40 with Intel KNL), Summit (IBM AC922 with Volta GPUs), and Cori-GPU (Cray CS-Storm 500NX with Intel Skylake and Volta GPUs). Our goals are for these new ports to be useful to both application and compiler developers, to document and describe the lessons learned and the methodology to create optimized OpenMP and OpenACC versions, and to provide a description of possible migration paths between the two specifications. Cases where specific directives or code patterns result in improved performance for a given architecture are highlighted. We also include discussions of the functionality and maturity of the latest compilers available on the above platforms with respect to OpenACC or OpenMP implementations
Domain-Specific Acceleration and Auto-Parallelization of Legacy Scientific Code in FORTRAN 77 using Source-to-Source Compilation
Massively parallel accelerators such as GPGPUs, manycores and FPGAs represent
a powerful and affordable tool for scientists who look to speed up simulations
of complex systems. However, porting code to such devices requires a detailed
understanding of heterogeneous programming tools and effective strategies for
parallelization. In this paper we present a source to source compilation
approach with whole-program analysis to automatically transform single-threaded
FORTRAN 77 legacy code into OpenCL-accelerated programs with parallelized
kernels.
The main contributions of our work are: (1) whole-source refactoring to allow
any subroutine in the code to be offloaded to an accelerator. (2) Minimization
of the data transfer between the host and the accelerator by eliminating
redundant transfers. (3) Pragmatic auto-parallelization of the code to be
offloaded to the accelerator by identification of parallelizable maps and
reductions.
We have validated the code transformation performance of the compiler on the
NIST FORTRAN 78 test suite and several real-world codes: the Large Eddy
Simulator for Urban Flows, a high-resolution turbulent flow model; the shallow
water component of the ocean model Gmodel; the Linear Baroclinic Model, an
atmospheric climate model and Flexpart-WRF, a particle dispersion simulator.
The automatic parallelization component has been tested on as 2-D Shallow
Water model (2DSW) and on the Large Eddy Simulator for Urban Flows (UFLES) and
produces a complete OpenCL-enabled code base. The fully OpenCL-accelerated
versions of the 2DSW and the UFLES are resp. 9x and 20x faster on GPU than the
original code on CPU, in both cases this is the same performance as manually
ported code.Comment: 12 pages, 5 figures, submitted to "Computers and Fluids" as full
paper from ParCFD conference entr
An LLVM Instrumentation Plug-in for Score-P
Reducing application runtime, scaling parallel applications to higher numbers
of processes/threads, and porting applications to new hardware architectures
are tasks necessary in the software development process. Therefore, developers
have to investigate and understand application runtime behavior. Tools such as
monitoring infrastructures that capture performance relevant data during
application execution assist in this task. The measured data forms the basis
for identifying bottlenecks and optimizing the code. Monitoring infrastructures
need mechanisms to record application activities in order to conduct
measurements. Automatic instrumentation of the source code is the preferred
method in most application scenarios. We introduce a plug-in for the LLVM
infrastructure that enables automatic source code instrumentation at
compile-time. In contrast to available instrumentation mechanisms in
LLVM/Clang, our plug-in can selectively include/exclude individual application
functions. This enables developers to fine-tune the measurement to the required
level of detail while avoiding large runtime overheads due to excessive
instrumentation.Comment: 8 page
- …