Domain-Specific Acceleration and Auto-Parallelization of Legacy Scientific Code in FORTRAN 77 using Source-to-Source Compilation
Massively parallel accelerators such as GPGPUs, manycores and FPGAs represent
a powerful and affordable tool for scientists who look to speed up simulations
of complex systems. However, porting code to such devices requires a detailed
understanding of heterogeneous programming tools and effective strategies for
parallelization. In this paper we present a source-to-source compilation
approach with whole-program analysis to automatically transform single-threaded
FORTRAN 77 legacy code into OpenCL-accelerated programs with parallelized
kernels.
The main contributions of our work are: (1) whole-source refactoring that allows
any subroutine in the code to be offloaded to an accelerator; (2) minimization
of the data transfer between the host and the accelerator by eliminating
redundant transfers; and (3) pragmatic auto-parallelization of the code to be
offloaded to the accelerator by identification of parallelizable maps and
reductions.
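The map/reduction distinction in contribution (3) can be illustrated with a toy loop classifier. This is a minimal sketch, not the paper's compiler: the loop IR, the helper `classify_loop`, and its tuple encoding of array accesses are all hypothetical.

```python
# Hypothetical sketch of map/reduction classification over a toy loop IR.
# An indexed access a(i) is encoded as the tuple ("a", "i"); a scalar as its name.

def classify_loop(body, loop_var):
    """Classify a loop body (list of (target, sources) assignments) as
    'map', 'reduction', or 'sequential'."""
    for target, sources in body:
        indexed = isinstance(target, tuple) and target[1] == loop_var
        if indexed and target not in sources:
            continue                 # each iteration writes its own element: map-like
        if not indexed and target in sources:
            return "reduction"       # a scalar accumulates across iterations
        return "sequential"          # a cross-iteration dependence we cannot rule out
    return "map"

# a(i) = b(i) + c(i): independent element-wise writes -> map
print(classify_loop([(("a", "i"), [("b", "i"), ("c", "i")])], "i"))  # map
# s = s + a(i): scalar accumulation with an associative operator -> reduction
print(classify_loop([("s", ["s", ("a", "i")])], "i"))                # reduction
```

A map can be emitted directly as an OpenCL kernel with one work-item per element; a reduction additionally needs a parallel combining tree.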
We have validated the code transformation performance of the compiler on the
NIST FORTRAN 78 test suite and several real-world codes: the Large Eddy
Simulator for Urban Flows, a high-resolution turbulent flow model; the shallow
water component of the ocean model Gmodel; the Linear Baroclinic Model, an
atmospheric climate model; and Flexpart-WRF, a particle dispersion simulator.
The automatic parallelization component has been tested on a 2-D Shallow
Water model (2DSW) and on the Large Eddy Simulator for Urban Flows (UFLES) and
produces a complete OpenCL-enabled code base. The fully OpenCL-accelerated
versions of the 2DSW and the UFLES are respectively 9x and 20x faster on GPU than the
original code on CPU; in both cases this matches the performance of manually
ported code.
Comment: 12 pages, 5 figures, submitted to "Computers and Fluids" as a full paper from the ParCFD conference entry
Non-Strict Independence-Based Program Parallelization Using Sharing and Freeness Information.
The current ubiquity of multi-core processors has brought renewed interest in program parallelization. Logic programs allow studying the parallelization of programs with complex, dynamic data structures with (declarative) pointers in a comparatively simple semantic setting. In this context, automatic parallelizers which exploit and-parallelism rely on notions of independence in order to ensure certain efficiency properties. “Non-strict” independence is a more relaxed notion than the traditional notion of “strict” independence which still ensures the relevant efficiency properties and can allow considerably more parallelism. Non-strict independence cannot be determined solely at run-time (“a priori”) and thus global analysis is a requirement. However, extracting non-strict independence information from available analyses and domains is non-trivial. This paper provides, on one hand, an extended presentation of our classic techniques for compile-time detection of non-strict independence based on extracting information from (abstract interpretation-based) analyses using the now well understood and popular Sharing + Freeness domain. This includes algorithms for combined compile-time/run-time detection which involve special run-time checks for this type of parallelism. In addition, we propose herein novel annotation (parallelization) algorithms, URLP and CRLP, which are specially suited to non-strict independence. We also propose new ways of using the Sharing + Freeness information to optimize how the run-time environments of goals are kept apart during parallel execution. Finally, we also describe the implementation of these techniques in our parallelizing compiler and recall some early performance results. We provide as well an extended description of our pictorial representation of sharing and freeness information.
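The strict/non-strict distinction can be sketched in a few lines. This is a deliberately simplified toy, not the paper's algorithms: goals are modelled only by the sets of run-time variables they reach, and the "free after the first goal" condition is a coarse stand-in for the Sharing + Freeness information the paper actually uses.

```python
# Toy sketch of independence checks between two goals, each modelled
# as the set of variables it can reach at run time.

def strictly_independent(vars_g1, vars_g2):
    """Strict independence: the two goals share no variables at all."""
    return not (vars_g1 & vars_g2)

def non_strict_candidate(vars_g1, vars_g2, free_after_g1):
    """Relaxed (non-strict-style) check: shared variables are tolerated
    when analysis shows they remain free (unbound) after the first goal,
    so parallel execution cannot expose a conflicting binding."""
    shared = vars_g1 & vars_g2
    return shared <= free_after_g1

g1, g2 = {"X", "Y"}, {"Y", "Z"}
print(strictly_independent(g1, g2))         # False: the goals share Y
print(non_strict_candidate(g1, g2, {"Y"}))  # True: Y stays free after g1
```

The point of the example is only that the relaxed check admits goal pairs the strict check rejects, which is where the extra parallelism comes from.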
Towards a High-Level Implementation of Execution Primitives for Unrestricted, Independent And-Parallelism
Most efficient implementations of parallel logic programming rely on complex low-level machinery which is arguably difficult to implement and modify. We explore an alternative approach aimed at taming that complexity by raising core parts of the implementation to the source language level for the particular case of and-parallelism. We handle a significant portion of the parallel implementation at the Prolog level with the help of a comparatively small number of concurrency-related primitives which take care of lower-level tasks such as locking, thread management, stack set management, etc. The approach does not eliminate altogether modifications to the abstract machine, but it does greatly simplify them and it also facilitates experimenting with different alternatives. We show how this approach allows implementing both restricted and unrestricted (i.e., non-fork-join) parallelism. Preliminary experiments show that the performance sacrificed is reasonable, although the granularity of unrestricted parallelism contributes to better observed speedups.
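The idea of raising parallelism primitives to the source level can be illustrated with two tiny operations, a fork that schedules a goal and a join that waits for its result. This is a hypothetical Python analogy, not the paper's Prolog primitives: the names `fork`, `join`, and `sum_range` are invented for the example.

```python
# Sketch of source-level parallelism primitives, loosely analogous to
# scheduling a goal for and-parallel execution and later waiting on it.
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor()

def fork(goal, *args):
    """Schedule a goal for parallel execution; return a handle."""
    return pool.submit(goal, *args)

def join(handle):
    """Wait for a forked goal and retrieve its result."""
    return handle.result()

def sum_range(lo, hi):
    return sum(range(lo, hi))

h = fork(sum_range, 0, 500)   # forked goal runs in another thread...
left = sum_range(500, 1000)   # ...while the current thread keeps working
print(join(h) + left)         # 499500
```

Because `join` is an ordinary call rather than the closing half of a lexically nested construct, it can be placed anywhere after the fork, which is what makes unrestricted (non-fork-join) schedules expressible.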
Software for Parallel Computing: the LAM Implementation of MPI
Many econometric problems can benefit from the application of parallel computing techniques, and recent advances in hardware and software have made such application feasible. There are a number of freely available software libraries that make it possible to write message passing parallel programs using personal computers or Unix workstations. This review discusses one of these: the LAM (Local Area Multicomputer) implementation of MPI (the Message Passing Interface).
The Removal of Numerical Drift from Scientific Models
Computer programs often behave differently under different compilers or in
different computing environments. Relative debugging is a collection of
techniques by which these differences are analysed. Differences may arise
because of different interpretations of errors in the code, because of bugs in
the compilers or because of numerical drift, and all of these were observed in
the present study. Numerical drift arises when small and acceptable differences
in values computed by different systems are integrated, so that the results
drift apart. This is well understood and need not degrade the validity of the
program results. Coding errors and compiler bugs may degrade the results and
should be removed. This paper describes a technique for the comparison of two
program runs which removes numerical drift and therefore exposes coding and
compiler errors. The procedure is highly automated and requires very little
intervention by the user. The technique is applied to the Weather Research and
Forecasting model, the most widely used weather and climate modelling code.
Comment: 12 pages
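The drift mechanism described above can be reproduced in miniature. This is a toy illustration, not the paper's tool: `step` stands in for one model time step, and the tiny `eps` models an acceptable per-step arithmetic difference between two systems. Copying one run's state into the other at each comparison point removes the accumulated drift, so any remaining difference would point to a coding or compiler error.

```python
# Toy illustration of numerical drift between two "runs" of the same model.

def step(x, eps):
    """One model time step; eps models a tiny roundoff-level difference."""
    return x * 1.001 + eps

def drift(n, reset):
    a = b = 1.0
    worst = 0.0
    for _ in range(n):
        a = step(a, 0.0)
        b = step(b, 1e-12)            # slightly different arithmetic
        worst = max(worst, abs(a - b))
        if reset:
            b = a                     # copy state across: drift cannot accumulate
    return worst

# Without the reset, the integrated differences grow by orders of magnitude.
print(drift(10000, reset=False) > drift(10000, reset=True))  # True
```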
Semi-automatic fault localization
One of the most expensive and time-consuming components of the debugging
process is locating the errors or faults. To locate faults, developers must identify
statements involved in failures and select suspicious statements that might contain
faults. In practice, this localization is done by developers in a tedious and manual
way, using only a single execution, targeting only one fault, and having a limited
perspective into a large search space.
The thesis of this research is that fault localization can be partially automated
with the use of commonly available dynamic information gathered from test-case
executions in a way that is effective, efficient, tolerant of test cases that pass but also
execute the fault, and scalable to large programs that potentially contain multiple
faults. The overall goal of this research is to develop effective and efficient fault
localization techniques that scale to programs of large size and with multiple faults.
There are three principal steps performed to reach this goal: (1) Develop practical
techniques for locating suspicious regions in a program; (2) Develop techniques to
partition test suites into smaller, specialized test suites to target specific faults; and
(3) Evaluate the usefulness and cost of these techniques.
In this dissertation, the difficulties and limitations of previous work in the area
of fault-localization are explored. A technique, called Tarantula, is presented that
addresses these difficulties. Empirical evaluation of the Tarantula technique shows
that it is efficient and effective for many faults. The evaluation also demonstrates
that the Tarantula technique can lose effectiveness as the number of faults increases.
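Tarantula's core is a suspiciousness score computed per statement from test coverage, as commonly published: statements executed mostly by failing tests score near 1, those executed mostly by passing tests near 0. The sketch below uses hypothetical variable names but the standard formula.

```python
# Tarantula suspiciousness: (failed ratio) / (passed ratio + failed ratio).

def tarantula(passed, failed, total_passed, total_failed):
    """passed/failed: counts of passing/failing tests that execute the statement."""
    fail_ratio = failed / total_failed if total_failed else 0.0
    pass_ratio = passed / total_passed if total_passed else 0.0
    denom = pass_ratio + fail_ratio
    return fail_ratio / denom if denom else 0.0

# A statement covered by all 3 failing tests but only 1 of 7 passing tests
# is highly suspicious:
print(round(tarantula(passed=1, failed=3, total_passed=7, total_failed=3), 3))  # 0.875
```

Ranking statements by this score gives developers an ordered list of places to inspect, which is where the efficiency claims in the evaluation come from.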
To address the loss of effectiveness for programs with multiple faults, supporting
techniques have been developed and are presented. The empirical evaluation of these
supporting techniques demonstrates that they can enable effective fault localization in
the presence of multiple faults. A new mode of debugging, called parallel debugging, is
developed and empirical evidence demonstrates that it can provide a savings in terms
of both total expense and time to delivery. A prototype visualization is provided to
display the fault-localization results as well as to provide a method to interact and
explore those results. Finally, a study on the effects of the composition of test suites
on fault-localization is presented.
Ph.D. Committee Chair: Harrold, Mary Jean; Committee Member: Orso, Alessandro; Committee Member: Pande, Santosh; Committee Member: Reiss, Steven; Committee Member: Rugaber, Spence
Interactive Trace-Based Analysis Toolset for Manual Parallelization of C Programs
Massive amounts of legacy sequential code need to be parallelized to make better use of modern multiprocessor architectures. Nevertheless, writing parallel programs is still a difficult task. Automated parallelization methods can be effective both at the statement and loop levels and, recently, at the task level, but they are still restricted to specific source code constructs or application domains. We present in this article an innovative toolset that supports developers when performing manual code analysis and parallelization decisions. It automatically collects and represents the program profile and data dependencies in an interactive graphical format that facilitates the analysis and discovery of manual parallelization opportunities. The toolset can be used for arbitrary sequential C programs and parallelization patterns. Also, its program-scope data dependency tracing at runtime can complement the tools based on static code analysis and can also benefit from it at the same time. We also tested the effectiveness of the toolset in terms of time to reach parallelization decisions and of their quality. We measured a significant improvement for several real-world representative applications
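The run-time dependency tracing mentioned above can be sketched with a last-writer table. This is a minimal hypothetical illustration, not the article's toolset: recording the last statement to write each location turns every later read into a read-after-write dependence edge, and the resulting edges are what rule parallelization in or out.

```python
# Minimal sketch of dynamic data-dependence tracing via a last-writer table.

last_writer = {}   # location -> last statement that wrote it
deps = set()       # read-after-write dependence edges (writer, reader)

def write(addr, stmt):
    last_writer[addr] = stmt

def read(addr, stmt):
    if addr in last_writer and last_writer[addr] != stmt:
        deps.add((last_writer[addr], stmt))   # edge: writer -> reader

# Trace of a tiny program: S1 writes x; S2 reads x and writes y; S3 reads y.
write("x", "S1")
read("x", "S2"); write("y", "S2")
read("y", "S3")
print(sorted(deps))  # [('S1', 'S2'), ('S2', 'S3')]
```

A graphical front end over such edges, weighted by profile data, is essentially what lets a developer see which program regions are safe and profitable to parallelize by hand.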