73 research outputs found
A Rules Based Approach to Analyze Data Dependent Transformation Strategies of a Supercompiler for Parallel Computers.
A supercompiler is a program that attempts to automatically restructure serial code into an equivalent parallel form. This restructuring is achieved by applying various transformation strategies designed to remove data dependences. A data dependence is a relation between two program statements that prevents those statements from being executed in parallel. This research develops a rules-based system to analyze the various data-dependent transformation strategies of a supercompiler for parallel computers. Using information obtained from user input and from the automated analysis of a program segment, this rules-based analysis determines which of the available transformation strategies is the optimal one to apply to that particular program segment.
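A strategy selector of this kind can be sketched as an ordered list of (condition, transformation) rules evaluated against dependence facts gathered for a loop. The rules and field names below are illustrative assumptions, not the paper's actual rule set:

```python
# Minimal sketch of a rules-based transformation selector.
# The rules and the loop-description fields are hypothetical examples,
# not the rule set developed in the research described above.

def select_transformation(loop):
    """Pick a transformation for a loop described by a dict of
    dependence facts from automated analysis and user input."""
    rules = [
        # (condition, transformation) pairs, checked in priority order.
        (lambda l: not l["carried_dependence"], "vectorize"),
        (lambda l: l["carried_dependence"]
                   and l["dependence_is_scalar"], "scalar expansion"),
        (lambda l: l["carried_dependence"]
                   and l["distance"] is not None
                   and l["distance"] > 1, "strip-mine by distance"),
    ]
    for condition, transformation in rules:
        if condition(loop):
            return transformation
    return "leave serial"

loop = {"carried_dependence": True, "dependence_is_scalar": False, "distance": 4}
print(select_transformation(loop))  # strip-mine by distance
```

Ordering the rules by priority lets the system fall back to "leave serial" only when no dependence-removing strategy applies.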
Parafrase restructuring of FORTRAN code for parallel processing
Parafrase transforms a FORTRAN code, subroutine by subroutine, into a parallel code for a vector and/or shared-memory multiprocessor system. Parafrase is not a compiler; it transforms the code and provides information for vector or concurrent processing. Parafrase uses data dependence analysis to reveal parallelism among instructions. The data dependence test distinguishes between recurrences and statements that can be directly vectorized or parallelized. A number of transformations are required to build a data dependence graph.
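The recurrence-versus-vectorizable distinction can be illustrated on the simplest case, a statement that references its own array. This is a deliberately simplified sketch of the kind of distinction the dependence test draws, not Parafrase's actual test, which handles general subscripts:

```python
# Simplified sketch: classify the loop body  a[i] = f(a[i - k])
# by its self-dependence distance k. Parafrase's real dependence
# test is far more general; this only models one statement form.

def classify(read_back_offset):
    """Offset 0 means the statement reads only the element it writes,
    so iterations are independent and the loop vectorizes; a positive
    offset is a recurrence carried across iterations."""
    if read_back_offset == 0:
        return "vectorizable"
    return "recurrence (loop-carried, distance %d)" % read_back_offset

print(classify(0))  # a[i] = f(a[i]):   vectorizable
print(classify(1))  # a[i] = f(a[i-1]): recurrence (loop-carried, distance 1)
```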
A Simple Data Dependency Analyzer For C Programs.
Data dependencies that exist in a sequential program are a major hindrance to parallelization.
Overview of the MPEG Reconfigurable Video Coding Framework
Video coding technology in the last 20 years has evolved, producing a variety of different and complex algorithms and coding standards. So far the specification of such standards, and of the algorithms that build them, has been done case by case, providing monolithic textual and reference-software specifications in different forms and programming languages. However, very little attention has been given to providing a specification formalism that explicitly presents the components common between standards, and the incremental modifications of such monolithic standards. The MPEG Reconfigurable Video Coding (RVC) framework is a new ISO standard, currently in its final stage of standardization, aiming to provide video codec specifications at the level of library components instead of monolithic algorithms. The new concept is to be able to specify a decoder of an existing standard, or a completely new configuration that may better satisfy application-specific constraints, by selecting standard components from a library of standard coding algorithms. The possibility of dynamic configuration and reconfiguration of codecs also requires new methodologies and new tools for describing the new bitstream syntaxes and the parsers of such new codecs. The RVC framework is based on the usage of a new actor/dataflow-oriented language called CAL for the specification of the standard library and the instantiation of the RVC decoder model. This language has been specifically designed for modeling complex signal processing systems. CAL dataflow models expose the intrinsic concurrency of the algorithms by employing the notions of actor programming and dataflow. The paper gives an overview of the concepts and technologies building the standard RVC framework and the non-standard tools supporting the RVC model, from the instantiation and simulation of the CAL model to software and/or hardware code synthesis.
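The actor/dataflow style that CAL is built around can be sketched in a few lines: an actor fires when its firing rule is satisfied (here, one token on each input queue), consumes tokens, and produces results on an output queue. This Python model only illustrates the programming notion; it is not the CAL language:

```python
from collections import deque

# Illustrative model of actor/dataflow execution: an actor with a
# simple firing rule (one token required on each input FIFO).
# This sketches the concept behind CAL, not CAL itself.

class Actor:
    def __init__(self, fn, inputs, output):
        self.fn, self.inputs, self.output = fn, inputs, output

    def can_fire(self):
        return all(len(q) > 0 for q in self.inputs)

    def fire(self):
        tokens = [q.popleft() for q in self.inputs]  # consume one token each
        self.output.append(self.fn(*tokens))         # produce one token

src_a, src_b, out = deque([1, 2, 3]), deque([10, 20, 30]), deque()
adder = Actor(lambda x, y: x + y, [src_a, src_b], out)
while adder.can_fire():
    adder.fire()
print(list(out))  # [11, 22, 33]
```

Because each actor only touches its own FIFOs, independent actors in such a network can fire concurrently, which is how dataflow models expose the intrinsic parallelism of an algorithm.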
Dependence testing and vectorization of multimedia agents
We present a dependence-testing algorithm that considers the short width of modern SIMD registers in a typical microprocessor. The test works by solving the dependence system with the generalized GCD algorithm and then simplifying the solution equations for a particular set of dependence distances. We start by simplifying each solution lattice to generate points that satisfy some small constant dependence distance corresponding to the width of the register being used. We use the Power Test to efficiently perform Fourier-Motzkin variable elimination on the simplified systems in order to determine whether dependences exist. The improvements described in this paper also extend our SIMD dependence test to loops with symbolic and triangular lower and upper bounds, as well as array indices that contain unknown symbolic additive constants. The resulting analysis is used to guide the vectorization pass of a dynamic multimedia compiler used to compile software agents that process audio, video, and image data. We fully detail the proposed dependence test in this paper, including related work.
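The starting point for GCD-based dependence testing is the classic one-dimensional GCD test: a write to A[a*i + c1] and a read of A[b*j + c2] can touch the same element only if gcd(a, b) divides c2 - c1. The sketch below shows only this basic test; the paper's algorithm goes further, simplifying the solution lattice and applying Fourier-Motzkin elimination:

```python
from math import gcd

# The classic GCD dependence test for one-dimensional subscripts.
# A dependence between  A[a*i + c1]  (write) and  A[b*j + c2]  (read)
# requires an integer solution to  a*i - b*j = c2 - c1,  which exists
# iff gcd(a, b) divides c2 - c1.  (Loop bounds are ignored here; the
# full test in the paper also bounds the solutions.)

def gcd_test(a, b, c1, c2):
    g = gcd(a, b)
    if g == 0:                      # both coefficients zero
        return c1 == c2
    return (c2 - c1) % g == 0

print(gcd_test(2, 2, 0, 1))  # False: A[2i] and A[2j+1] can never overlap
print(gcd_test(1, 1, 0, 3))  # True:  a dependence of distance 3 is possible
```

The test is conservative: a True answer only means a dependence *may* exist, which is why tighter follow-up tests such as Fourier-Motzkin elimination are applied afterwards.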
Transformations of High-Level Synthesis Codes for High-Performance Computing
Specialized hardware architectures promise a major step in performance and energy efficiency over the traditional load/store devices currently employed in large scale computing systems. The adoption of high-level synthesis (HLS) from languages such as C/C++ and OpenCL has greatly increased programmer productivity when designing for such platforms. While this has enabled a wider audience to target specialized hardware, the optimization principles known from traditional software design are no longer sufficient to implement high-performance codes. Fast and efficient codes for reconfigurable platforms are thus still challenging to design. To alleviate this, we present a set of optimizing transformations for HLS, targeting scalable and efficient architectures for high-performance computing (HPC) applications. Our work provides a toolbox for developers, where we systematically identify classes of transformations, the characteristics of their effect on the HLS code and the resulting hardware (e.g., increases data reuse or resource consumption), and the objectives that each transformation can target (e.g., resolve interface contention, or increase parallelism). We show how these can be used to efficiently exploit pipelining, on-chip distributed fast memory, and on-chip streaming dataflow, allowing for massively parallel architectures. To quantify the effect of our transformations, we use them to optimize a set of throughput-oriented FPGA kernels, demonstrating that our enhancements are sufficient to scale up parallelism within the hardware constraints. With the transformations covered, we hope to establish a common framework for performance engineers, compiler developers, and hardware developers, to tap into the performance potential offered by specialized hardware architectures using HLS.
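One transformation class of this kind, increasing data reuse via an on-chip shift register, can be modeled in software. The sketch below contrasts a 1-D stencil that re-reads memory with a streaming form that reads each input element exactly once; it is a Python model of the hardware behavior under assumed access costs, not actual HLS code:

```python
# Model of a data-reuse transformation: rewrite a 1-D stencil so that
# each input element is read from (off-chip) memory once, buffering
# recent values in a small shift register. Illustrative sketch only.

def stencil_naive(x):
    # Reads x[i-1], x[i], x[i+1] for every output: 3 memory reads/element.
    return [x[i - 1] + x[i] + x[i + 1] for i in range(1, len(x) - 1)]

def stencil_streaming(x):
    out, window = [], [0, 0, 0]       # 3-element shift register
    for i, v in enumerate(x):         # each x[i] streamed in exactly once
        window = [window[1], window[2], v]
        if i >= 2:                    # window now holds x[i-2..i]
            out.append(sum(window))
    return out

data = list(range(10))
assert stencil_streaming(data) == stencil_naive(data)
```

In hardware, the streaming form turns three wide memory ports into one, resolving interface contention and enabling a fully pipelined datapath.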
Static Analysis of Upper and Lower Bounds on Dependences and Parallelism
Existing compilers often fail to parallelize sequential code, even when a program can be manually transformed into parallel form by a sequence of well-understood transformations (as is the case for many of the Perfect Club Benchmark programs). These failures can occur for several reasons: the code transformations implemented in the compiler may not be sufficient to produce parallel code, the compiler may not find the proper sequence of transformations, or the compiler may not be able to prove that one of the necessary transformations is legal. When a compiler cannot extract sufficient parallelism from a program, the programmer must extract the additional parallelism. Unfortunately, the programmer is typically left to search for parallelism without significant assistance. The compiler generally does not give feedback about which parts of the program might contain additional parallelism, or about the types of transformations that might be needed to realize this parallelism. Standard program transformations and dependence abstractions cannot be used to provide this feedback. In this paper, we propose a two-step approach to the search for parallelism in sequential programs. We first construct several sets of constraints that describe, for each statement, which iterations of that statement can be executed concurrently. By constructing constraints that correspond to different assumptions about which dependences might be eliminated through additional analysis, transformations, and user assertions, we can determine whether we can expose parallelism by eliminating dependences. In the second step of our search for parallelism, we examine these constraint sets to identify the kinds of transformations that are needed to exploit scalable parallelism. Our tests identify conditional parallelism and parallelism that can be exposed by combinations of transformations that reorder the iteration space (such as loop interchange and loop peeling). This approach lets us distinguish inherently sequential code from code that contains unexploited parallelism. It also produces information about the kinds of transformations that will be needed to parallelize the code, without worrying about the order of application of the transformations. Furthermore, when our dependence test is inexact, we can identify which unresolved dependences inhibit parallelism by comparing the effects of assuming dependence or independence.
We are currently exploring the use of this information in programmer-assisted parallelization.
(Also cross-referenced as UMIACS-TR-94-40)
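The first step, and the dependence-versus-independence comparison, can be sketched for a single loop. The function below builds the set of iteration pairs that may run concurrently under a given set of dependence distances, using a single-hop reachability approximation; the names and the simplification are illustrative assumptions, not the paper's constraint formulation:

```python
# Sketch of the first step: for one loop of n iterations, compute which
# iteration pairs (i, j) can execute concurrently under a set of
# loop-carried dependence distances. A distance d chains iterations
# i, i+d, i+2d, ..., which must run in order; iterations in different
# chains are concurrent. (Single-distance chains only -- a deliberate
# simplification of the paper's constraint sets.)

def concurrent_pairs(n, distances):
    return {(i, j) for i in range(n) for j in range(i + 1, n)
            if not any((j - i) % d == 0 for d in distances)}

n = 6
assumed_dep  = concurrent_pairs(n, {1})     # unresolved dependence taken as real
assumed_free = concurrent_pairs(n, set())   # same dependence assumed eliminated
print(len(assumed_dep), len(assumed_free))  # 0 15
```

Comparing the two sets shows exactly how much parallelism hinges on the unresolved dependence: assuming it real leaves nothing concurrent, while eliminating it makes every pair concurrent.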
Optimization within a Unified Transformation Framework
Programmers typically want to write scientific programs in a high level language with semantics based on a sequential execution model. To execute efficiently on a parallel machine, however, a program typically needs to contain explicit parallelism and possibly explicit communication and synchronization. So, we need compilers to convert programs from the first of these forms to the second. There are two basic choices to be made when parallelizing a program. First, the computations of the program need to be distributed amongst the set of available processors. Second, the computations on each processor need to be ordered. My contribution has been the development of simple mathematical abstractions for representing these choices and the development of new algorithms for making these choices. I have developed a new framework that achieves good performance by minimizing communication between processors, minimizing the time processors spend waiting for messages from other processors, and ordering data accesses so as to exploit the memory hierarchy. This framework can be used by optimizing compilers, as well as by interactive transformation tools. The state of the art for vectorizing compilers is already quite good, but much work remains to bring parallelizing compilers up to the same standard. The main contribution of my work can be summarized as improving this situation by replacing existing ad hoc parallelization techniques with a sound underlying foundation on which future work can be built.
(Also cross-referenced as UMIACS-TR-96-93)
- …