
    SPOT: A DSL for Extending Fortran Programs with Metaprogramming


    Automatic Parallelization of Affine Loops using Dependence and Cache analysis in a Binary Rewriter

    Today, nearly all general-purpose computers are parallel, but nearly all software running on them is serial. Bridging this disconnect by manually rewriting source code to run in parallel is prohibitively expensive. Automatic parallelization technology is therefore an attractive alternative. We present a method to perform automatic parallelization in a binary rewriter. The input to the binary rewriter is the serial binary executable program and the output is a parallel binary executable. The advantages of parallelization in a binary rewriter versus a compiler include (i) compatibility with all compilers and languages; (ii) high economic feasibility from avoiding repeated compiler implementation; (iii) applicability to legacy binaries; and (iv) applicability to assembly-language programs. Adapting existing parallelizing compiler methods that work on source code to work on binary programs instead is a significant challenge. This is primarily because the symbolic and array index information used in existing compiler parallelizers is not available in a binary. We show how to adapt existing parallelization methods to achieve equivalent parallelization from a binary without such information. We have also designed an affine cache reuse model that works inside a binary rewriter, building on the parallelization techniques. It quantifies cache reuse in terms of the number of cache lines that will be required when a loop dimension is considered for the innermost position in a loop nest. This cache metric can be used to reason about the affine code that results when affine code is transformed using affine transformations. Hence, it can be used to evaluate candidate transformation sequences to improve run-time directly from a binary.
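The cache metric described above can be illustrated with a minimal sketch. It is not the thesis's model: the function names, the 64-byte line size, the fixed iteration window, and the example strides are all assumptions chosen for illustration. It counts the distinct cache lines an affine access touches when each candidate loop dimension is placed innermost, and picks the cheapest one.

```python
def cache_lines_touched(base, stride, trips, line_size=64):
    """Count distinct cache lines hit by addresses base + i*stride for i in [0, trips)."""
    return len({(base + i * stride) // line_size for i in range(trips)})


def best_innermost(strides, window, line_size=64):
    """Pick the loop dimension that touches the fewest cache lines when innermost.

    strides: dict mapping a (hypothetical) loop name to the byte stride of the
             affine access along that loop.
    window:  number of consecutive iterations used to compare candidates.
    Returns (best_loop_name, per-loop cost dict).
    """
    costs = {
        loop: cache_lines_touched(0, stride, window, line_size)
        for loop, stride in strides.items()
    }
    return min(costs, key=costs.get), costs


# Assumed example: row-major array of 8-byte elements with 1024 columns,
# so loop j moves 8 bytes per iteration and loop i moves 8192 bytes.
best, costs = best_innermost({"i": 8192, "j": 8}, window=64)
```

Under these assumptions the unit-stride loop is the better innermost candidate, because eight consecutive elements share one cache line.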
Results using our x86 binary rewriter, called SecondWrite, on a suite of dense-matrix regular programs from the Polybench benchmark suite show a geomean speedup of 6.81X from binary and 8.9X from source with 8 threads compared to the input serial binary on an x86 Xeon E5530 machine, and 8.31X from binary and 9.86X from source with 24 threads compared to the input serial binary on an x86 E7450 machine. Such regular loops are an important component of scientific and multimedia workloads, and are even present to a limited extent in otherwise non-regular programs. Further in this thesis, we present a novel algorithm that significantly enhances past techniques for loops with unknown loop bounds by guessing the loop bounds using only the memory expressions present in a loop. It then inserts run-time checks to see whether these guesses were correct; if they were, the parallel version of the loop executes, otherwise the serial version executes. These techniques are applied to the large affine benchmarks in SPEC2006 and OMP2001 and, unlike previous methods, the speedups from binary are as good as from source. We also present results on the number of loops parallelized directly from a binary with and without this algorithm. Across the 8 affine benchmarks in these suites, the best existing binary parallelization method achieves a geomean speedup of 1.33X, whereas our method achieves a speedup of 2.96X. This is close to the speedup from source code of 2.8X.
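The pattern of guarding a parallel loop with a run-time check and falling back to the serial loop can be sketched as follows. This is only an illustration of loop versioning, not the thesis's algorithm: the check used here (disjoint read and write ranges) is an assumed stand-in for its loop-bound guesses, and all function names are hypothetical. Python threads are used only to show the control flow, not to demonstrate speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def shift_add_serial(buf, src, dst, n):
    """Reference loop: buf[dst + i] = buf[src + i] + 1 for i = 0..n-1, in order."""
    for i in range(n):
        buf[dst + i] = buf[src + i] + 1


def ranges_disjoint(src, dst, n):
    """Run-time check: read range [src, src+n) and write range [dst, dst+n) do not overlap."""
    return src + n <= dst or dst + n <= src


def shift_add_versioned(buf, src, dst, n, workers=4):
    """Run the chunked parallel version only when the check passes.

    Returns the name of the version that executed.
    """
    if n <= 0 or not ranges_disjoint(src, dst, n):
        shift_add_serial(buf, src, dst, n)  # conservative fallback
        return "serial"

    chunk = (n + workers - 1) // workers

    def work(start):
        # Each worker handles one contiguous block of iterations.
        for i in range(start, min(start + chunk, n)):
            buf[dst + i] = buf[src + i] + 1

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(work, range(0, n, chunk)))
    return "parallel"


independent = list(range(10))
ran_independent = shift_add_versioned(independent, 0, 5, 5)

overlapping = [0] * 10
ran_overlapping = shift_add_versioned(overlapping, 0, 1, 5)
```

In the second call the write range overlaps the read range, so each iteration depends on the previous one and the check correctly selects the serial version.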

    The PARSE Programming Paradigm. Part I: Software Development Methodology. Part II: Software Development Support Tools

    The programming methodology of PARSE (parallel software environment), a software environment being developed for reconfigurable non-shared memory parallel computers, is described. This environment will consist of an integrated collection of language interfaces, automatic and semi-automatic debugging and analysis tools, and an operating system, all of which are made more flexible by the use of a knowledge-based implementation for the tools that make up PARSE. The programming paradigm lets the user choose freely among three basic approaches/abstractions for programming a parallel machine: logic-based descriptive, sequential-control procedural, and parallel-control procedural programming. All of these result in efficient parallel execution. The current work discusses the methodology underlying PARSE, whereas the companion paper, “The PARSE Programming Paradigm - II: Software Development Support Tools,” details each of the component tools.

    Parallel Computers and Complex Systems

    We present an overview of the state of the art and future trends in high performance parallel and distributed computing, and discuss techniques for using such computers in the simulation of complex problems in computational science. The use of high performance parallel computers can help improve our understanding of complex systems, and the converse is also true: we can apply techniques used for the study of complex systems to improve our understanding of parallel computing. We consider parallel computing as the mapping of one complex system (typically a model of the world) into another complex system (the parallel computer). We study static, dynamic, spatial and temporal properties of both the complex systems and the map between them. The result is a better understanding of which computer architectures are good for which problems, and of software structure, automatic partitioning of data, and the performance of parallel machines.

    A design methodology for portable software on parallel computers

    This final report for research supported by grant number NAG-1-995 documents our progress in addressing two difficulties in parallel programming. The first difficulty is developing software that will execute quickly on a parallel computer. The second difficulty is transporting software between dissimilar parallel computers. In general, we expect that more hardware-specific information will be included in software designs for parallel computers than in designs for sequential computers. This inclusion is an instance of portability being sacrificed for high performance. New parallel computers are being introduced frequently. In trying to keep software on the current high performance hardware, a software developer almost continually faces yet another expensive software transportation. The problem addressed by the proposed research is to create a design methodology that helps designers control both portability and hardware-specific programming details more precisely. The proposed research emphasizes programming for scientific applications. We completed our study of the parallelizability of a subsystem of the NASA Earth Radiation Budget Experiment (ERBE) data processing system. This work is summarized in section two. A more detailed description is provided in Appendix A ('Programming Practices to Support Eventual Parallelism'). Mr. Chrisman, a graduate student, wrote and successfully defended a Ph.D. dissertation proposal which describes our research associated with the issues of software portability and high performance. The list of research tasks is specified in the proposal. The proposal 'A Design Methodology for Portable Software on Parallel Computers' is summarized in section three and is provided in its entirety in Appendix B. We are currently studying a proposed subsystem of the NASA Clouds and the Earth's Radiant Energy System (CERES) data processing system. This software is the proof-of-concept for the Ph.D. dissertation.
We have implemented and measured the performance of a portion of this subsystem on the Intel iPSC/2 parallel computer. These results are provided in section four. Our future work is summarized in section five, our acknowledgements are stated in section six, and references for published papers associated with NAG-1-995 are provided in section seven.

    Fortran refactoring for legacy systems

    The motivation for this work comes from a Global Climate Model (GCM) software package that was in great need of updating. This software was implemented by scientists in the ’80s as a result of meteorological research. Written in Fortran 77, the program has been used as an input to make climate predictions for the Southern Hemisphere. An execution that produces a complete numerical data set takes several days. The software was programmed using a sequential processing paradigm. Nowadays, with multicore processors so widespread, the time an execution takes to produce a complete, useful data set can be drastically reduced using this technology. As a first objective toward this reengineering goal, we must be able to understand the source code. An essential characteristic of Fortran code is that old source code versions become unreadable, incomprehensible, and sometimes “eject” the reader from the source code. We cannot modify, update, or improve unreadable source code. So, as a first step toward parallelizing this code, we must update it and make it readable and easy to understand. The GCM has a very complex internal structure. The program is divided into about 300 .f (Fortran 77) files. These files generally implement only one Fortran subroutine each. Less than 10% of the files are used for common blocks and constants. Approximately 25% of the lines in the source code are comments. The total number of Fortran source code lines is 58000. Detailed work within the source code brings to light that [74]: (1) about 230 routines are called/used at run time, and most of the runtime is spent in routines located at deep levels 5 to 7 in the dynamic call graph from the main routine; (2) the routine with most of the runtime (the top routine from now on) requires more than 9% of the total program runtime and is called about 315000 times; (3) the top 10 routines (the 10 routines at the top of the flat profile) require about 50% of total runtime.
Two of them are related to intrinsic Fortran functions. Our first approach was to use a scripting language and Find & Replace tools to upgrade the source code, but this kind of code manipulation does not guarantee preservation of software behavior. Our goal then became to develop an automated tool that transforms legacy software into more understandable, comprehensible, and readable code, applying refactoring as the main technique. At the same time, a catalog of transformations to be applied to Fortran code is needed as a guide for programmers through this process. Reviewed by: http://sedici.unlp.edu.ar/handle/10915/9703 Facultad de Informátic

    Developing and Measuring Parallel Rule-Based Systems in a Functional Programming Environment

    This thesis investigates the suitability of using functional programming for building parallel rule-based systems. A functional version of the well-known rule-based system OPS5 was implemented, and there is a discussion of the suitability of functional languages for both building compilers and manipulating state. Functional languages can be used to build compilers that reflect the structure of the original grammar of a language and are, therefore, very suitable. Particular attention is paid to the state requirements and the state manipulation structures of applications such as a rule-based system because, traditionally, functional languages have been considered unable to manipulate state. From the implementation work, issues have arisen that are important for functional programming as a whole. They are in the areas of algorithms and data structures and of development environments. There is a more general discussion of state and state manipulation in functional programs and how theoretical work, such as monads, can be used. Techniques are presented for interpreting descriptions of graph algorithms more abstractly in order to build functional graph algorithms. Beyond the scope of programming, there are issues relating both to the functional language's interaction with the operating system and to tools, such as debugging and measurement tools, which help programmers write efficient programs. In both of these areas functional systems are lacking. To address the complete lack of measurement tools for functional languages, a profiling technique was designed which can accurately measure the number of calls to a function, the time spent in a function, and the amount of heap space used by a function. From this design, a profiler was developed for higher-order, lazy, functional languages which allows the programmer to measure and verify the behaviour of a program.
This profiling technique is designed primarily for application programmers rather than functional language implementors, and the results presented by the profiler directly reflect the lexical scope of the original program rather than some run-time representation. Finally, there is a discussion of generally available techniques for parallelizing functional programs so that they can execute on a parallel machine. The techniques which are easier for the parallel systems builder to implement are shown to be least suitable for large functional applications. Those techniques that best suit functional programmers are not yet generally available and usable.

    Automatic matching of legacy code to heterogeneous APIs: An idiomatic approach

    Heterogeneous accelerators often disappoint. They offer the prospect of great performance, but only deliver it when using vendor-specific optimized libraries or domain-specific languages. This requires considerable modification of legacy code, hindering the adoption of heterogeneous computing. This paper develops a novel approach to automatically detect opportunities for accelerator exploitation. We focus on calculations that are well supported by established APIs: sparse and dense linear algebra, stencil codes, and generalized reductions and histograms. We call them idioms and use a custom constraint-based Idiom Description Language (IDL) to discover them within user code. Detected idioms are then mapped to BLAS libraries, cuSPARSE and clSPARSE, and two DSLs: Halide and Lift. We implemented the approach in LLVM and evaluated it on the NAS and Parboil sequential C/C++ benchmarks, where we detect 60 idiom instances. In those cases where idioms are a significant part of the sequential execution time, we generate code that achieves 1.26× to over 20× speedup on integrated and external GPUs.
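The idea of describing an idiom as a set of constraints and searching code for matches can be illustrated with a small sketch. It is not the paper's IDL and does not operate on LLVM IR: as an assumed stand-in, it uses Python's standard `ast` module to find a simple sum-reduction idiom in Python source, and the function name and the constraints chosen are hypothetical.

```python
import ast


def find_sum_reductions(source):
    """Return (accumulator_name, line_number) for each loop matching a sum reduction.

    Constraints checked (an illustrative subset only):
      1. the node is a `for` loop whose body is a single statement;
      2. that statement is `acc += expr` with a plain variable as target;
      3. `expr` does not itself read the accumulator.
    """
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not (isinstance(node, ast.For) and len(node.body) == 1):
            continue
        stmt = node.body[0]
        if not (isinstance(stmt, ast.AugAssign)
                and isinstance(stmt.op, ast.Add)
                and isinstance(stmt.target, ast.Name)):
            continue
        # Names read on the right-hand side of the update.
        reads = {n.id for n in ast.walk(stmt.value) if isinstance(n, ast.Name)}
        if stmt.target.id not in reads:
            hits.append((stmt.target.id, node.lineno))
    return hits


dot_product = "total = 0\nfor i in range(n):\n    total += a[i] * b[i]\n"
recurrence = "for i in range(1, n):\n    a[i] = a[i - 1] + 1\n"

matches = find_sum_reductions(dot_product)
non_matches = find_sum_reductions(recurrence)
```

A matched loop such as the dot product above is the kind of candidate that could then be handed to a library routine; the recurrence is rejected because it is not an accumulation into a scalar.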