39 research outputs found

    XARK: an extensible framework for automatic recognition of computational kernels

    Get PDF
    This is a post-peer-review, pre-copyedit version of an article published in ACM Transactions on Programming Languages and Systems. The final authenticated version is available online at: http://dx.doi.org/10.1145/1391956.1391959[Abstract] The recognition of program constructs that are frequently used by software developers is a powerful mechanism for optimizing and parallelizing compilers to improve the performance of the object code. The development of techniques for automatic recognition of computational kernels such as inductions, reductions and array recurrences has been an intensive research area in the scope of compiler technology during the 90's. This article presents a new compiler framework that, unlike previous techniques that focus on specific and isolated kernels, recognizes a comprehensive collection of computational kernels that appear frequently in full-scale real applications. The XARK compiler operates on top of the Gated Single Assignment (GSA) form of a high-level intermediate representation (IR) of the source code. Recognition is carried out through a demand-driven analysis of this high-level IR at two different levels. First, the dependences between the statements that compose the strongly connected components (SCCs) of the data-dependence graph of the GSA form are analyzed. As a result of this intra-SCC analysis, the computational kernels corresponding to the execution of the statements of the SCCs are recognized. Second, the dependences between statements of different SCCs are examined in order to recognize more complex kernels that result from combining simpler kernels in the same code. Overall, the XARK compiler builds a hierarchical representation of the source code as kernels and dependence relationships between those kernels. This article describes in detail the collection of computational kernels recognized by the XARK compiler. Besides, the internals of the recognition algorithms are presented. The design of the algorithms enables to extend the recognition capabilities of XARK to cope with new kernels, and provides an advanced symbolic analysis framework to run other compiler techniques on demand. Finally, extensive experiments showing the effectiveness of XARK for a collection of benchmarks from different application domains are presented. In particular, the SparsKit-II library for the manipulation of sparse matrices, the Perfect benchmarks, the SPEC CPU2000 collection and the PLTMG package for solving elliptic partial differential equations are analyzed in detail.Ministeiro de Educación y Ciencia; TIN2004-07797-C02Ministeiro de Educación y Ciencia; TIN2007-67537-C03Xunta de Galicia; PGIDIT05PXIC10504PNXunta de Galicia; PGIDIT06PXIB105228P

    On Extracting Course-Grained Function Parallelism from C Programs

    Get PDF
    To efficiently utilize the emerging heterogeneous multi-core architecture, it is essential to exploit the inherent coarse-grained parallelism in applications. In addition to data parallelism, applications like telecommunication, multimedia, and gaming can also benefit from the exploitation of coarse-grained function parallelism. To exploit coarse-grained function parallelism, the common wisdom is to rely on programmers to explicitly express the coarse-grained data-flow between coarse-grained functions using data-flow or streaming languages. This research is set to explore another approach to exploiting coarse-grained function parallelism, that is to rely on compiler to extract coarse-grained data-flow from imperative programs. We believe imperative languages and the von Neumann programming model will still be the dominating programming languages programming model in the future. This dissertation discusses the design and implementation of a memory data-flow analysis system which extracts coarse-grained data-flow from C programs. The memory data-flow analysis system partitions a C program into a hierarchy of program regions. It then traverses the program region hierarchy from bottom up, summarizing the exposed memory access patterns for each program region, meanwhile deriving a conservative producer-consumer relations between program regions. An ensuing top-down traversal of the program region hierarchy will refine the producer-consumer relations by pruning spurious relations. We built an in-lining based prototype of the memory data-flow analysis system on top of the IMPACT compiler infrastructure. We applied the prototype to analyze the memory data-flow of several MediaBench programs. The experiment results showed that while the prototype performed reasonably well for the tested programs, the in-lining based implementation may not efficient for larger programs. Also, there is still room in improving the effectiveness of the memory data-flow analysis system. We did root cause analysis for the inaccuracy in the memory data-flow analysis results, which provided us insights on how to improve the memory data-flow analysis system in the future

    Compilation techniques for irregular problems on parallel machines

    Get PDF
    Massively parallel computers have ushered in the era of teraflop computing. Even though large and powerful machines are being built, they are used by only a fraction of the computing community. The fundamental reason for this situation is that parallel machines are difficult to program. Development of compilers that automatically parallelize programs will greatly increase the use of these machines.;A large class of scientific problems can be categorized as irregular computations. In this class of computation, the data access patterns are known only at runtime, creating significant difficulties for a parallelizing compiler to generate efficient parallel codes. Some compilers with very limited abilities to parallelize simple irregular computations exist, but the methods used by these compilers fail for any non-trivial applications code.;This research presents development of compiler transformation techniques that can be used to effectively parallelize an important class of irregular programs. A central aim of these transformation techniques is to generate codes that aggressively prefetch data. Program slicing methods are used as a part of the code generation process. In this approach, a program written in a data-parallel language, such as HPF, is transformed so that it can be executed on a distributed memory machine. An efficient compiler runtime support system has been developed that performs data movement and software caching

    A compiler level intermediate representation based binary analysis system and its applications

    Get PDF
    Analyzing and optimizing programs from their executables has received a lot of attention recently in the research community. There has been a tremendous amount of activity in executable-level research targeting varied applications such as security vulnerability analysis, untrusted code analysis, malware analysis, program testing, and binary optimizations. The vision of this dissertation is to advance the field of static analysis of executables and bridge the gap between source-level analysis and executable analysis. The main thesis of this work is scalable static binary rewriting and analysis using compiler-level intermediate representation without relying on the presence of metadata information such as debug or symbolic information. In spite of a significant overlap in the overall goals of several source-code methods and executables-level techniques, several sophisticated transformations that are well-understood and implemented in source-level infrastructures have yet to become available in executable frameworks. It is a well known fact that a standalone executable without any meta data is less amenable to analysis than the source code. Nonetheless, we believe that one of the prime reasons behind the limitations of existing executable frameworks is that current executable frameworks define their own intermediate representations (IR) which are significantly more constrained than an IR used in a compiler. Intermediate representations used in existing binary frameworks lack high level features like abstract stack, variables, and symbols and are even machine dependent in some cases. This severely limits the application of well-understood compiler transformations to executables and necessitates new research to make them applicable. In the first part of this dissertation, we present techniques to convert the binaries to the same high-level intermediate representation that compilers use. We propose methods to segment the flat address space in an executable containing undifferentiated blocks of memory. We demonstrate the inadequacy of existing variable identification methods for their promotion to symbols and present our methods for symbol promotion. We also present methods to convert the physically addressed stack in an executable to an abstract stack. The proposed methods are practical since they do not employ symbolic, relocation, or debug information which are usually absent in deployed executables. We have integrated our techniques with a prototype x86 binary framework called \emph{SecondWrite} that uses LLVM as the IR. The robustness of the framework is demonstrated by handling executables totaling more than a million lines of source-code, including several real world programs. In the next part of this work, we demonstrate that several well-known source-level analysis frameworks such as symbolic analysis have limited effectiveness in the executable domain since executables typically lack higher-level semantics such as program variables. The IR should have a precise memory abstraction for an analysis to effectively reason about memory operations. Our first work of recovering a compiler-level representation addresses this limitation by recovering several higher-level semantics information from executables. In the next part of this work, we propose methods to handle the scenarios when such semantics cannot be recovered. First, we propose a hybrid static-dynamic mechanism for recovering a precise and correct memory model in executables in presence of executable-specific artifacts such as indirect control transfers. Next, the enhanced memory model is employed to define a novel symbolic analysis framework for executables that can perform the same types of program analysis as source-level tools. Frameworks hitherto fail to simultaneously maintain the properties of correct representation and precise memory model and ignore memory-allocated variables while defining symbolic analysis mechanisms. We exemplify that our framework is robust, efficient and it significantly improves the performance of various traditional analyses like global value numbering, alias analysis and dependence analysis for executables. Finally, the underlying representation and analysis framework is employed for two separate applications. First, the framework is extended to define a novel static analysis framework, \emph{DemandFlow}, for identifying information flow security violations in program executables. Unlike existing static vulnerability detection methods for executables, DemandFlow analyzes memory locations in addition to symbols, thus improving the precision of the analysis. DemandFlow proposes a novel demand-driven mechanism to identify and precisely analyze only those program locations and memory accesses which are relevant to a vulnerability, thus enhancing scalability. DemandFlow uncovers six previously undiscovered format string and directory traversal vulnerabilities in popular ftp and internet relay chat clients. Next, the framework is extended to implement a platform-specific optimization for embedded processors. Several embedded systems provide the facility of locking one or more lines in the cache. We devise the first method in literature that employs instruction cache locking as a mechanism for improving the average-case run-time of general embedded applications. We demonstrate that the optimal solution for instruction cache locking can be obtained in polynomial time. Since our scheme is implemented inside a binary framework, it successfully addresses the portability concern by enabling the implementation of cache locking at the time of deployment when all the details of the memory hierarchy are available

    Hybrid analysis of memory references and its application to automatic parallelization

    Get PDF
    Executing sequential code in parallel on a multithreaded machine has been an elusive goal of the academic and industrial research communities for many years. It has recently become more important due to the widespread introduction of multicores in PCs. Automatic multithreading has not been achieved because classic, static compiler analysis was not powerful enough and program behavior was found to be, in many cases, input dependent. Speculative thread level parallelization was a welcome avenue for advancing parallelization coverage but its performance was not always optimal due to the sometimes unnecessary overhead of checking every dynamic memory reference. In this dissertation we introduce a novel analysis technique, Hybrid Analysis, which unifies static and dynamic memory reference techniques into a seamless compiler framework which extracts almost maximum available parallelism from scientific codes and incurs close to the minimum necessary run time overhead. We present how to extract maximum information from the quantities that could not be sufficiently analyzed through static compiler methods, and how to generate sufficient conditions which, when evaluated dynamically, can validate optimizations. Our techniques have been fully implemented in the Polaris compiler and resulted in whole program speedups on a large number of industry standard benchmark applications

    ATMOL: A Domain-Specific Language for Atmospheric Modeling

    Get PDF
    This paper describes the design and implementation of ATMOL: a domain-specific language for the formulation and implementation of atmospheric models. ATMOL was developed in close collaboration with meteorologists at the Royal Netherlands Meteorological Institute (KNMI) to ensure ease of use, concise notation, and the adoptation of common notational conventions. ATMOL’s expressiveness allows the formulation of high-level and low-level model details as language constructs for problem refinement and code synthesis. The atmospheric models specified in ATMOL are translated into efficient numerical codes with CTADEL, a tool for symbolic manipulation and code synthesis

    ATMOL: A Domain-Specific Language for Atmospheric Modeling

    Get PDF
    This paper describes the design and implementation of ATMOL: a domain-specific language for the formulation and implementation of atmospheric models. ATMOL was developed in close collaboration with meteorologists at the Royal Netherlands Meteorological Institute (KNMI) to ensure ease of use, concise notation, and the adoptation of common notational conventions. ATMOL’s expressiveness allows the formulation of high-level and low-level model details as language constructs for problem refinement and code synthesis. The atmospheric models specified in ATMOL are translated into efficient numerical codes with CTADEL, a tool for symbolic manipulation and code synthesis

    Logical Inference Techniques for Loop Parallelization

    Get PDF
    This paper presents a fully automatic approach to loop parallelization that integrates the use of static and run-time analysis and thus overcomes many known difficulties such as nonlinear and indirect array indexing and complex control flow. Our hybrid analysis framework validates the parallelization transformation by verifying the independence of the loop’s memory references. To this end it represents array references using the USR (uniform set representation) language and expresses the independence condition as an equation, S = ∅, where S is a set expression representing array indexes. Using a language instead of an array-abstraction representation for S results in a smaller number of conservative approximations but exhibits a potentially-high runtime cost. To alleviate this cost we introduce a language translation F from the USR set-expression language to an equally rich language of predicates (F(S) ⇒ S = ∅). Loop parallelization is then validated using a novel logic inference algorithm that factorizes the obtained complex predicates (F(S)) into a sequence of sufficient-independence conditions that are evaluated first statically and, when needed, dynamically, in increasing order of their estimated complexities. We evaluate our automated solution on 26 benchmarks from PERFECT-CLUB and SPEC suites and show that our approach is effective in parallelizing large, complex loops and obtains much better full program speedups than the Intel and IBM Fortran compilers

    An integrated soft- and hard-programmable multithreaded architecture

    Get PDF
    corecore