    A type-based prototype compiler for telescoping languages

    Scientists want to encode their applications in domain languages with high-level operators that reflect the way they conceptualize computations in their domains. The telescoping-languages strategy calls for automatically generating optimizing compilers for these languages by pre-compiling the underlying libraries that define them, producing multiple variants optimized for different possible contexts, including different argument types. The resulting compiler replaces calls to the high-level constructs with calls to the optimized variants. This approach aims to automatically derive high-performance executables from programs written in high-level domain-specific languages. TeleGen is a prototype telescoping-languages compiler that performs type-based specializations. For the purposes of this dissertation, types include any set of variable properties such as intrinsic type, size, and array sparsity pattern. Type inference and specialization are cornerstones of the telescoping-languages strategy. Because optimization of library routines must occur before their full calling contexts are available, type inference gives critical information needed to determine which specialized variants to generate as well as how best to optimize each variant for performance. To build the prototype compiler, we developed a precise type-inference algorithm that infers all legal type tuples, or type configurations, for the program variables, including routine arguments, for all legal calling contexts. We use the type information inferred by our algorithm to drive specialization and optimization. We demonstrate the practical value of our type-inference algorithm and the type-based specialization strategy in TeleGen.
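
The idea of replacing a call to a high-level routine with a call to a variant specialized for a type configuration can be sketched as follows. This is only an illustration, not TeleGen's implementation: the routine `scale`, the labels returned by `describe`, and the `VARIANTS` table are all hypothetical, and the variant is selected at run time here, whereas a telescoping-languages compiler would make the choice at compile time from inferred types.

```python
from typing import Callable, Dict, Tuple


def describe(value) -> str:
    """Classify a value into a coarse, illustrative 'type' label."""
    if isinstance(value, dict):
        return "sparse_vector"   # index -> nonzero value
    if isinstance(value, list):
        return "dense_vector"
    return "scalar"


def scale_dense(x, alpha):
    """Variant specialized for dense vectors."""
    return [alpha * v for v in x]


def scale_sparse(x, alpha):
    """Variant specialized for sparse vectors: touches only stored entries."""
    return {i: alpha * v for i, v in x.items()}


def scale_scalar(x, alpha):
    """Variant specialized for scalars."""
    return alpha * x


# Hypothetical table of pre-compiled variants, keyed by type configuration
# (one label per argument).
VARIANTS: Dict[Tuple[str, str], Callable] = {
    ("dense_vector", "scalar"): scale_dense,
    ("sparse_vector", "scalar"): scale_sparse,
    ("scalar", "scalar"): scale_scalar,
}


def scale(x, alpha):
    """Generic entry point: dispatch to the variant for this configuration."""
    config = (describe(x), describe(alpha))
    try:
        variant = VARIANTS[config]
    except KeyError:
        raise TypeError(f"no variant for type configuration {config}") from None
    return variant(x, alpha)
```

A type configuration that was not anticipated raises an error in this sketch; a real system would presumably fall back to a generic variant.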


    A long-standing challenge in High-Performance Computing (HPC) is the simultaneous achievement of programmer productivity and hardware computational efficiency. The challenge has been exacerbated by the onset of multi- and many-core CPUs and accelerators. Only a few expert programmers have been able to hand-code domain-specific data transformations and vectorization schemes needed to extract the best possible performance on such architectures. In this research, we examined the possibility of automating these methods by developing a Domain-Specific Language (DSL) framework. Our DSL approach extends C++14 by embedding into it a high-level data-parallel array language, and by using a domain-specific compiler to compile to hybrid-parallel code. We also implemented an array index-space transformation algebra within this high-level array language to manipulate array data-layouts and data-distributions. The compiler introduces a novel method for SIMD auto-vectorization based on array data-layouts. Our new auto-vectorization technique is shown to outperform the default auto-vectorization strategy by up to 40% for stencil computations. The compiler also automates distributed data movement with overlapping of local compute with remote data movement using polyhedral integer set analysis. Along with these main innovations, we developed a new technique using C++ template metaprogramming for developing embedded DSLs using C++. We also proposed a domain-specific compiler intermediate representation that simplifies data flow analysis of abstract DSL constructs. We evaluated our framework by constructing a DSL for the HPC grand-challenge domain of lattice quantum chromodynamics. Our DSL yielded performance gains of up to twice the flop rate over existing production C code for selected kernels. This gain in performance was obtained while using less than one-tenth the lines of code. 
The performance of this DSL was also competitive with the best hand-optimized and hand-vectorized code, and was an order of magnitude better than that of existing production DSLs.
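
One way to picture an index-space transformation that changes an array's data layout for SIMD is the lane-interleaved mapping below. It is a minimal sketch under stated assumptions, not the transformation algebra of the dissertation: the function name `lane_interleaved_layout` and the choice of a one-dimensional array split into `width` equal blocks are hypothetical. Each block is assigned to one SIMD lane, so logically adjacent elements within a block sit exactly `width` storage slots apart, in the same lane of consecutive vectors.

```python
def lane_interleaved_layout(n: int, width: int):
    """Return a function mapping logical index -> storage offset.

    The logical range [0, n) is cut into `width` contiguous blocks of
    n // width elements each. Block b is stored in lane b, so storage
    vector k holds element k of every block.
    """
    if width <= 0 or n % width != 0:
        raise ValueError("n must be a positive multiple of width")
    block = n // width

    def offset(i: int) -> int:
        if not 0 <= i < n:
            raise IndexError(i)
        lane = i // block        # which block, hence which SIMD lane
        vector = i % block       # position inside the block
        return vector * width + lane

    return offset


# Example: 8 elements, 4 lanes. Logical neighbours 0 and 1 land in the same
# lane (lane 0) of consecutive vectors, i.e. `width` slots apart.
layout = lane_interleaved_layout(8, 4)
storage_order = [layout(i) for i in range(8)]
```

With this layout, a nearest-neighbour stencil inside a block reads the same lane of the next vector, so no cross-lane shuffles are needed away from block boundaries.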

    Array optimizations for high productivity programming languages

    While the HPCS languages (Chapel, Fortress and X10) have introduced improvements in programmer productivity, several challenges still remain in delivering high performance. In the absence of optimization, the high-level language constructs that improve productivity can result in order-of-magnitude runtime performance degradations. This dissertation addresses the problem of efficient code generation for high-level array accesses in the X10 language. The X10 language supports rank-independent specification of loop and array computations using regions and points. Three aspects of high-level array accesses in X10 are important for productivity but also pose significant performance challenges: high-level accesses are performed through Point objects rather than integer indices, variables containing references to arrays are rank-independent, and array subscripts are verified as legal array indices during runtime program execution. Our solution to the first challenge is to introduce new analyses and transformations that enable automatic inlining and scalar replacement of Point objects. Our solution to the second challenge is a hybrid approach. We use an interprocedural rank analysis algorithm to automatically infer ranks of arrays in X10. We use rank analysis information to enable storage transformations on arrays. If rank-independent array references still remain after compiler analysis, the programmer can use X10's dependent type system to safely annotate array variable declarations with additional information for the rank and region of the variable, and to enable the compiler to generate efficient code in cases where the dependent type information is available. Our solution to the third challenge is to use a new interprocedural array bounds analysis approach using regions to automatically determine when runtime bounds checks are not needed. 
Our results show that programs optimized in this way rival hand-tuned code with explicit rank-specific loops and lower-level array accesses, and run up to two orders of magnitude faster than unoptimized, high-level X10 programs. These optimizations also improve the scalability of X10 programs as the number of CPUs increases. While we perform the optimizations primarily in X10, these techniques are applicable to other high-productivity languages such as Chapel and Fortress.
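
The role regions play in removing runtime bounds checks can be sketched as a containment test: if the region a loop iterates over is known to lie inside the region an array is defined on, every access indexed by the loop's points is in bounds. The sketch below assumes rectangular regions with inclusive bounds; the `Region` class and `bounds_checks_needed` function are hypothetical names for illustration, not X10's API or the dissertation's interprocedural analysis.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Region:
    """Rectangular index region: inclusive (low, high) bounds per dimension."""
    bounds: Tuple[Tuple[int, int], ...]

    @property
    def rank(self) -> int:
        return len(self.bounds)

    def contains(self, other: "Region") -> bool:
        """True if every point of `other` is also a point of this region."""
        if self.rank != other.rank:
            return False
        return all(
            lo <= other_lo and other_hi <= hi
            for (lo, hi), (other_lo, other_hi) in zip(self.bounds, other.bounds)
        )


def bounds_checks_needed(array_region: Region, loop_region: Region) -> bool:
    """Checks can be dropped when the loop region is inside the array region."""
    return not array_region.contains(loop_region)


array_region = Region(((0, 9), (0, 9)))      # a 10 x 10 array
interior = Region(((1, 8), (1, 8)))          # loop over the interior points
overrun = Region(((0, 10), (0, 9)))          # loop that steps past row 9
```

Here `bounds_checks_needed(array_region, interior)` is false, so accesses in that loop could be compiled without checks, while the `overrun` loop would keep them.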

    ISCR Annual Report: Fiscal Year 2004

    Compilation and Code Optimization for Data Analytics

    The trade-offs between the use of modern high-level and low-level programming languages in constructing complex software artifacts are well known. High-level languages allow for greater programmer productivity: abstraction and genericity allow the same functionality to be implemented with significantly less code than in low-level languages. Modularity, object-orientation, functional programming, and powerful type systems allow programmers not only to create clean abstractions and protect them from leaking, but also to define code units that are reusable and easily composable, and software architectures that are adaptable and extensible. The abstraction, succinctness, and modularity of high-level code help to avoid software bugs and facilitate debugging and maintenance. However, the use of high-level languages comes at a performance cost: increased indirection due to abstraction, virtualization, and interpretation, and superfluous work, particularly in the form of temporary memory allocation and deallocation to support objects and encapsulation. As a result, the cost of high-level languages for performance-critical systems may seem prohibitive. The vision of abstraction without regret argues that it is possible to use high-level languages for building performance-critical systems that allow for both productivity and high performance, instead of trading off the former for the latter. In this thesis, we realize this vision for building different types of data analytics systems. Our means of achieving this is compilation. The goal is to compile away expensive language features, that is, to compile high-level code down to efficient low-level code.

    Eine generische komponentenbasierte Software-Architektur zur Simulation probabilistischer Modelle

    Uncertain behaviour is inherent in many physical phenomena. This uncertainty has to be quantified for a computer-aided simulation to yield a meaningful prediction. A stochastic description of the uncertainty turns a physical phenomenon into a probabilistic model, which is usually solved by numerical schemes. The present thesis discusses and develops models for challenging uncertain physical phenomena, efficient numerical schemes for uncertainty quantification (UQ), and a sustainable and efficient software implementation. Probabilistic models are often described by stochastic partial differential equations (SPDEs). The stochastic Galerkin method represents the solution of an SPDE by a finite set of stochastic basis polynomials. A problem-independent choice of basis polynomials typically limits the application to relatively small maximum polynomial degrees. Moreover, many coefficients have to be computed and stored. In this thesis, new error-controlled low-rank schemes are presented that additionally select the relevant basis polynomials, thereby addressing both of these problems. The complexity of UQ is also reflected in the software implementation. A sustainable implementation relies on the reuse of software. Here, a software architecture for the simulation of probabilistic models is presented that is based on distributed generic components. Many of these components are reused in different frameworks (and may also be used outside UQ). They can be instantiated many times in a distributed system and exchanged at runtime while preserving their generic character. Probabilistic models are derived and simulated in this thesis, for instance to describe uncertainties in a composite material and in an aircraft design. These models involve, among other things, several hundred stochastic dimensions or long simulation runtimes.

    Compiling Parallel MATLAB for General Distributions using Telescoping Languages

    Matlab is one of the most popular computer languages for technical and scientific programming. However, until recently, it has been limited to running on uniprocessors. One strategy for overcoming this limitation is to introduce global distributed arrays, with those arrays distributed across the processors of a parallel machine. In this paper, we describe the compilation technology we have designed for Matlab D, a distributed-array extension of Matlab. Our approach is distinguished by a two-phase compilation technology with support for a rich collection of data distributions. Precompiling array operations and communication steps into Fortran plus MPI significantly reduces the time to compile an application that uses those operations. This paper includes preliminary results demonstrating that this approach can dramatically improve performance, scaling well to at least 32 processors.
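
To make the notion of a distributed array concrete, the sketch below computes which processor owns each element under a simple one-dimensional block distribution. It is an assumption-laden illustration only: the abstract mentions a rich collection of distributions without listing them, and the function names `block_bounds` and `owner` are hypothetical rather than part of Matlab D.

```python
from typing import Tuple


def block_bounds(n: int, p: int, rank: int) -> Tuple[int, int]:
    """Half-open global index range [start, stop) owned by processor `rank`.

    The n elements are split into p contiguous blocks; when p does not
    divide n, the first n % p processors each hold one extra element.
    """
    if not 0 <= rank < p:
        raise ValueError("rank out of range")
    base, extra = divmod(n, p)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def owner(n: int, p: int, i: int) -> int:
    """Processor that owns global index i under the same distribution."""
    if not 0 <= i < n:
        raise IndexError(i)
    base, extra = divmod(n, p)
    cutoff = extra * (base + 1)      # indices below this lie in larger blocks
    if i < cutoff:
        return i // (base + 1)
    return extra + (i - cutoff) // base


# Example: 10 elements over 4 processors.
ranges = [block_bounds(10, 4, r) for r in range(4)]
```

A compiler that knows this mapping can decide, for each array operation, which elements are local and which require communication.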

