
    On the fly type specialization without type analysis

    Dynamically typed programming languages such as JavaScript and Python defer type checking to run time. In order to maximize performance, dynamic language virtual machine implementations must attempt to eliminate redundant dynamic type checks. This is typically done using type inference analysis. However, type inference analyses are often costly and involve tradeoffs between compilation time and resulting precision. This has led to the creation of increasingly complex multi-tiered VM architectures. We introduce lazy basic block versioning, a simple just-in-time compilation technique which effectively removes redundant type checks from critical code paths. This novel approach lazily generates type-specialized versions of basic blocks on the fly while propagating context-dependent type information. Our technique does not require the use of costly program analyses, is not restricted by the precision limitations of traditional type analyses, and avoids the implementation complexity of speculative optimization techniques. Three extensions are made to the basic block versioning technique in order to give it interprocedural optimization capabilities. Typed object shapes give it the ability to attach type information to object properties and global variables. Entry point specialization allows it to pass type information from callers to callees, and call continuation specialization makes it possible to pass return value type information back to callers without dynamic overhead. We empirically demonstrate that these extensions enable basic block versioning to exceed the capabilities of static whole-program type analyses.
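    As an illustration of the core mechanism, here is a minimal sketch of intraprocedural lazy basic block versioning over a toy IR. Everything here (Instr, Block, specialize, the 'check_number' test) is an invented simplification, not the thesis implementation: a type test is dropped when the incoming context already proves the type, and the accumulated context is threaded toward the successor block, which a real VM would only compile on first execution.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    op: str            # 'check_number', 'add', ...
    args: tuple = ()

@dataclass
class Block:
    name: str
    instrs: list
    succ: str = None   # single fall-through successor, for simplicity

versions = {}          # (block name, entry type context) -> specialized code

def specialize(blocks, name, ctx):
    """Return a version of block `name` specialized for type context `ctx`.

    `ctx` is a frozenset of (variable, 'number') facts known on entry.
    Redundant 'check_number' tests are dropped; surviving tests extend the
    context that flows to the successor.
    """
    key = (name, ctx)
    if key in versions:
        return versions[key]            # reuse an existing version
    known, code = dict(ctx), []
    for ins in blocks[name].instrs:
        if ins.op == 'check_number':
            v, = ins.args
            if known.get(v) == 'number':
                continue                # type already proven: test removed
            code.append(ins)            # keep the first test...
            known[v] = 'number'         # ...and record what it establishes
        else:
            code.append(ins)
            if ins.op == 'add':         # numeric add produces a number
                known[ins.args[0]] = 'number'
    succ = blocks[name].succ
    if succ is not None:
        # A real VM emits a stub here and only generates the successor's
        # version, under the accumulated context, when first executed.
        code.append(('stub', succ, frozenset(known.items())))
    versions[key] = code
    return code

blocks = {
    'b0': Block('b0', [Instr('check_number', ('x',)), Instr('add', ('x',))], 'b1'),
    'b1': Block('b1', [Instr('check_number', ('x',)), Instr('add', ('x',))]),
}
v0 = specialize(blocks, 'b0', frozenset())
# Taking the stub: b1's version under {x: number} contains no type test.
v1 = specialize(blocks, 'b1', frozenset({('x', 'number')}))
assert all(i.op != 'check_number' for i in v1 if isinstance(i, Instr))
```

    Because each distinct entry context yields its own version, a block reached both with and without known types simply gets two versions; this is where the technique trades code size for eliminated checks.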

    Scaling finite difference methods in large eddy simulation of jet engine noise to the petascale: numerical methods and their efficient and automated implementation

    Reduction of jet engine noise has recently become a new arena of competition between aircraft manufacturers. As a relatively new field of research in computational fluid dynamics (CFD), computational aeroacoustics (CAA) prediction of jet engine noise based on large eddy simulation (LES) is a robust and accurate tool that complements the existing theoretical and experimental approaches. In order to satisfy the stringent requirements of CAA on numerical accuracy, finite difference methods in LES-based jet engine noise prediction rely on the implicitly formulated compact spatial partial differentiation and spatial filtering schemes, a crucial component of which is an embedded solver for tridiagonal linear systems spatially oriented along the three coordinate directions of the computational space. Traditionally, researchers and engineers in CAA have employed manually crafted implementations of solvers including the transposition method, the multiblock method and the Schur complement method. Algorithmically, these solvers force a trade-off between numerical accuracy and parallel scalability. Programming-wise, implementing them for each of the three coordinate directions is tediously repetitive and error-prone.
    In this study, we attempt to tackle both of these challenges. We first describe an accurate and scalable tridiagonal linear system solver as a specialization of the truncated SPIKE algorithm, along with strategies for efficient implementation of the compact spatial partial differentiation and spatial filtering schemes. We then elaborate on two programming models tailored for composing regular grid-based numerical applications, including finite difference-based LES of jet engine noise: one based on generalized elemental subroutines and the other based on functional array programming, together with the accompanying code optimization and generation methodologies. Through empirical experiments, we demonstrate that truncated SPIKE-based spatial partial differentiation and spatial filtering deliver the theoretically promised optimal scalability under weak scaling conditions and can be implemented using the two programming models with performance on par with handwritten code while significantly reducing the required programming effort.
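    For orientation, the sequential kernel at the heart of any such partitioned tridiagonal solver is small. The sketch below shows the classic Thomas algorithm in Python/NumPy; truncated SPIKE partitions the system across processes and couples the partitions through a small reduced system, none of which is shown here.

```python
import numpy as np

def thomas(a, b, c, d):
    """Solve a tridiagonal system Ax = d in O(n): forward elimination
    followed by back substitution.

    a: sub-diagonal (length n, a[0] unused)
    b: main diagonal (length n)
    c: super-diagonal (length n, c[-1] unused)
    d: right-hand side (length n)
    """
    n = len(d)
    cp, dp = np.empty(n), np.empty(n)
    cp[0], dp[0] = c[0] / b[0], d[0] / b[0]
    for i in range(1, n):                      # forward elimination
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = np.empty(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):             # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# Quick check against a dense build of the same system:
n = 5
a, b, c, d = np.full(n, -1.0), np.full(n, 2.0), np.full(n, -1.0), np.ones(n)
A = np.diag(b) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)
assert np.allclose(A @ thomas(a, b, c, d), d)
```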

    SUDS: automatic parallelization for raw processors

    Thesis (Ph.D.), Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2003. Includes bibliographical references (p. 177-181).
    A computer can never be too fast or too cheap. Computer systems pervade nearly every aspect of science, engineering, communications and commerce because they perform certain tasks at rates unachievable by any other kind of system built by humans. A computer system's throughput, however, is constrained by that system's ability to find concurrency. Given a particular target work load, the computer architect's role is to design mechanisms to find and exploit the available concurrency in that work load. This thesis describes SUDS (Software Un-Do System), a compiler and runtime system that can automatically find and exploit the available concurrency of scalar operations in imperative programs with arbitrary unstructured and unpredictable control flow. The core compiler transformation that enables this is scalar queue conversion. Scalar queue conversion makes scalar renaming an explicit operation through a process similar to closure conversion, a technique traditionally used to compile functional languages. The scalar queue conversion compiler transformation is speculative, in the sense that it may introduce dynamic memory allocation operations into code that would not otherwise dynamically allocate memory. Thus, SUDS also includes a transactional runtime system that periodically checkpoints machine state, executes code speculatively, checks if the speculative execution produced results consistent with the original sequential program semantics, and then either commits or rolls back the speculative execution path. In addition to safely running scalar queue converted code, the SUDS runtime system safely permits threads to speculatively run in parallel and concurrently issue memory operations, even when the compiler is unable to prove that the reordered memory operations will always produce correct results. Using this combination of compile-time and runtime techniques, SUDS can find concurrency in programs where previous compiler-based renaming techniques fail because the programs contain unstructured loops, and where Tomasulo's algorithm fails because it sequentializes mispredicted branches. Indeed, we describe three application programs, with unstructured control flow, where the prototype SUDS system, running in software on a Raw microprocessor, achieves speedups equivalent to, or better than, an idealized, and unrealizable, model of a hardware implementation of Tomasulo's algorithm.
    by Matthew Ian Frank, Ph.D.
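    A toy Python sketch of the renaming idea (not the SUDS transformation itself): capturing the scalar's current value in a closure at each definition point makes the renaming explicit, so the queued uses no longer contend for a single shared location and could, in principle, run on other threads.

```python
from collections import deque

def original(data):
    out, s = [], 0
    for v in data:
        if v % 2:                  # unpredictable control flow
            s = v * v              # (re)defines the scalar
        out.append(s)              # uses the latest definition
    return out

def queue_converted(data):
    out, work, s = [], deque(), 0
    for v in data:
        if v % 2:
            s = v * v
        # Explicit renaming: the closure captures the *current* value of
        # s, so each queued use is independent of the shared scalar. SUDS
        # would execute such units speculatively in parallel, with the
        # transactional runtime ready to roll back on a violation.
        work.append(lambda s=s: out.append(s))
    while work:                    # drained serially in this sketch
        work.popleft()()
    return out

assert original([1, 2, 3, 4]) == queue_converted([1, 2, 3, 4])
```

    Note how the closures are exactly the dynamic memory allocation the abstract warns about: renaming that a register allocator would do implicitly is now heap-allocated state, which is why the transformation is treated as speculative.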

    Low-Level Haskell Code: Measurements and Optimization Techniques

    Haskell is a lazy functional language with a strong static type system and excellent support for parallel programming. The language features of Haskell make it easier to write correct and maintainable programs, but execution speed often suffers from the high levels of abstraction. While much past research focuses on high-level optimizations that take advantage of the functional properties of Haskell, relatively little attention has been paid to the optimization opportunities in the low-level imperative code generated during translation to machine code. One problem with current low-level optimizations is that their effectiveness is limited by the obscured control flow caused by Haskell's high-level abstractions. My thesis is that trace-based optimization techniques can be used to improve the effectiveness of low-level optimizations for Haskell programs. I claim three unique contributions in this work. The first contribution is to expose some properties of low-level Haskell code by looking at the mix of operations performed by the selected benchmarks and comparing it to the low-level code generated from traditional programming languages. The low-level measurements reveal that the control flow is obscured by indirect jumps caused by the implementation of lazy evaluation, higher-order functions, and the separately managed stacks used by Haskell programs. My second contribution is a study on the effectiveness of a dynamic binary trace-based optimizer running on Haskell programs. My results show that while viable program traces frequently occur in Haskell programs, the overhead associated with maintaining the traces in a dynamic optimization system outweighs the benefits we get from running the traces. To reduce the runtime overheads, I explore a way to find traces in a separate profiling step. My final contribution is to build and evaluate a static trace-based optimizer for Haskell programs. The static optimizer uses profiling data to find traces in a Haskell program and then restructures the code around the traces to increase the scope available to the low-level optimizer. My results show that we can successfully build traces in Haskell programs, and the optimized code yields a speedup over existing low-level optimizers of up to 86% with an average speedup of 5% across 32 benchmarks.
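    As a sketch of what such a separate profiling step might look like (the representation here is hypothetical, not the dissertation's toolchain): treat an execution as a stream of basic-block labels and harvest frequent fixed-length windows as candidate traces for the static optimizer to straighten.

```python
from collections import Counter

def hot_traces(block_stream, length=3, threshold=10):
    """Count every `length`-block window of the execution; keep hot ones.

    Lazy evaluation and higher-order functions make the static control
    flow opaque (indirect jumps), so the paths actually taken are
    recovered from an execution profile such as `block_stream`.
    """
    windows = Counter(
        tuple(block_stream[i:i + length])
        for i in range(len(block_stream) - length + 1)
    )
    return [t for t, n in windows.items() if n >= threshold]

# E.g. a loop repeatedly bouncing through an evaluation stub: every
# rotation of the loop body shows up as a hot window / candidate trace.
stream = ['entry'] + ['eval_thunk', 'apply', 'loop_body'] * 20 + ['exit']
print(hot_traces(stream))
```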

    Adaptive memory hierarchies for next generation tiled microarchitectures

    Processor performance and memory performance have improved at different rates during the last decades, limiting processor performance and creating the well-known "memory gap". Bridging this performance difference is an important research field, and new solutions must be proposed in order to have better processors in the future. Several solutions exist, such as caches, which reduce the impact of longer memory accesses and make up the system memory hierarchy. However, most of the existing memory hierarchy organizations were designed for single processors or traditional multiprocessors. Nowadays, the increasing number of available transistors has allowed the emergence of chip multiprocessors, which have different constraints and require new ad-hoc memory systems able to efficiently manage memory resources. Therefore, in this thesis we have focused on improving the performance and energy efficiency of the memory hierarchy of chip multiprocessors, ranging from caches to DRAM memories.
    In the first part of this thesis we have studied traditional cache organizations such as shared or private caches, and we have seen that they behave well only for some applications and that an adaptive system would be desirable. State-of-the-art techniques such as Cooperative Caching (CC) take advantage of the benefits of both worlds. This technique, however, requires a centralized coherence structure and has a high energy consumption. We therefore propose Distributed Cooperative Caching (DCC), a mechanism that provides coherence to chip multiprocessors and applies the concept of cooperative caching in a distributed way. Through the usage of distributed directories we obtain a more scalable solution that, in addition, has a more flexible and energy-efficient tag allocation method.
    We also show that applications make different uses of the cache and that an efficient allocation can take advantage of unused resources. We propose Elastic Cooperative Caching (ElasticCC), an adaptive cache organization able to redistribute cache resources dynamically depending on application requirements. One of the most important contributions of this technique is that adaptivity is fully managed by hardware and that all repartitioning mechanisms are based on distributed structures, allowing better scalability. ElasticCC is able not only to repartition cache sizes according to application requirements, but also to adapt dynamically to the different execution phases of each thread. Our experimental evaluation has also shown that the cache partitioning provided by ElasticCC is so efficient that it almost matches the off-chip miss rate of a configuration that doubles the cache space.
    Finally, we focus on the behavior of DRAM memories and memory controllers in chip multiprocessors. Although traditional memory schedulers work well for uniprocessors, we show that new access patterns advocate a redesign of some parts of DRAM memories. Several organizations exist for multiprocessor DRAM schedulers; however, all of them must trade off between memory throughput and fairness. We propose Thread Row Buffers (TRBs), an extended storage area in DRAM memories able to store a data row for each thread. This mechanism enables fair memory access scheduling without hurting memory throughput.
    Overall, in this thesis we present new organizations for the memory hierarchy of chip multiprocessors which focus on the scalability of the proposed structures and on adaptivity to application behavior. Results show that the presented techniques provide better performance and energy efficiency than existing state-of-the-art solutions.
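    To give the flavor of hardware-managed repartitioning (a hypothetical heuristic sketched in Python, not the ElasticCC mechanism itself): at each epoch, per-thread utility estimates, such as per-way hit counters, decide whether a single cache way should migrate from one thread to another.

```python
def repartition(alloc, gain_per_way, loss_per_way):
    """One epoch of way-repartitioning between threads sharing a cache.

    alloc: ways currently owned by each thread.
    gain_per_way[t]: estimated extra hits if t receives one more way.
    loss_per_way[t]: estimated hits lost if t gives one way up.
    """
    donor = min(alloc, key=lambda t: loss_per_way[t] if alloc[t] > 1 else float('inf'))
    taker = max(alloc, key=lambda t: gain_per_way[t])
    if taker != donor and gain_per_way[taker] > loss_per_way[donor]:
        alloc = dict(alloc)        # move a single way per epoch
        alloc[donor] -= 1
        alloc[taker] += 1
    return alloc

# Two threads sharing 8 ways; A is cache-hungry, B barely uses its share:
print(repartition({'A': 4, 'B': 4},
                  gain_per_way={'A': 120, 'B': 15},
                  loss_per_way={'A': 90, 'B': 10}))   # -> {'A': 5, 'B': 3}
```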

    July 2018 news releases


    Compiling Scala for Performance

    Scala is a new programming language bringing together object-oriented and functional programming. Its defining features are uniformity and extensibility. Scala offers great flexibility for programmers, allowing them to grow the language through libraries. Oftentimes what seems like a language feature is in fact implemented in a library, effectively giving programmers the power of language designers. The downside of this flexibility is that familiar-looking code may hide unexpected performance costs. It is important for Scala compilers to bring down this cost as much as possible. We identify several areas of impact for Scala performance: higher-order functions and closures, and generic containers used with primitive types. We present two complementary approaches for improving performance in these areas: optimizations and specialization. Compiler optimization can bring down the cost through a combination of aggressive inlining of higher-order functions, an extended version of copy propagation, and dead-code elimination. Both anonymous functions and boxing can be eliminated by this approach. We show on a number of benchmarks that these language features can be up to 5 times faster when properly optimized, on current-day JVMs. We propose a new approach to compiling parametric polymorphism for performance at primitive types. We mix a homogeneous translation scheme with user-directed specialization for primitive types. Type parameters may be annotated to require specialization of code depending on them. We propose definition-site specialization for primitive types, achieving separate compilation and no boxing when both the definition and call site are specialized. Specialized classes are compatible with unspecialized code, and specialization-agnostic code can work with specialized instances, meaning that specialization is opportunistic. We present a formalism of a small subset of Scala with specialization and prove that specialization preserves types. We implemented this translation in the Scala compiler and report on improvements on a set of benchmarks, showing that specialization can make programs more than two times faster.
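    The opportunistic dispatch described above can be sketched in a language-neutral way (here in Python rather than Scala, purely for illustration; Box, Box_int and make_box are invented names): a generic class has one homogeneous implementation plus generated specializations, and construction picks a specialization when one exists for the element type.

```python
class Box:                         # homogeneous ("boxed") generic version
    def __init__(self, value):
        self.value = value
    def get(self):
        return self.value

class Box_int(Box):                # generated specialization for int
    def __init__(self, value):
        self.ivalue = value        # dedicated storage; no generic boxing
    def get(self):
        return self.ivalue

_SPECIALIZED = {int: Box_int}      # definition-site specializations

def make_box(value):
    """Pick a specialized variant when one exists, else the generic code."""
    return _SPECIALIZED.get(type(value), Box)(value)

b1 = make_box(42)                  # specialized path
b2 = make_box("hello")             # generic fallback
# Specialized instances remain usable by specialization-agnostic code:
assert isinstance(b1, Box) and b1.get() == 42 and b2.get() == "hello"
```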

    Design and optimisation of scientific programs in a categorical language

    This thesis presents an investigation into the use of advanced computer languages for scientific computing, an examination of performance issues that arise from using such languages for such a task, and a step toward achieving portable performance from compilers by attacking these problems in a way that compensates for the complexity of and differences between modern computer architectures. The language employed is Aldor, a functional language from computer algebra, and the scientific computing area is a subset of the family of iterative linear equation solvers applied to sparse systems. The linear equation solvers that are considered have much common structure, and this is factored out and represented explicitly in the language as a framework, by means of categories and domains. The flexibility introduced by decomposing the algorithms and the objects they act on into separate modules has a strong performance impact due to its negative effect on temporal locality. This necessitates breaking the barriers between modules to perform cross-component optimisation. In this instance the task reduces to one of collective loop fusion and array contraction.
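    On a tiny example, the optimisation the abstract arrives at looks as follows (a Python/NumPy sketch, assuming nothing about the Aldor framework): two modular pipeline stages that communicate through a temporary array are fused into one loop, and the temporary is contracted to a scalar, restoring temporal locality.

```python
import numpy as np

def modular(x):
    t = np.empty_like(x)        # intermediate array (poor locality)
    for i in range(len(x)):     # stage 1
        t[i] = 2.0 * x[i]
    y = np.empty_like(x)
    for i in range(len(x)):     # stage 2 revisits t long after stage 1
        y[i] = t[i] + 1.0
    return y

def fused(x):
    y = np.empty_like(x)
    for i in range(len(x)):     # loops fused...
        t = 2.0 * x[i]          # ...and the array t contracted to a scalar
        y[i] = t + 1.0
    return y

x = np.arange(4.0)
assert np.allclose(modular(x), fused(x))
```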