Optimizing JavaScript Engines for Modern-day Workloads
In modern times, we have seen a tremendous increase in the popularity and usage of web-based applications. Applications such as presentation software and word processors, which were traditionally considered desktop applications, are being ported to the web by compiling them to JavaScript. Since JavaScript is the de facto language of the web, JavaScript engine performance significantly affects the overall web application experience. JavaScript, initially intended solely as a client-side scripting language for web browsers, is now also used to implement server-side web applications (node.js) that traditionally have been written in languages like Java. Web application developers expect "C"-like performance from their applications. Thus, there is a need to reevaluate the optimization strategies implemented in modern-day engines. Thesis statement: I propose that by using run-time and ahead-of-time profiling and type specialization techniques, it is possible to improve the performance of JavaScript engines to cater to the needs of modern-day workloads. In this dissertation, we present an improved synergistic type specialization strategy for optimized JavaScript code execution, implemented on top of a research JavaScript engine called MuscalietJS. Our technique combines type feedback and type inference so that they reinforce and augment each other. We then present a novel deoptimization strategy that enables type-specialized code generation on top of typed, stack-based virtual machines like the CLR. We also describe a server-side offline profiling technique that collects profile information for web applications, which helps client JavaScript engines (running in the browser) avoid deoptimizations and improve application performance. Finally, we describe a technique to improve the performance of server-side JavaScript code by making use of intelligent profile caching and two new type stability heuristics.
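The combination of type feedback, type specialization and deoptimization described above can be illustrated with a toy model. The following is a minimal sketch, assuming a hypothetical `AddSite` class that stands for one `+` operation in an engine: it records the operand types it observes, installs a guarded fast path once it has only ever seen one type pair, and discards that fast path (a stand-in for deoptimization) when the guard fails. It does not model MuscalietJS, type inference, or code generation on the CLR.

```python
from collections import Counter
from typing import Any, Optional, Tuple


class AddSite:
    """Hypothetical '+' site with type feedback, specialization and deoptimization.

    After `threshold` executions, if exactly one operand-type pair has been
    observed, a specialized fast path guarded by a type check is installed.
    If the guard later fails, the specialization is discarded (a stand-in for
    deoptimization) and execution continues on the generic path.
    """

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold
        self.feedback: Counter = Counter()  # (type_a, type_b) -> count
        self.specialized: Optional[Tuple[type, type]] = None
        self.deopts = 0

    @staticmethod
    def generic_add(a: Any, b: Any) -> Any:
        # Slow path, loosely modeled on JavaScript '+': concatenate if either is a str.
        if isinstance(a, str) or isinstance(b, str):
            return str(a) + str(b)
        return a + b

    def __call__(self, a: Any, b: Any) -> Any:
        if self.specialized is not None:
            ta, tb = self.specialized
            if type(a) is ta and type(b) is tb:
                return a + b  # fast path: guard passed, no generic dispatch
            self.specialized = None  # guard failed: deoptimize
            self.deopts += 1
        self.feedback[(type(a), type(b))] += 1
        if sum(self.feedback.values()) >= self.threshold and len(self.feedback) == 1:
            self.specialized = next(iter(self.feedback))
        return self.generic_add(a, b)


site = AddSite()
for i in range(3):
    site(i, 1)  # warm-up: records (int, int) three times, then specializes
print(site.specialized, site(2, 3))  # (<class 'int'>, <class 'int'>) 5
print(site(1, "x"), site.deopts)     # 1x 1
```

Once the feedback has become polymorphic, this sketch never re-specializes; a real engine would choose its own policy for re-optimization.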
Expression Acceleration: Seamless Parallelization of Typed High-Level Languages
Efficient parallelization of algorithms on general-purpose GPUs is essential today in many areas. However, it is a non-trivial task for software engineers to use GPUs to improve the performance of high-level programs in general. Although many domain-specific approaches are available for GPU acceleration, it is difficult to accelerate existing high-level programs without rewriting parts of them in low-level GPU code. In this paper, we propose a different approach, where expressions are marked for acceleration and the compiler automatically infers which code needs to be accelerated. We call this approach expression acceleration. We design a compiler pipeline for the approach and show how to handle several challenges, including expression extraction, well-formedness, and compilation using multiple backends. The approach is designed and implemented within a statically typed functional intermediate language and evaluated using three distinct non-trivial case studies.
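One of the challenges named above, expression extraction, amounts to turning a marked sub-expression into a closed kernel that a separate backend can compile. The sketch below assumes a hypothetical lambda-calculus AST with an `Accelerate` marker node (not the paper's intermediate language): each marked expression is lifted into a lambda over its free variables and replaced by a call to that kernel. Type checking, well-formedness, and the GPU backends are not modeled.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple, Union


@dataclass(frozen=True)
class Var:
    name: str


@dataclass(frozen=True)
class Lam:
    param: str
    body: "Expr"


@dataclass(frozen=True)
class App:
    fn: "Expr"
    arg: "Expr"


@dataclass(frozen=True)
class Accelerate:
    """Hypothetical marker: this sub-expression should run on an accelerator."""
    body: "Expr"


Expr = Union[Var, Lam, App, Accelerate]


def free_vars(e: Expr) -> Set[str]:
    """Variables used in `e` but not bound inside it."""
    if isinstance(e, Var):
        return {e.name}
    if isinstance(e, Lam):
        return free_vars(e.body) - {e.param}
    if isinstance(e, App):
        return free_vars(e.fn) | free_vars(e.arg)
    return free_vars(e.body)  # Accelerate


def extract(e: Expr, kernels: List[Tuple[str, Expr]]) -> Expr:
    """Lift each marked expression into a closed kernel and call it instead.

    The kernel abstracts over the marked expression's free variables (sorted
    for determinism), so it can be compiled on its own by another backend.
    """
    if isinstance(e, Accelerate):
        params = sorted(free_vars(e.body))
        kernel: Expr = e.body
        for p in reversed(params):
            kernel = Lam(p, kernel)
        name = f"kernel{len(kernels)}"
        kernels.append((name, kernel))
        call: Expr = Var(name)
        for p in params:
            call = App(call, Var(p))
        return call
    if isinstance(e, Lam):
        return Lam(e.param, extract(e.body, kernels))
    if isinstance(e, App):
        return App(extract(e.fn, kernels), extract(e.arg, kernels))
    return e


# (\x. accelerate(map f x)) becomes (\x. kernel0 f map x) plus a closed kernel0.
prog = Lam("x", Accelerate(App(App(Var("map"), Var("f")), Var("x"))))
kernels: List[Tuple[str, Expr]] = []
host = extract(prog, kernels)
```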
On the fly type specialization without type analysis
Dynamically typed programming languages such as JavaScript and Python defer type checking to run time. In order to maximize performance, dynamic language virtual machine implementations must attempt to eliminate redundant dynamic type checks. This is typically done using type inference analysis. However, type inference analyses are often costly and involve tradeoffs between compilation time and resulting precision. This has led to the creation of increasingly complex multi-tiered VM architectures.
We introduce lazy basic block versioning, a simple just-in-time compilation technique which effectively removes redundant type checks from critical code paths. This novel approach lazily generates type-specialized versions of basic blocks on the fly while propagating context-dependent type information. It does not require the use of costly program analyses, is not restricted by the precision limitations of traditional type analyses, and avoids the implementation complexity of speculative optimization techniques.
Three extensions are made to the basic block versioning technique in order to give it interprocedural optimization capabilities. Typed object shapes give it the ability to attach type information to object properties and global variables. Entry point specialization allows it to pass type information from callers to callees, and call continuation specialization makes it possible to pass return value type information back to callers without dynamic overhead. We empirically demonstrate that these extensions enable basic block versioning to exceed the capabilities of static whole-program type analyses.
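The core idea of lazy basic block versioning (a block version per type context, generated only when first reached, with type tests dropped when the context already proves the type) can be shown on a toy interpreter. This is a minimal sketch under stated assumptions: the tuple-based IR, the `check_int` instruction and the `LazyBBV` class are hypothetical, "versions" are instruction lists rather than machine code, and a failed test simply raises instead of branching to another version. The interprocedural extensions (typed shapes, entry point and continuation specialization) are not modeled.

```python
from typing import Dict, FrozenSet, List, Tuple

Ctx = FrozenSet[str]  # variables known to hold ints in a given block version

# Hypothetical toy IR for: s = 0; i = 0; while i < n: s += i; i += 1; return s
# A naive compiler guards every arithmetic operand with a dynamic type test.
PROGRAM: Dict[str, List[tuple]] = {
    "entry": [("const", "s", 0), ("const", "i", 0), ("jump", "loop")],
    "loop": [("check_int", "i"), ("check_int", "n"), ("branch_lt", "i", "n", "body", "done")],
    "body": [("check_int", "s"), ("check_int", "i"),
             ("add", "s", "s", "i"), ("add_const", "i", "i", 1), ("jump", "loop")],
    "done": [("ret", "s")],
}


class LazyBBV:
    def __init__(self, program: Dict[str, List[tuple]]) -> None:
        self.program = program
        self.versions: Dict[Tuple[str, Ctx], List[tuple]] = {}
        self.checks_executed = 0

    def version(self, label: str, ctx: Ctx) -> List[tuple]:
        """Return the version of `label` specialized for `ctx`, generating it on first use."""
        key = (label, ctx)
        if key in self.versions:
            return self.versions[key]
        known, out = set(ctx), []
        for ins in self.program[label]:
            op = ins[0]
            if op == "check_int":
                if ins[1] not in known:  # keep the test only if the type is unknown
                    out.append(ins)
                    known.add(ins[1])  # past this point the test has succeeded
            elif op in ("const", "add", "add_const"):
                out.append(ins)
                if op == "const":
                    is_int = isinstance(ins[2], int)
                elif op == "add":
                    is_int = ins[2] in known and ins[3] in known
                else:
                    is_int = ins[2] in known and isinstance(ins[3], int)
                if is_int:
                    known.add(ins[1])
                else:
                    known.discard(ins[1])
            elif op in ("jump", "branch_lt"):
                # Successors are only named by (label, context); generated when reached.
                out.append(ins + (frozenset(known),))
            else:  # ret
                out.append(ins)
        self.versions[key] = out
        return out

    def run(self, env: dict) -> int:
        label, ctx = "entry", frozenset()
        while True:
            for ins in self.version(label, ctx):
                op = ins[0]
                if op == "check_int":
                    self.checks_executed += 1
                    if not isinstance(env[ins[1]], int):
                        # A full implementation would branch to another version here.
                        raise TypeError(f"{ins[1]} is not an int")
                elif op == "const":
                    env[ins[1]] = ins[2]
                elif op == "add":
                    env[ins[1]] = env[ins[2]] + env[ins[3]]
                elif op == "add_const":
                    env[ins[1]] = env[ins[2]] + ins[3]
                elif op == "jump":
                    label, ctx = ins[1], ins[2]
                    break
                elif op == "branch_lt":
                    label = ins[3] if env[ins[1]] < env[ins[2]] else ins[4]
                    ctx = ins[5]
                    break
                else:  # ret
                    return env[ins[1]]


vm = LazyBBV(PROGRAM)
result = vm.run({"n": 5})  # 0 + 1 + 2 + 3 + 4 = 10
```

In this run only one type test executes (on `n`, at the first entry to `loop`); a naive interpreter of the same IR would execute 22 tests for `n = 5`.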
Preemptive type checking in dynamically typed programs
With the rise of languages such as JavaScript, dynamically typed languages have gained a strong foothold in the programming language landscape. These languages are very well suited for rapid prototyping and for use with agile programming methodologies. However, programmers would benefit from the ability to detect type errors in their code early, without imposing unnecessary restrictions on their programs.
Here we describe a new type inference system that identifies potential type errors through a flow-sensitive static analysis. This analysis is invoked at a very late stage, after the compilation to bytecode and initialisation of the program. For every expression, it computes the variable's present types (from the values most recently assigned to it) and future types (from the ways it is used in the subsequent program execution). Using this information, our mechanism inserts type checks at strategic points in the original program. We prove that these checks, inserted as early as possible, preempt type errors earlier than existing type systems. We further show that these checks do not change the semantics of programs that do not raise type errors.
Preemptive type checking can be added to existing languages without the need to modify the existing runtime environment. We show this with an implementation for the Python language and demonstrate its effectiveness on a number of benchmarks.
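The "future types" idea can be illustrated on straight-line code: a backward pass records which type each variable will be required to have, and a check is inserted right after each assignment, so a bad value is reported before unrelated work runs. This is a minimal sketch assuming a hypothetical tuple-based statement format, a single required type per variable between assignments, and no control flow; it is not the paper's bytecode-level, flow-sensitive analysis.

```python
from typing import Dict, List, Tuple

# Hypothetical straight-line program representation (not the paper's bytecode):
#   ("assign", var, value)   bind var to a constant value
#   ("work", label)          a side effect unrelated to types
#   ("use", var, type)       an operation that requires var to have `type`
Stmt = Tuple


def future_types(prog: List[Stmt]) -> List[Dict[str, type]]:
    """Backward pass: for each statement, the type each variable must have after it."""
    after: List[Dict[str, type]] = [{} for _ in prog]
    need: Dict[str, type] = {}
    for idx in range(len(prog) - 1, -1, -1):
        after[idx] = dict(need)
        kind = prog[idx][0]
        if kind == "use":
            need[prog[idx][1]] = prog[idx][2]
        elif kind == "assign":
            need.pop(prog[idx][1], None)  # earlier values of the variable are dead here
    return after


def insert_checks(prog: List[Stmt]) -> List[Stmt]:
    """Insert a check immediately after each assignment whose value has a future type."""
    out: List[Stmt] = []
    for stmt, after in zip(prog, future_types(prog)):
        out.append(stmt)
        if stmt[0] == "assign" and stmt[1] in after:
            out.append(("check", stmt[1], after[stmt[1]]))
    return out


def run(prog: List[Stmt], log: List[str]) -> None:
    env: Dict[str, object] = {}
    for pc, stmt in enumerate(prog):
        kind = stmt[0]
        if kind == "assign":
            env[stmt[1]] = stmt[2]
        elif kind in ("use", "check"):
            if not isinstance(env[stmt[1]], stmt[2]):
                raise TypeError(f"{kind} at {pc}: {stmt[1]} is not {stmt[2].__name__}")
        else:  # work
            log.append(stmt[1])


prog = [("assign", "x", "oops"), ("work", "expensive step"), ("use", "x", int)]
checked = insert_checks(prog)  # a check on x now follows its assignment
```

Running `prog` performs the expensive step and only then fails at the use; running `checked` fails before the step. For programs without type errors the inserted checks only read values, so their behavior is unchanged.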
Opportunistic acceleration of array-centric Python computation in heterogeneous environments
Dynamic scripting languages, like Python, are growing in popularity and increasingly used by non-expert programmers. These languages provide high level abstractions such as safe memory management, dynamic type handling and array bounds checking. The reduction in boilerplate code enables the concise expression of computation compared to statically typed and compiled languages. This improves programmer productivity. Increasingly, scripting languages are used by domain experts to write numerically intensive code in a variety of domains (e.g. Economics, Zoology, Archaeology and Physics). These programs are often used not just for prototyping but also in deployment. However, such managed program execution comes with a significant performance penalty arising from the interpreter having to decode and dispatch based on dynamic type checking.
Modern computer systems are increasingly equipped with accelerators such as GPUs. However, the massive speedups that can be achieved by GPU accelerators come at the cost of program complexity. Directly programming a GPU requires a deep understanding of the computational model of the underlying hardware architecture. While the complexity of such devices is abstracted by programming languages specialised for heterogeneous devices such as CUDA and OpenCL, these are dialects of the low-level C systems programming language used primarily by expert programmers.
This thesis presents the design and implementation of ALPyNA, a loop parallelisation and GPU code generation framework. A novel staged parallelisation approach is used to aggressively parallelise each execution instance of a loop nest. Loop dependence relationships that cannot be inferred statically are deferred for runtime analysis. At runtime, these dependences are augmented with runtime information obtained by introspection and the loop nest is parallelised. Parallel GPU kernels are customised to the runtime dependence graph, JIT compiled and executed.
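The staged approach above defers dependence questions that depend on run-time values. As an illustration, consider a loop `for i in range(n): a[i + k] = a[i] * 2` where `k` is unknown at compile time. The sketch below, assuming `k >= 0` and `len(a) >= n + k`, tests at run time whether any iteration reads a value written by another, and only then takes the parallel path. The "parallel" function is a hypothetical stand-in that performs all reads before any write, mimicking the semantics of a data-parallel kernel; it does not generate or launch GPU code, and none of these names come from ALPyNA.

```python
from typing import List, Tuple


def sequential(a: List[int], n: int, k: int) -> List[int]:
    """Reference semantics of the loop: for i in range(n): a[i + k] = a[i] * 2."""
    a = list(a)
    for i in range(n):
        a[i + k] = a[i] * 2
    return a


def independent(n: int, k: int) -> bool:
    """Run-time dependence test for the write a[i + k] and the read a[j].

    Iteration j reads a value written by an earlier iteration i exactly when
    j = i + k with 0 <= i < j < n, which is possible iff 0 < k < n.
    Assumes k >= 0.
    """
    return k == 0 or k >= n


def parallel(a: List[int], n: int, k: int) -> List[int]:
    """Stand-in for a data-parallel kernel: every read happens before any write."""
    a = list(a)
    reads = [a[i] for i in range(n)]
    for i in range(n):
        a[i + k] = reads[i] * 2
    return a


def run_loop(a: List[int], n: int, k: int) -> Tuple[List[int], str]:
    # k is unknown when the program is compiled; the decision is deferred to here.
    if independent(n, k):
        return parallel(a, n, k), "parallel"
    return sequential(a, n, k), "sequential"


a = list(range(10))
print(run_loop(a, 4, 5)[1])  # parallel: writes a[5..8] never overlap reads a[0..3]
print(run_loop(a, 4, 1)[1])  # sequential: iteration i + 1 reads what iteration i wrote
```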
A systematic analysis of the execution speed of loop nests is performed using 12 standard loop intensive benchmarks. The evaluation is performed on two CPU–GPU machines. One is a server grade machine while the other is a typical desktop. ALPyNA’s GPU kernels achieve orders of magnitude speedup over the baseline interpreter execution time (up to 16435x) and large speedups (up to 179.55x) over JIT compiled CPU code.
The varied performance of JIT-compiled GPU code motivates the need for a sophisticated cost model to select the device providing the best speedups at runtime for varying domain sizes. This thesis describes a novel lightweight analytical cost model to determine the fastest device to execute a loop nest at runtime. The ALPyNA Cost Model (ACM) adapts to runtime dependence analysis and is parameterised on the hardware characteristics of the underlying target CPU or GPU. The cost model also takes into account the relative rate at which the interpreter is able to supply the GPU with computational work. ACM is re-targetable to other accelerator devices and requires only minimal install-time profiling.
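The general shape of an analytical device-selection model can be sketched as two predicted run times compared at run time. The formulas and every constant in `DeviceParams` below are hypothetical placeholders, not ACM's actual model: the CPU cost is linear in the iteration count, while the GPU pays a fixed launch overhead plus transfer cost before its wide parallelism helps. ACM's adaptation to dependence analysis and to the interpreter's work-supply rate is not modeled.

```python
import math
from dataclasses import dataclass


@dataclass
class DeviceParams:
    """Hypothetical per-machine constants, e.g. measured once at install time."""
    cpu_ns_per_iter: float       # JIT-compiled CPU cost per loop iteration
    gpu_ns_per_wave: float       # GPU cost for one wave of `gpu_lanes` iterations
    gpu_lanes: int               # iterations the GPU executes concurrently
    gpu_launch_ns: float         # fixed kernel launch overhead
    transfer_ns_per_byte: float  # host <-> device copy cost


def predict_cpu(p: DeviceParams, n: int) -> float:
    return n * p.cpu_ns_per_iter


def predict_gpu(p: DeviceParams, n: int, bytes_per_iter: int) -> float:
    waves = math.ceil(n / p.gpu_lanes)
    transfer = n * bytes_per_iter * p.transfer_ns_per_byte
    return p.gpu_launch_ns + transfer + waves * p.gpu_ns_per_wave


def choose_device(p: DeviceParams, n: int, bytes_per_iter: int) -> str:
    """Pick whichever device the model predicts to be faster for this domain size."""
    if predict_gpu(p, n, bytes_per_iter) < predict_cpu(p, n):
        return "gpu"
    return "cpu"


params = DeviceParams(cpu_ns_per_iter=2.0, gpu_ns_per_wave=50.0, gpu_lanes=1024,
                      gpu_launch_ns=10_000.0, transfer_ns_per_byte=0.1)
print(choose_device(params, 100, 8))         # cpu: launch overhead dominates
print(choose_device(params, 10_000_000, 8))  # gpu
```

The crossover point between the two devices depends entirely on the measured constants, which is why such a model needs per-machine profiling.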
Subheap-Augmented Garbage Collection
Automated memory management avoids the tedium and danger of manual techniques. However, as no programmer input is required, no widely available interface exists to permit principled control over sometimes unacceptable performance costs. This dissertation explores the idea that performance-oriented languages should give programmers greater control over where and when the garbage collector (GC) expends effort. We describe an interface and implementation to expose heap partitioning and collection decisions without compromising type safety. We show that our interface allows the programmer to encode a form of reference counting using Hayes' notion of key objects. Preliminary experimental data suggests that our proposed mechanism can avoid high overheads suffered by tracing collectors in some scenarios, especially with tight heaps. However, for other applications, the costs of applying subheaps (in human effort and runtime overheads) remain daunting.
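The key-object form of reference counting mentioned above can be pictured as follows: objects are allocated into a subheap tied to one key object, and when the key's count drops to zero the whole subheap is reclaimed at once, without tracing its contents. This is a minimal sketch with a hypothetical `Subheap` class; it assumes nothing outside the subheap still refers to its objects, which is precisely the property the dissertation's interface must handle without compromising type safety, and which this simulation does not enforce.

```python
from typing import Any, List


class Subheap:
    """Hypothetical subheap whose lifetime is tied to a single key object.

    Every object allocated here is assumed to die no later than the key, so
    when the key's reference count reaches zero the whole subheap is
    reclaimed in one step instead of being traced object by object.
    """

    def __init__(self, name: str) -> None:
        self.name = name
        self.objects: List[Any] = []
        self.key_refs = 0
        self.freed = False

    def alloc(self, obj: Any) -> Any:
        if self.freed:
            raise RuntimeError(f"subheap {self.name!r} has already been reclaimed")
        self.objects.append(obj)
        return obj

    def retain_key(self) -> None:
        self.key_refs += 1

    def release_key(self) -> int:
        """Drop one key reference; return how many objects were reclaimed (0 if still live)."""
        self.key_refs -= 1
        if self.key_refs == 0:
            reclaimed = len(self.objects)
            self.objects.clear()
            self.freed = True
            return reclaimed
        return 0


heap = Subheap("request")
heap.retain_key()
heap.retain_key()
for i in range(3):
    heap.alloc({"i": i})
heap.release_key()  # one reference to the key remains, nothing reclaimed
heap.release_key()  # last reference dropped: all three objects reclaimed together
```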
Micro Virtual Machines: A Solid Foundation for Managed Language Implementation
Today new programming languages proliferate, but many of them suffer from poor performance and inscrutable semantics. We assert that the root of many of the performance and semantic problems of today's languages is that language implementation is extremely difficult. This thesis addresses the fundamental challenges of efficiently developing high-level managed languages.
Modern high-level languages provide abstractions over execution, memory management and concurrency. Properly managing these concerns requires enormous intellectual capability and engineering effort. Lacking such resources, developers usually choose naive implementation approaches in the early stages of language design, a strategy which too often has long-term consequences, hindering the future development of the language. Existing language development platforms have failed to provide the right level of abstraction, and have forced implementers to reinvent low-level mechanisms in order to obtain performance.
My thesis is that the introduction of micro virtual machines will allow the development of higher-quality, high-performance managed languages.
The first contribution of this thesis is the design of Mu, with the specification of Mu as the main outcome. Mu is the first micro virtual machine, a robust, performant, and lightweight abstraction over just three concerns: execution, concurrency and garbage collection. Such a foundation attacks three of the most fundamental and challenging issues that face existing language designs and implementations, leaving language implementers free to focus on the higher levels of their language design.
The second contribution is an in-depth analysis of on-stack replacement and its efficient implementation. This low-level mechanism underpins run-time feedback-directed optimisation, which is key to the efficient implementation of dynamic languages.
The third contribution is demonstrating the viability of Mu through RPython, a real-world non-trivial language implementation. We also conducted some preliminary research on GHC as a Mu client.
We have created the Mu specification and its reference implementation, both of which are open-source. We show that Mu's on-stack replacement API can gracefully support dynamic languages such as JavaScript, and that it is implementable on concrete hardware. Our RPython client has been able to translate and execute non-trivial RPython programs, and can run the RPySOM interpreter and the core of the PyPy interpreter.
With micro virtual machines providing a low-level substrate, language developers now have the option to build their next language on a micro virtual machine. We believe that the quality of programming languages will be improved as a result.
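On-stack replacement, the second contribution above, lets a running method move from one version of its code to another in the middle of execution, carrying its live state along. The sketch below only illustrates that concept and does not use Mu's API: a hypothetical "interpreted" loop counts back-edges and, once hot, hands its live variables `i` and `acc` to an "optimized" function that finishes the computation from that point. A real OSR implementation rewrites stack frames; here the transfer is simulated by a call with the captured state.

```python
OSR_THRESHOLD = 1000  # hypothetical number of loop back-edges before transferring


def optimized_sum_from(i: int, n: int, acc: int) -> int:
    """Stand-in for optimized code entered mid-loop with the interpreter's live state.

    Computes acc + i + (i + 1) + ... + (n - 1) in closed form.
    """
    if i >= n:
        return acc
    return acc + (i + n - 1) * (n - i) // 2


def interpreted_sum(n: int, events: list) -> int:
    """'Interpreted' version of: acc = 0; for i in range(n): acc += i."""
    i, acc, backedges = 0, 0, 0
    while i < n:
        acc += i
        i += 1
        backedges += 1
        if backedges == OSR_THRESHOLD:
            # On-stack replacement: capture this frame's live variables (i, acc)
            # and continue in the optimized code instead of restarting the loop.
            events.append(("osr", i, acc))
            return optimized_sum_from(i, n, acc)
    return acc


events: list = []
total = interpreted_sum(5000, events)  # same result as sum(range(5000)), OSR after 1000 iterations
```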
Actionable Program Analyses for Improving Software Performance
Nowadays, we have greater expectations of software than ever before, along with constant pressure to run the same program on smaller and cheaper machines. To meet this demand, application performance has become an essential concern in software development. Unfortunately, many applications still suffer from performance issues: coding or design errors that lead to performance degradation. However, finding performance issues is a challenging task: there is limited knowledge on how performance issues are discovered and fixed in practice, and current performance profilers report only where resources are spent, not where resources are wasted. The goal of this dissertation is to investigate actionable performance analyses that help developers optimize their software by applying relatively simple code changes. To understand causes and fixes of performance issues in real-world software, we first present an empirical study of 98 issues in popular JavaScript projects. The study illustrates the prevalence of simple and recurring optimization patterns that lead to significant performance improvements. Then, to help developers optimize their code, we propose two actionable performance analyses that suggest optimizations based on reordering opportunities and method inlining. In this work, we focus on optimizations with four key properties. First, the optimizations are effective, that is, the changes suggested by the analysis lead to statistically significant performance improvements. Second, the optimizations are exploitable, that is, they are easy to understand and apply. Third, the optimizations are recurring, that is, they are applicable across multiple projects. Fourth, the optimizations are out of reach for compilers, that is, compilers cannot guarantee that a code transformation preserves the original semantics. To reliably detect optimization opportunities and measure their performance benefits, the code must be executed with sufficient test inputs. The last contribution complements state-of-the-art test generation techniques by proposing a novel automated approach for generating effective tests for higher-order functions. We implement our techniques in practical tools and evaluate their effectiveness on a set of popular software systems. The empirical evaluation demonstrates the potential of actionable analyses in improving software performance through relatively simple optimization opportunities.
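A reordering opportunity of the kind mentioned above often involves a short-circuiting conjunction such as `a && b && c`: if the checks are independent and side-effect free, evaluating cheap, frequently false checks first lowers the expected cost. The sketch below uses the standard exchange-argument result (order checks by cost divided by probability of being false, ascending) and verifies it by brute force; the check names, costs and probabilities are hypothetical inputs, not measurements or the dissertation's analysis. The side-effect-freedom assumption is exactly what a compiler usually cannot prove, which is why such changes are suggested to developers rather than applied automatically.

```python
from itertools import permutations
from typing import List, Sequence, Tuple

Check = Tuple[str, float, float]  # (name, cost, probability that the check is true)


def expected_cost(order: Sequence[Check]) -> float:
    """Expected cost of evaluating `c1 && c2 && ...` with short-circuiting."""
    total, reach = 0.0, 1.0
    for _, cost, p_true in order:
        total += reach * cost  # a check runs only if all earlier ones were true
        reach *= p_true
    return total


def best_order(checks: List[Check]) -> List[Check]:
    """Sort by cost / P(false), ascending; assumes independent, side-effect-free checks."""
    return sorted(checks, key=lambda c: c[1] / (1.0 - c[2]) if c[2] < 1.0 else float("inf"))


def brute_force_best(checks: List[Check]) -> Tuple[Check, ...]:
    """Exhaustive search over all orders, used to confirm `best_order` on small inputs."""
    return min(permutations(checks), key=expected_cost)


checks = [("isValid", 10.0, 0.9), ("hasKey", 1.0, 0.5), ("inCache", 3.0, 0.2)]
print([name for name, _, _ in best_order(checks)])  # ['hasKey', 'inCache', 'isValid']
print(expected_cost(best_order(checks)))            # 3.5, versus about 12.25 in the original order
```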
Bridging the Gap between Machine and Language using First-Class Building Blocks
High-performance virtual machines (VMs) are increasingly reused for programming languages for which they were not initially designed. Unfortunately, VMs are usually tailored to specific languages, offer only a very limited interface to running applications, and are closed to extensions. As a consequence, extensions required to support new languages often entail the construction of custom VMs, thus impacting reuse, compatibility and performance. Short of building a custom VM, the language designer has to choose between the expressiveness and the performance of the language. In this dissertation, we argue that the best way to open the VM is to eliminate it. We present Pinocchio, a natively compiled Smalltalk, in which we identify and reify three basic building blocks for object-oriented languages. First, we define a protocol for message passing similar to calling conventions, independent of the actual message lookup mechanism. The lookup is provided by a self-supporting runtime library written in Smalltalk and compiled to native code. Since it unifies the meta-level and the base level, we obtain a metaobject protocol (MOP). Then we decouple the language-level manipulation of state from the machine-level implementation by extending the structural reflective model of the language with object layouts, layout scopes and slots. Finally, we reify behavior using AST nodes and first-class interpreters separate from the low-level language implementation. We describe the implementations of all three first-class building blocks. For each of the blocks, we provide a series of examples illustrating how they enable typical extensions to the runtime, and we provide benchmarks validating the practicality of the approaches.
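The first building block, a message-passing protocol that is independent of the lookup mechanism, can be illustrated in a few lines: the send operation is fixed, while the lookup is an ordinary, replaceable object. This Python sketch is only an analogy to Pinocchio's natively compiled Smalltalk runtime; `Behavior`, `Instance`, `ClassChainLookup`, `TracingLookup` and `Runtime` are hypothetical names, and layouts, slots and first-class interpreters are not modeled.

```python
from typing import Any, Callable, Dict, List, Optional

Method = Callable[..., Any]


class Behavior:
    """A minimal class: a name, a superclass and a method dictionary."""

    def __init__(self, name: str, superclass: Optional["Behavior"] = None,
                 methods: Optional[Dict[str, Method]] = None) -> None:
        self.name = name
        self.superclass = superclass
        self.methods: Dict[str, Method] = dict(methods or {})


class Instance:
    def __init__(self, cls: Behavior, **slots: Any) -> None:
        self.cls = cls
        self.slots = dict(slots)


class ClassChainLookup:
    """Default lookup: walk the superclass chain, Smalltalk-style."""

    def lookup(self, cls: Optional[Behavior], selector: str) -> Optional[Method]:
        while cls is not None:
            if selector in cls.methods:
                return cls.methods[selector]
            cls = cls.superclass
        return None


class TracingLookup(ClassChainLookup):
    """An extension written against the same protocol: records every lookup."""

    def __init__(self) -> None:
        self.trace: List[str] = []

    def lookup(self, cls: Optional[Behavior], selector: str) -> Optional[Method]:
        self.trace.append(selector)
        return super().lookup(cls, selector)


class Runtime:
    """Message sending is fixed by protocol; the lookup object is first-class and swappable."""

    def __init__(self, lookup: ClassChainLookup) -> None:
        self.lookup = lookup

    def send(self, receiver: Instance, selector: str, *args: Any) -> Any:
        method = self.lookup.lookup(receiver.cls, selector)
        if method is None:
            raise AttributeError(f"{receiver.cls.name} does not understand #{selector}")
        return method(self, receiver, *args)


Point = Behavior("Point", methods={
    "x": lambda rt, self: self.slots["x"],
    "norm1": lambda rt, self: abs(rt.send(self, "x")) + abs(self.slots["y"]),
})
ColorPoint = Behavior("ColorPoint", Point)
rt = Runtime(TracingLookup())
p = Instance(ColorPoint, x=-3, y=4)
value = rt.send(p, "norm1")  # 7; the tracing lookup saw 'norm1' and then 'x'
```

Swapping `TracingLookup` for `ClassChainLookup` changes how methods are found without touching `Runtime.send` or the classes.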