Guided Automatic Binary Parallelisation
For decades, the software industry has amassed a vast repository of pre-compiled libraries and executables which are still valuable and actively in use. However, for a significant fraction of these binaries, most of the source code is absent or is written in old languages, making it practically impossible to recompile them for new generations of hardware. As the number of cores in chip multi-processors (CMPs) continues to scale, the performance of this legacy software becomes increasingly sub-optimal. Rewriting it as new optimised and parallel software would be a time-consuming and expensive task. Without source code, existing automatic performance-enhancing and parallelisation techniques are not applicable to legacy software or to parts of new applications linked with legacy libraries.
In this dissertation, three tools are presented to address the challenge of optimising legacy binaries. The first, GBR (Guided Binary Recompilation), is a tool that recompiles stripped application binaries without the need for the source code or relocation information. GBR performs static binary analysis to determine how recompilation should be undertaken, and produces a domain-specific hint program. This hint program is loaded and interpreted by the GBR dynamic runtime, which is built on top of the open-source dynamic binary translator, DynamoRIO. In this manner, complicated recompilation of the target binary is carried out to achieve optimised execution on a real system. The problem of limited dataflow and type information is addressed through cooperation between the hint program and JIT optimisation. The utility of GBR is demonstrated by software prefetch and vectorisation optimisations to achieve performance improvements compared to their original native execution.
The second tool is called BEEP (Binary Emulator for Estimating Parallelism), an extension to GBR for binary instrumentation.
BEEP is used to identify potential thread-level parallelism through static binary analysis and binary instrumentation.
BEEP performs preliminary static analysis on binaries and encodes all statically-undecided questions into a hint program.
The hint program is interpreted by GBR so that on-demand binary instrumentation codes are inserted to answer the questions from runtime information.
BEEP incorporates a few parallel cost models to evaluate identified parallelism under different parallelisation paradigms.
The third tool is named GABP (Guided Automatic Binary Parallelisation), an extension to GBR for parallelisation. GABP focuses on loops from sequential application binaries and automatically extracts thread-level parallelism from them on-the-fly, under the direction of the hint program, for efficient parallel execution. It employs a range of runtime schemes, such as thread-level speculation and synchronisation, to handle runtime data dependences. GABP achieves a geometric mean speedup of 1.91x on binaries from SPEC CPU2006 on a real x86-64 eight-core system compared to native sequential execution. Performance is obtained for SPEC CPU2006 executables compiled from a variety of source languages and by different compilers.
St John's Benefactor Scholarship
ARM Sponsorshi
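As a hedged illustration of how a geometric mean speedup such as the 1.91x figure in the abstract above is typically computed from per-benchmark timings, consider the minimal sketch below. The function name and the timings are hypothetical and are not taken from the dissertation.

```python
import math

def geometric_mean_speedup(native_times, parallel_times):
    """Geometric mean of per-benchmark speedups (native time / parallel time)."""
    if len(native_times) != len(parallel_times) or not native_times:
        raise ValueError("need two non-empty lists of equal length")
    # Average the logarithms of the speedups, then exponentiate.
    log_sum = sum(math.log(n / p) for n, p in zip(native_times, parallel_times))
    return math.exp(log_sum / len(native_times))

# Hypothetical timings in seconds for three benchmarks (illustrative only).
native = [100.0, 80.0, 60.0]
parallel = [50.0, 80.0, 15.0]
speedup = geometric_mean_speedup(native, parallel)
```

The geometric mean is the usual choice for summarising ratios because a single benchmark with a very large speedup cannot dominate the result the way it would in an arithmetic mean.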
Micro Virtual Machines: A Solid Foundation for Managed Language Implementation
Today new programming languages proliferate, but many of them suffer from poor performance and inscrutable semantics. We assert that the root of many of the performance and semantic problems of today's languages is that language implementation is extremely difficult. This thesis addresses the fundamental challenges of efficiently developing high-level managed languages.
Modern high-level languages provide abstractions over execution, memory management and concurrency. It requires enormous intellectual capability and engineering effort to properly manage these concerns. Lacking such resources, developers usually choose naive implementation approaches in the early stages of language design, a strategy which too often has long-term consequences, hindering the future development of the language. Existing language development platforms have failed to provide the right level of abstraction, and forced implementers to reinvent low-level mechanisms in order to obtain performance.
My thesis is that the introduction of micro virtual machines will allow the development of higher-quality, high-performance managed languages.
The first contribution of this thesis is the design of Mu, with the specification of Mu as the main outcome. Mu is the first micro virtual machine, a robust, performant, and light-weight abstraction over just three concerns: execution, concurrency and garbage collection. Such a foundation attacks three of the most fundamental and challenging issues that face existing language designs and implementations, leaving the language implementers free to focus on the higher levels of their language design.
The second contribution is an in-depth analysis of on-stack replacement and its efficient implementation. This low-level mechanism underpins run-time feedback-directed optimisation, which is key to the efficient implementation of dynamic languages.
The third contribution is demonstrating the viability of Mu through RPython, a real-world non-trivial language implementation. We also carried out some preliminary research on GHC as a Mu client.
We have created the Mu specification and its reference implementation, both of which are open-source. We show that Mu's on-stack replacement API can gracefully support dynamic languages such as JavaScript, and that it is implementable on concrete hardware. Our RPython client has been able to translate and execute non-trivial RPython programs, and can run the RPySOM interpreter and the core of the PyPy interpreter.
With micro virtual machines providing a low-level substrate, language developers now have the option to build their next language on a micro virtual machine. We believe that the quality of programming languages will be improved as a result.
Quantifying and Predicting the Influence of Execution Platform on Software Component Performance
The performance of software components depends on several factors, including the execution platform on which the software components run. To simplify cross-platform performance prediction in relocation and sizing scenarios, a novel approach is introduced in this thesis which separates the application performance profile from the platform performance profile. The approach is evaluated using transparent instrumentation of Java applications and with automated benchmarks for Java Virtual Machines.
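One way to picture the separation of the two profiles described above is a simple linear cost model: the application profile counts how often each kind of operation is executed, and the platform profile gives a per-operation cost measured by benchmarks. The sketch below is only an assumption made for illustration; the operation names, the costs and the `predict_time` function are hypothetical and are not the thesis's actual model.

```python
def predict_time(app_profile, platform_profile):
    """Predict execution time as sum(count * per-operation cost).

    app_profile:      operation name -> number of executions (platform independent)
    platform_profile: operation name -> cost in nanoseconds (application independent)
    """
    missing = set(app_profile) - set(platform_profile)
    if missing:
        raise KeyError(f"platform profile lacks costs for: {sorted(missing)}")
    return sum(count * platform_profile[op] for op, count in app_profile.items())

# Hypothetical application profile, collected once by instrumentation.
app_profile = {"int_add": 1_000_000, "field_load": 500_000, "method_call": 10_000}

# Hypothetical platform profiles, each obtained once by benchmarking a platform.
platform_a = {"int_add": 1.0, "field_load": 2.0, "method_call": 20.0}
platform_b = {"int_add": 0.5, "field_load": 1.0, "method_call": 10.0}

time_a = predict_time(app_profile, platform_a)  # nanoseconds on platform A
time_b = predict_time(app_profile, platform_b)  # nanoseconds on platform B
```

Under this assumption, relocating the application to a new platform needs only a new platform profile; the application does not have to be measured again.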
Speeding up dynamic compilation: concurrent and parallel dynamic compilation
The main challenge faced by a dynamic compilation system is to detect and
translate frequently executed program regions into highly efficient native code
as fast as possible. To efficiently reduce dynamic compilation latency, a dynamic
compilation system must improve its workload throughput, i.e. compile
more application hotspots per unit of time. As time for dynamic compilation
adds to the overall execution time, the dynamic compiler is often decoupled
and operates in a separate thread independent from the main execution loop
to reduce the overhead of dynamic compilation.
This thesis proposes innovative techniques aimed at effectively speeding
up dynamic compilation. The first contribution is a generalised region
recording scheme optimised for program representations that require dynamic
code discovery (e.g. binary program representations). The second contribution
reduces dynamic compilation cost by incrementally compiling several
hot regions in a concurrent and parallel task farm. Altogether the combination
of generalised light-weight code discovery, large translation units,
dynamic work scheduling, and concurrent and parallel dynamic compilation
ensures timely and efficient processing of compilation workloads. Compared
to state-of-the-art dynamic compilation approaches, speedups of up to 2.08
are demonstrated for industry standard benchmarks such as BioPerf, SPEC
CPU 2006, and EEMBC.
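
The task-farm idea in the paragraph above can be sketched roughly as follows: hot regions are placed in a shared priority queue and a pool of worker threads compiles them concurrently, hottest first. This is a minimal sketch under stated assumptions, not the thesis's implementation; `compile_region`, `run_task_farm`, the region names and the execution counts are all hypothetical, and Python threads are used only to show the structure.

```python
import queue
import threading

def compile_region(region_id):
    # Hypothetical stand-in for translating a hot region to native code.
    return f"native_code_for_{region_id}"

def run_task_farm(hot_regions, num_workers=4):
    """Compile (region_id, execution_count) pairs concurrently, hottest first."""
    work = queue.PriorityQueue()
    for region_id, count in hot_regions:
        work.put((-count, region_id))  # negate so the hottest region is dequeued first

    code_cache = {}
    cache_lock = threading.Lock()

    def worker():
        while True:
            try:
                _, region_id = work.get_nowait()
            except queue.Empty:
                return  # no work left for this worker
            native = compile_region(region_id)
            with cache_lock:
                code_cache[region_id] = native

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return code_cache

# Hypothetical hot regions with their execution counts.
cache = run_task_farm([("loop_a", 900), ("loop_b", 120), ("func_c", 4000)])
```

In a real dynamic compilation system the main execution loop would keep interpreting while the workers run and would pick up native code as it appears in the cache; the sketch waits for all workers only to keep the example short.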
Next, innovative applications of the proposed dynamic compilation scheme
to speed up architectural and micro-architectural performance modelling are
demonstrated. The main contribution in this context is to exploit runtime
information to dynamically generate optimised code that accurately models
architectural and micro-architectural components. Consequently, compilation
units are larger and more complex resulting in increased compilation
latencies. Large and complex compilation units present an ideal use case for
our concurrent and parallel dynamic compilation infrastructure. We demonstrate
that our novel micro-architectural performance modelling is faster than
state-of-the-art FPGA-based simulation, whilst providing the same level of
accuracy.
Simulations parallèles de Monte Carlo appliquées à la Physique des Hautes Energies pour plates-formes manycore et multicore : mise au point, optimisation, reproductibilité
During this thesis, we focused on High Performance Computing, specifically on Monte Carlo simulations applied to High Energy Physics. We worked on simulations dedicated to the propagation of particles through matter. Monte Carlo simulations require significant CPU time and memory. Our first Monte Carlo simulation took more time to simulate the physical phenomenon than the phenomenon itself takes to happen under experimental conditions, which was a real performance issue. The minimal technical aim of the thesis was to have a simulation requiring as much time as the real observed phenomenon. Our maximal target was to have a much faster simulation. Indeed, these simulations are critical to assess our correct understanding of what is observed during experimentation. The more simulated statistical samples we have, the better our results. This initial state of our simulation opened up many avenues for optimisation and high performance computing. Furthermore, in our case, increasing the performance of the simulation was pointless if it came at the cost of losing the reproducibility of results. The numerical reproducibility of the simulation was therefore an aspect we had to take into account. In this manuscript, after a state of the art on profiling, optimisation and reproducibility, we propose several strategies to gain more performance in our simulations. In each case, all the proposed optimisations followed a profiling step. One never optimises without having profiled first. Then, we looked at the design of a parallel profiler using aspect-oriented programming for our specific needs. Finally, we took a new look at the issues raised by our Monte Carlo simulations: instead of optimising existing simulations, we proposed methods for developing a new simulation from scratch, bearing in mind that it is intended for High Performance Computing and has to be statistically sound, reproducible and scalable.
In all our proposals, we looked at both multicore and manycore architectures from Intel to benchmark the performance on a server-oriented architecture and on a High Performance Computing oriented architecture. Through the implementation of our proposals, we were able to optimise one of the Monte Carlo simulations, achieving a speedup of the order of 400X once it was optimised and parallelised on a computing node with 32 physical cores. We were also able to implement a profiler with aspects, able to deal with the parallelism both of the machine it runs on and of the application it profiles. Moreover, because it relies on aspects, it is portable and not tied to any specific hardware architecture. Finally, we implemented the simulation designed to be reproducible, scalable and to have statistically sound results. We observed that these goals could be achieved whatever the target architecture for execution. This allowed us, in particular, to validate our method for verifying the numerical reproducibility of a simulation.
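The abstract above does not say how reproducibility is obtained, so the following is only a hedged sketch of one common approach: each task gets its own random stream derived from a base seed and the task index, and the partial results are combined in a fixed order, so the outcome does not depend on how many workers run the tasks. The π estimate, the seeding formula and all names are hypothetical, and deriving seeds this way does not by itself guarantee statistically independent streams.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def run_task(args):
    """Count random points falling inside the unit quarter circle for one task."""
    base_seed, task_id, samples = args
    # Each task owns a stream that depends only on the seed and the task index,
    # never on which worker happens to execute it (illustrative seeding scheme).
    rng = random.Random(base_seed * 1_000_003 + task_id)
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

def estimate_pi(base_seed, num_tasks, samples_per_task, num_workers):
    tasks = [(base_seed, t, samples_per_task) for t in range(num_tasks)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # map() returns results in task order, so the reduction order is fixed.
        hits = list(pool.map(run_task, tasks))
    return 4.0 * sum(hits) / (num_tasks * samples_per_task)

serial = estimate_pi(base_seed=42, num_tasks=8, samples_per_task=2000, num_workers=1)
parallel = estimate_pi(base_seed=42, num_tasks=8, samples_per_task=2000, num_workers=4)
```

Comparing `serial` and `parallel` for exact equality is a simple reproducibility check in this setting: the per-task hit counts are integers, so the final result is the same regardless of scheduling.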
Co-Evolution of Source Code and the Build System: Impact on the Introduction of AOSD in Legacy Systems
Software is omnipresent in our daily lives. As users demand ever more advanced features, software systems have to keep on evolving. In practice, this means that software developers need to adapt the description of a software application. Such a description not only consists of source code written down in a programming language, as a lot of knowledge is hidden in lesser known software development artifacts, like the build system. As its name suggests, the build system is responsible for building an executable program, ready for use, from the source code. There are various indications that the evolution of source code is strongly related to that of the build system. When the source code changes, the build system has to co-evolve to safeguard the ability to build an executable program. A rigid build system on the other hand limits software developers. This phenomenon especially surfaces when drastic changes in the source code are coupled with an inflexible build system, as is the case for the introduction of AOSD technology in legacy systems. AOSD is a young software development approach which enables developers to structure and compose source code in a better way. Legacy systems are old software systems which are still mission-critical, but of which the source code and the build system are no longer fully understood, and which typically make use of old(-fashioned) technology. This PhD dissertation focuses on finding an explanation for this co-evolution of source code and the build system, and on finding developer support to grasp and manage this phenomenon. We postulate four "roots of co-evolution" which represent four different ways in which source code and the build system interact with each other. Based on these roots, we have developed tool and aspect language support to understand and manage co-evolution. 
The roots and the tool support have been validated in case studies, both in the context of co-evolution in general and of the introduction of AOSD technology in legacy systems. The dissertation experimentally shows that co-evolution indeed is a real problem, but that specific software development and aspect language support enables developers to deal with it.