Guided Automatic Binary Parallelisation

Author: Zhou Ruoyu
Publication venue: University of Cambridge
Publication date: 06/04/2018

For decades, the software industry has amassed a vast repository of pre-compiled libraries and executables that are still valuable and actively in use. However, for a significant fraction of these binaries, most of the source code is absent or is written in old languages, making it practically impossible to recompile them for new generations of hardware. As the number of cores in chip multi-processors (CMPs) continues to scale, the performance of this legacy software becomes increasingly sub-optimal. Rewriting it as new, optimised and parallel software would be time-consuming and expensive. Without source code, existing automatic performance-enhancing and parallelisation techniques cannot be applied to legacy software, or to the parts of new applications that are linked with legacy libraries. In this dissertation, three tools are presented to address the challenge of optimising legacy binaries. The first, GBR (Guided Binary Recompilation), recompiles stripped application binaries without the need for source code or relocation information. GBR performs static binary analysis to determine how recompilation should be undertaken, and produces a domain-specific hint program. This hint program is loaded and interpreted by the GBR dynamic runtime, which is built on top of the open-source dynamic binary translator DynamoRIO. In this manner, complex recompilation of the target binary is carried out to achieve optimised execution on a real system. The problem of limited dataflow and type information is addressed through cooperation between the hint program and JIT optimisation. The utility of GBR is demonstrated with software prefetch and vectorisation optimisations, which improve performance over the binaries' original native execution. The second tool, BEEP (Binary Emulator for Estimating Parallelism), is an extension to GBR for binary instrumentation.
BEEP identifies potential thread-level parallelism through static binary analysis and binary instrumentation. It performs preliminary static analysis on binaries and encodes all statically undecided questions into a hint program. The hint program is interpreted by GBR, which inserts binary instrumentation code on demand to answer those questions from runtime information. BEEP incorporates a few parallel cost models to evaluate the identified parallelism under different parallelisation paradigms. The third tool, GABP (Guided Automatic Binary Parallelisation), is an extension to GBR for parallelisation. GABP focuses on loops in sequential application binaries and, under the direction of the hint program, automatically extracts thread-level parallelism from them on the fly for efficient parallel execution. It employs a range of runtime schemes, such as thread-level speculation and synchronisation, to handle runtime data dependences. GABP achieves a geometric mean speedup of 1.91x on binaries from SPEC CPU2006 on a real x86-64 eight-core system, compared to native sequential execution. These speedups are obtained for SPEC CPU2006 executables compiled from a variety of source languages and by different compilers.
St John's Benefactor Scholarship, ARM Sponsorship
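Thread-level speculation, one of the runtime schemes named for GABP, can be illustrated in miniature. The sketch below is a sequential simulation under stated assumptions, not GABP's runtime: the names `SpecContext`, `run_speculative_loop` and `prefix_sum` are hypothetical, and the "parallel" iterations of each chunk simply run one after another against a shared snapshot of memory. Each iteration records the addresses it reads and buffers its writes; iterations then commit in loop order, and any iteration that read an address written by an earlier iteration of the same chunk is squashed and re-executed, so the final memory always matches sequential execution.

```python
from typing import Callable, Dict, List, Set


class SpecContext:
    """Memory view for one loop iteration: loads are tracked, stores are buffered."""

    def __init__(self, memory: List[int]) -> None:
        self.memory = memory               # state this iteration reads from
        self.reads: Set[int] = set()       # addresses read before being written
        self.writes: Dict[int, int] = {}   # private write buffer

    def load(self, addr: int) -> int:
        if addr in self.writes:            # forward from the iteration's own buffer
            return self.writes[addr]
        self.reads.add(addr)
        return self.memory[addr]

    def store(self, addr: int, value: int) -> None:
        self.writes[addr] = value


Body = Callable[[int, SpecContext], None]


def run_speculative_loop(memory: List[int], n_iters: int, body: Body, chunk: int = 4) -> int:
    """Simulate speculative execution of a loop; returns the number of squashed iterations."""
    squashes = 0
    for start in range(0, n_iters, chunk):
        # Phase 1: every iteration in the chunk runs speculatively against
        # the same snapshot of committed memory.
        snapshot = list(memory)
        tasks = []
        for i in range(start, min(start + chunk, n_iters)):
            ctx = SpecContext(snapshot)
            body(i, ctx)
            tasks.append((i, ctx))
        # Phase 2: commit in original iteration order.
        dirty: Set[int] = set()            # addresses written by earlier commits in this chunk
        for i, ctx in tasks:
            if ctx.reads & dirty:          # read a value that an earlier iteration changed
                squashes += 1
                ctx = SpecContext(memory)  # re-execute against up-to-date memory
                body(i, ctx)
            for addr, value in ctx.writes.items():
                memory[addr] = value
            dirty.update(ctx.writes)
    return squashes


def prefix_sum(i: int, m: SpecContext) -> None:
    """Loop body with a loop-carried dependence: a[i] += a[i - 1]."""
    if i > 0:
        m.store(i, m.load(i) + m.load(i - 1))


if __name__ == "__main__":
    data = [1] * 8
    squashed = run_speculative_loop(data, len(data), prefix_sum)
    print(data, squashed)  # [1, 2, 3, 4, 5, 6, 7, 8] 5
```

A loop whose iterations touch disjoint addresses commits without any squashes, while a loop-carried dependence such as the prefix sum forces most iterations to re-execute; this is why speculation pays off only when dependences are rare at runtime.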

Apollo (Cambridge)

Sulong-OpenMP: Implementation with Sulong and Evaluation

Author: Gaikwad Swapnil
Publication date: 31/12/2020

The University of Manchester - Institutional Repository

Micro Virtual Machines: A Solid Foundation for Managed Language Implementation

Author: Wang Kunshan
Publication date: 01/01/2017

Today new programming languages proliferate, but many of them suffer from poor performance and inscrutable semantics. We assert that the root of many of the performance and semantic problems of today's languages is that language implementation is extremely difficult. This thesis addresses the fundamental challenges of efficiently developing high-level managed languages. Modern high-level languages provide abstractions over execution, memory management and concurrency. It requires enormous intellectual capability and engineering effort to properly manage these concerns. Lacking such resources, developers usually choose naive implementation approaches in the early stages of language design, a strategy which too often has long-term consequences, hindering the future development of the language. Existing language development platforms have failed to provide the right level of abstraction, and forced implementers to reinvent low-level mechanisms in order to obtain performance. My thesis is that the introduction of micro virtual machines will allow the development of higher-quality, high-performance managed languages. The first contribution of this thesis is the design of Mu, with the specification of Mu as the main outcome. Mu is the first micro virtual machine, a robust, performant, and light-weight abstraction over just three concerns: execution, concurrency and garbage collection. Such a foundation attacks three of the most fundamental and challenging issues that face existing language designs and implementations, leaving the language implementers free to focus on the higher levels of their language design. The second contribution is an in-depth analysis of on-stack replacement and its efficient implementation. This low-level mechanism underpins run-time feedback-directed optimisation, which is key to the efficient implementation of dynamic languages. 
The third contribution is demonstrating the viability of Mu through RPython, a real-world, non-trivial language implementation. We also conducted preliminary research on GHC as a Mu client. We have created the Mu specification and its reference implementation, both of which are open-source. We show that Mu's on-stack replacement API can gracefully support dynamic languages such as JavaScript, and that it is implementable on concrete hardware. Our RPython client is able to translate and execute non-trivial RPython programs, and can run the RPySOM interpreter and the core of the PyPy interpreter. With micro virtual machines providing a low-level substrate, language developers now have the option to build their next language on a micro virtual machine. We believe that the quality of programming languages will improve as a result.
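The idea behind on-stack replacement can be shown with a toy example: a frame that is already running captures its live variables at a transfer point and continues in a different (optimised) version of the code, without restarting the computation. This is a sketch of the general concept only, not Mu's OSR API; the function names, the `OSR_THRESHOLD` value and the closed-form "optimised" continuation are all hypothetical.

```python
from typing import Dict, Tuple

OSR_THRESHOLD = 1000  # hypothetical iteration count at which the loop counts as hot


def sum_squares_optimised(state: Dict[str, int]) -> int:
    """Optimised continuation that resumes from the captured loop state."""
    i, acc, n = state["i"], state["acc"], state["n"]

    def prefix(m: int) -> int:
        # Closed form of sum_{k=0}^{m-1} k^2, replacing the remaining iterations.
        return (m - 1) * m * (2 * m - 1) // 6

    return acc + prefix(n) - prefix(i)


def sum_squares_baseline(n: int) -> Tuple[int, bool]:
    """Baseline loop; returns (result, whether an OSR transfer happened)."""
    acc = 0
    i = 0
    while i < n:
        if i == OSR_THRESHOLD:
            # OSR point: capture the live variables of the running frame and
            # continue in the optimised version instead of finishing the loop here.
            return sum_squares_optimised({"i": i, "acc": acc, "n": n}), True
        acc += i * i
        i += 1
    return acc, False


if __name__ == "__main__":
    print(sum_squares_baseline(10))    # (285, False): the loop never becomes hot
    print(sum_squares_baseline(5000))  # transfers at i == 1000; same sum as the plain loop
```

The essential requirement is that the captured state (`i`, `acc`, `n`) is exactly what the new version needs to produce the same result the old frame would have produced; a real virtual machine must also rebuild stack frames and machine-level state, which this sketch does not model.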

The Australian National University

Software/Hardware Co-Design and Co-Specialisation: Novel Simulation Techniques and Optimisations

Author: Rodchenko Andrey
Publication date: 01/08/2018

The University of Manchester - Institutional Repository

Quantifying and Predicting the Influence of Execution Platform on Software Component Performance

Author: Kuperberg Michael
Publication venue: KIT Scientific Publishing, Karlsruhe
Publication date: 01/01/2010

The performance of software components depends on several factors, including the execution platform on which they run. To simplify cross-platform performance prediction in relocation and sizing scenarios, this thesis introduces a novel approach that separates the application performance profile from the platform performance profile. The approach is evaluated using transparent instrumentation of Java applications and automated benchmarks for Java Virtual Machines.
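One simple way to picture the separation of the two profiles is a linear model: the application profile counts how often each operation is executed (independently of the platform), the platform profile gives the cost of each operation on a given platform, and the prediction combines them. The sketch below assumes exactly that linear model; the function name, the operation names and all numbers are hypothetical and are not taken from the thesis.

```python
from typing import Dict


def predict_time_ns(app_profile: Dict[str, int], platform_profile: Dict[str, float]) -> float:
    """Predict execution time by combining a platform-independent application
    profile (operation counts) with a platform profile (cost per operation, ns)."""
    missing = app_profile.keys() - platform_profile.keys()
    if missing:
        raise KeyError(f"platform profile lacks costs for: {sorted(missing)}")
    return sum(count * platform_profile[op] for op, count in app_profile.items())


# Hypothetical numbers, for illustration only.
app = {"int_add": 1_000_000, "object_alloc": 1_000}
platform_a = {"int_add": 1.0, "object_alloc": 50.0}
platform_b = {"int_add": 0.5, "object_alloc": 80.0}

if __name__ == "__main__":
    print(predict_time_ns(app, platform_a))  # 1050000.0
    print(predict_time_ns(app, platform_b))  # 580000.0
```

Because the application profile is measured once and reused, moving the component to another platform only requires that platform's cost profile, which is the benefit the separation aims for in relocation and sizing scenarios.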

KITopen

Enabling Pipeline Parallelism in Heterogeneous Managed Runtime Environments via Batch Processing

Author: Blanaru Florin-Gabriel
Fumero Alfonso Juan
Kotselidis Christos-Efthymios
Stratikopoulos Athanasios
Publication date: 04/02/2022

The University of Manchester - Institutional Repository

Speeding up dynamic compilation: concurrent and parallel dynamic compilation

Author: Bohm Igor
Publication venue: The University of Edinburgh
Publication date: 02/07/2013

The main challenge faced by a dynamic compilation system is to detect frequently executed program regions and translate them into highly efficient native code as fast as possible. To reduce dynamic compilation latency efficiently, a dynamic compilation system must improve its workload throughput, i.e. compile more application hotspots per unit of time. Because the time spent on dynamic compilation adds to the overall execution time, the dynamic compiler is often decoupled and run in a separate thread, independent of the main execution loop, to reduce the overhead of dynamic compilation. This thesis proposes techniques aimed at effectively speeding up dynamic compilation. The first contribution is a generalised region recording scheme optimised for program representations that require dynamic code discovery (e.g. binary program representations). The second contribution reduces dynamic compilation cost by incrementally compiling several hot regions in a concurrent and parallel task farm. Together, generalised light-weight code discovery, large translation units, dynamic work scheduling, and concurrent and parallel dynamic compilation ensure timely and efficient processing of compilation workloads. Compared to state-of-the-art dynamic compilation approaches, speedups of up to 2.08x are demonstrated on industry-standard benchmarks such as BioPerf, SPEC CPU2006 and EEMBC. Next, the thesis demonstrates applications of the proposed dynamic compilation scheme to speed up architectural and micro-architectural performance modelling. The main contribution in this context is to exploit runtime information to dynamically generate optimised code that accurately models architectural and micro-architectural components. Consequently, compilation units are larger and more complex, resulting in increased compilation latencies. Large and complex compilation units present an ideal use case for our concurrent and parallel dynamic compilation infrastructure.
We demonstrate that our novel micro-architectural performance modelling is faster than state-of-the-art FPGA-based simulation, while providing the same level of accuracy.
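The decoupled, concurrent compilation described above can be sketched with Python's standard thread pool acting as the task farm. This is a toy model under stated assumptions, not the thesis's system: `interpret_region`, `compile_region`, `DynamicCompiler` and the `hot_threshold` value are hypothetical, and the "compiled" code is just a Python closure. The point is the control flow: the main loop counts executions per region, enqueues a region for compilation once it becomes hot, and keeps interpreting it until a worker has finished, after which it switches to the compiled version. (Python's global interpreter lock limits real parallelism for CPU-bound work; a production system would use native threads.)

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Dict, Tuple

Compiled = Callable[[int], int]


def interpret_region(region_id: int, x: int) -> int:
    """Hypothetical slow path: 'interpret' a region (here it just adds its id)."""
    return x + region_id


def compile_region(region_id: int) -> Compiled:
    """Hypothetical compiler: returns a native-code stand-in with identical semantics."""
    return lambda x: x + region_id


class DynamicCompiler:
    def __init__(self, workers: int = 4, hot_threshold: int = 3) -> None:
        self.pool = ThreadPoolExecutor(max_workers=workers)  # the compilation task farm
        self.hot_threshold = hot_threshold
        self.counts: Dict[int, int] = {}
        self.pending: Dict[int, Future] = {}

    def execute(self, region_id: int, x: int) -> Tuple[int, str]:
        fut = self.pending.get(region_id)
        if fut is not None and fut.done():
            return fut.result()(x), "native"  # compiled code is ready
        count = self.counts.get(region_id, 0) + 1
        self.counts[region_id] = count
        if fut is None and count >= self.hot_threshold:
            # Region became hot: enqueue it and keep interpreting in the meantime.
            self.pending[region_id] = self.pool.submit(compile_region, region_id)
        return interpret_region(region_id, x), "interpreted"

    def close(self) -> None:
        self.pool.shutdown(wait=True)


if __name__ == "__main__":
    jit = DynamicCompiler()
    for _ in range(3):
        print(jit.execute(5, 10))  # (15, 'interpreted'); the third call enqueues compilation
    jit.pending[5].result()        # wait for the worker (only to make the demo deterministic)
    print(jit.execute(5, 10))      # (15, 'native')
    jit.close()
```

Because several hot regions can be submitted before any of them finishes, the pool compiles them in parallel, while the main loop never blocks on compilation.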

Edinburgh Research Archive

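Parallel Monte Carlo Simulations Applied to High Energy Physics for Manycore and Multicore Platforms: Development, Optimisation, Reproducibility (original title: Simulations parallèles de Monte Carlo appliquées à la Physique des Hautes Energies pour plates-formes manycore et multicore : mise au point, optimisation, reproductibilité)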

Author: Schweitzer Pierre
Publication venue: HAL CCSD
Publication date: 19/10/2015

During this thesis, we focused on High Performance Computing, specifically on Monte Carlo simulations applied to High Energy Physics. We worked on simulations dedicated to the propagation of particles through matter. Monte Carlo simulations require significant CPU time and memory. Our first Monte Carlo simulation took longer to simulate the physical phenomenon than the phenomenon itself takes to occur under experimental conditions, which was a serious performance problem. The minimal technical aim of the thesis was a simulation running in the same time as the real observed phenomenon; our maximal target was a much faster simulation. Indeed, these simulations are critical to verify that we correctly understand what is observed during experimentation: the more simulated statistical samples we have, the better our results. This initial state of the simulation therefore offered many opportunities for optimisation and high performance computing. Furthermore, in our case, increasing the performance of the simulation was pointless if it came at the cost of losing the reproducibility of the results, so the numerical reproducibility of the simulation was an aspect we had to take into account. In this manuscript, after a review of the state of the art in profiling, optimisation and reproducibility, we propose several strategies to gain more performance in our simulations. In each case, the proposed optimisations followed a profiling step: one never optimises without having profiled first. Then, we looked at the design of a parallel profiler using aspect-oriented programming for our specific needs. Finally, we took a new look at the issues raised by our Monte Carlo simulations: instead of optimising existing simulations, we proposed methods for developing a new simulation from scratch, designed from the outset for High Performance Computing and required to be statistically sound, reproducible and scalable.
Across all our proposals, we used both multicore and manycore Intel architectures to benchmark performance on a server-oriented architecture and on a High Performance Computing oriented architecture. By implementing our proposals, we were able to optimise one of the Monte Carlo simulations, achieving a 400x speedup once it was optimised and parallelised on a computing node with 32 physical cores. We also implemented an aspect-based profiler that can handle the parallelism both of the machine it runs on and of the application it profiles. Moreover, because it relies on aspects, it is portable and not tied to any specific architecture. Finally, we implemented the new simulation, designed to be reproducible, scalable and to produce statistically sound results. We observed that these goals were achieved regardless of the target architecture. This enabled us to validate our method for verifying the numerical reproducibility of a simulation.
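A common technique for making a parallel Monte Carlo computation reproducible regardless of how many workers run it is to split the work into a fixed set of tasks, give each task its own deterministically seeded random stream, and combine the results in a way that does not depend on completion order. The sketch below applies this to a toy estimate of pi; it is an illustration under stated assumptions, not the thesis's method. The function names and seed values are hypothetical, and seeding Python's `random.Random` with a string (which is hashed with SHA-512) is a convenience here, not a substitute for a generator designed for independent parallel streams.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def count_hits(base_seed: int, task_id: int, samples: int) -> int:
    """One independent task: count random points falling inside the unit quarter-circle."""
    # Each task derives its own generator from (base_seed, task_id), so its
    # random sequence does not depend on which worker runs it or when.
    rng = random.Random(f"{base_seed}:{task_id}")
    hits = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(base_seed: int = 2015, n_tasks: int = 8,
                samples_per_task: int = 10_000, workers: int = 4) -> float:
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in task order, whatever the completion order.
        hits = list(pool.map(lambda t: count_hits(base_seed, t, samples_per_task),
                             range(n_tasks)))
    # Integer accumulation is exact, so the reduction order cannot change the
    # result (unlike summing floating-point partial results).
    return 4.0 * sum(hits) / (n_tasks * samples_per_task)


if __name__ == "__main__":
    print(estimate_pi(workers=1) == estimate_pi(workers=4))  # True
```

The key design choice is that the decomposition into tasks (and their seeds) is fixed independently of the hardware, so changing the number of workers changes only the schedule, never the random numbers each task consumes.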

Thèses en Ligne

HAL-IN2P3

HAL Clermont Université

Co-Evolution of Source Code and the Build System: Impact on the Introduction of AOSD in Legacy Systems

Author: Adams Bram
Publication date: 01/01/2008

Software is omnipresent in our daily lives. As users demand ever more advanced features, software systems have to keep evolving. In practice, this means that software developers need to adapt the description of a software application. Such a description consists not only of source code written in a programming language: a lot of knowledge is hidden in lesser-known software development artifacts, such as the build system. As its name suggests, the build system is responsible for building an executable program, ready for use, from the source code. There are various indications that the evolution of source code is strongly related to that of the build system. When the source code changes, the build system has to co-evolve to safeguard the ability to build an executable program. Conversely, a rigid build system limits software developers. This phenomenon surfaces especially when drastic changes in the source code are coupled with an inflexible build system, as is the case for the introduction of AOSD (Aspect-Oriented Software Development) technology in legacy systems. AOSD is a young software development approach that enables developers to structure and compose source code in a better way. Legacy systems are old software systems that are still mission-critical, but whose source code and build system are no longer fully understood, and which typically rely on outdated technology. This PhD dissertation focuses on explaining this co-evolution of source code and the build system, and on providing developer support to grasp and manage the phenomenon. We postulate four "roots of co-evolution", which represent four different ways in which source code and the build system interact. Based on these roots, we have developed tool and aspect language support to understand and manage co-evolution.
The roots and the tool support have been validated in case studies, both in the context of co-evolution in general and of the introduction of AOSD technology in legacy systems. The dissertation shows experimentally that co-evolution is indeed a real problem, but that dedicated software development and aspect language support enable developers to deal with it.

Ghent University Academic Bibliography