565 research outputs found
Recommended from our members
Duplo: A framework for OCaml post-link optimisation
We present a novel framework,
Duplo
, for the low-level post-link optimisation of OCaml programs, achieving a speedup of 7% and a reduction of at least 15% of the code size of widely-used OCaml applications. Unlike existing post-link optimisers, which typically operate on target-specific machine code, our framework operates on a Low-Level Intermediate Representation (LLIR) capable of representing both the OCaml programs and any C dependencies they invoke through the foreign-function interface (FFI). LLIR is analysed, transformed and lowered to machine code by our post-link optimiser, LLIR-OPT. Most importantly, LLIR allows the optimiser to cross the OCaml-C language boundary, mitigating the overhead incurred by the FFI and enabling analyses and transformations in a previously unavailable context. The optimised IR is then lowered to amd64 machine code through the existing target-specific code generator of LLVM, modified to handle garbage collection just as effectively as the native OCaml backend. We equip our optimiser with a suite of SSA-based transformations and points-to analyses capable of capturing the semantics and representing the memory models of both languages, along with a cross-language inliner to embed C methods into OCaml callers. We evaluate the gains of our framework, which can be attributed to both our optimiser and the more sophisticated amd64 backend of LLVM, on a wide-range of widely-used OCaml applications, as well as an existing suite of micro- and macro-benchmarks used to track the performance of the OCaml compiler.
EPSRC EP/P020011/1, Cambridge Trust
Distilling the Real Cost of Production Garbage Collectors
Abridged abstract: despite the long history of garbage collection (GC) and
its prevalence in modern programming languages, there is surprisingly little
clarity about its true cost. Without understanding their cost, crucial
tradeoffs made by garbage collectors (GCs) go unnoticed. This can lead to
misguided design constraints and evaluation criteria used by GC researchers and
users, hindering the development of high-performance, low-cost GCs. In this
paper, we develop a methodology that allows us to empirically estimate the cost
of GC for any given set of metrics. By distilling out the explicitly
identifiable GC cost, we estimate the intrinsic application execution cost
using different GCs. The minimum distilled cost forms a baseline. Subtracting
this baseline from the total execution costs, we can then place an empirical
lower bound on the absolute costs of different GCs. Using this methodology, we
study five production GCs in OpenJDK 17, a high-performance Java runtime. We
measure the cost of these collectors, and expose their respective key
performance tradeoffs. We find that with a modestly sized heap, production GCs
incur substantial overheads across a diverse suite of modern benchmarks,
spending at least 7-82% more wall-clock time and 6-92% more CPU cycles relative
to the baseline cost. We show that these costs can be masked by concurrency and
generous provisioning of memory/compute. In addition, we find that newer
low-pause GCs are significantly more expensive than older GCs, and,
surprisingly, sometimes deliver worse application latency than stop-the-world
GCs. Our findings reaffirm that GC is by no means a solved problem and that a
low-cost, low-latency GC remains elusive. We recommend adopting the
distillation methodology together with a wider range of cost metrics for future
GC evaluations.Comment: Camera-ready versio
Virtual Machine Support for Many-Core Architectures: Decoupling Abstract from Concrete Concurrency Models
The upcoming many-core architectures require software developers to exploit
concurrency to utilize available computational power. Today's high-level
language virtual machines (VMs), which are a cornerstone of software
development, do not provide sufficient abstraction for concurrency concepts. We
analyze concrete and abstract concurrency models and identify the challenges
they impose for VMs. To provide sufficient concurrency support in VMs, we
propose to integrate concurrency operations into VM instruction sets.
Since there will always be VMs optimized for special purposes, our goal is to
develop a methodology to design instruction sets with concurrency support.
Therefore, we also propose a list of trade-offs that have to be investigated to
advise the design of such instruction sets.
As a first experiment, we implemented one instruction set extension for
shared memory and one for non-shared memory concurrency. From our experimental
results, we derived a list of requirements for a full-grown experimental
environment for further research
VMKit: a Substrate for Virtual Machines
Developing and optimizing a virtual machine (VM) is a tedious task that requires many years of development. Although VMs share some common principles, such as a Just In Time Compiler or a Garbage Collector, this opportunity for sharing hash not been yet exploited in implementing VMs. This paper describes and evaluates VMKit, a first attempt to build a common substrate that eases the development of high-level VMs. VMKit has been successfully used to build three VMs: a Java Virtual Machine, a Common Language Runtime and a lisp-like language with type inference uvml. Our precise contribution is an extensive study of the lessons learnt in implementing such common infrastructure from a performance and an ease of development standpoint. Our performance study shows that VMKit does not degrade performance on CPU-intensive applications, but still requires engineering efforts to compete with other VMs on memory intensive applications. Our high level VMs are only 20,000 lines of code, it took one of the author a month to develop a Common Language Runtime and implementing new ideas in the VMs was remarkably easy
Predictability of just in time compilation
The productivity of embedded software development is limited by the high fragmentation of hardware platforms. To alleviate this problem, virtualization has become an important tool in computer science; and virtual machines are used in a number of subdisciplines ranging from operating systems to processor architecture. The processor virtualization can be used to address the portability problem. While the traditional compilation flow consists of compiling program source code into binary objects that can natively executed on a given processor, processor virtualization splits that flow in two parts: the first part consists of compiling the program source code into processor-independent bytecode representation; the second part provides an execution platform that can run this bytecode in a given processor. The second part is done by a virtual machine interpreting the bytecode or by just-in-time (JIT) compiling the bytecodes of a method at run-time in order to improve the execution performance. Many applications feature real-time system requirements. The success of real-time systems relies upon their capability of producing functionally correct results within dened timing constraints. To validate these constraints, most scheduling algorithms assume that the worstcase execution time (WCET) estimation of each task is already known. The WCET of a task is the longest time it takes when it is considered in isolation. Sophisticated techniques are used in static WCET estimation (e.g. to model caches) to achieve both safe and tight estimation. Our work aims at recombining the two domains, i.e. using the JIT compilation in realtime systems. This is an ambitious goal which requires introducing the deterministic in many non-deterministic features, e.g. bound the compilation time and the overhead caused by the dynamic management of the compiled code cache, etc. Due to the limited time of the internship, this report represents a rst attempt to such combination. To obtain the WCET of a program, we have to add the compilation time to the execution time because the two phases are now mixed. Therefore, one needs to know statically how many times in the worst case a function will be compiled. It may be seemed a simple job, but if we consider a resource constraint as the limited memory size and the advanced techniques used in JIT compilation, things will be nasty. We suppose that a function is compiled at the rst time it is used, and its compiled code is cached in limited size software cache. Our objective is to find an appropriate structure cache and replacement policy which reduce the overhead of compilation in the worst case
Resource accounting and reservation in Java Virtual Machine
The Java platform The Java programming language was designed to developed small application for embedded devices, but it was a long time ago. Today, Java application are running in many platforms ranging from smartphones to enterprise servers. Modern pervasive middleware is typically implemented using Java because of its safety, flexibility, and mature development environment. However, the Java virtual machine specification has not had a major revised since 1999 Several researches had addressed these important issues. As result of these efforts some Java specification requests (JSR) have emerged. We consider there are seven JSRs relate to monitoring and to resource accounting and reservation: three for the Java Management eXtension API (JSRs 3, 160, 255), two for Metric Instrumentation (JSRs 138, 174) and two for resource-consumption management (JSRs 121, 284). The Java Management extension API only addresses monitoring and management: it does not define specific resource accounting or reservation strategies. JSRs 138 and 174 define monitors for the Java Virtual Machine. They are coarse grained, monitoring the number of running threads, the memory used, the number of garbage collections and so on. They monitor the entire virtual machine, not specific component so, they are useless to most middleware. Based on the Multitasking Virtual Machin
- …