14 research outputs found
Optimizing indirect branch prediction accuracy in virtual machine interpreters
Interpreters designed for efficiency execute a huge number of indirect branches and can spend more than half of their execution time in indirect branch mispredictions. Branch target buffers (BTBs) are the most widely available form of indirect branch prediction; however, their prediction accuracy for existing interpreters is only 2%–50%. In this paper we investigate two methods for improving the prediction accuracy of BTBs for interpreters: replicating virtual machine (VM) instructions and combining sequences of VM instructions into superinstructions. We investigate static (interpreter build-time) and dynamic (interpreter run-time) variants of these techniques and compare them and several combinations of these techniques. These techniques can eliminate nearly all of the dispatch branch mispredictions and have other benefits, resulting in speedups by a factor of up to 3.17 over efficient threaded-code interpreters, and speedups by a factor of up to 1.3 over techniques relying on superinstructions alone.
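The replication technique this abstract describes can be sketched in a toy threaded-code interpreter. The sketch below uses the GNU C computed-goto extension; the opcode names, the two-replica scheme, and the toy program are illustrative assumptions, not the paper's implementation:

```c
#include <assert.h>

/* Minimal threaded-code interpreter sketch (GNU C computed gotos)
   illustrating static replication: the ADD body exists twice, and the
   "translator" alternates between the copies, so each ADD occurrence
   in the program dispatches through a different indirect branch and
   therefore gets its own branch-target-buffer slot. */
long run(void) {
    long stack[8], *sp = stack;
    void *prog[16];
    void **ip = prog;
    void *add_replica[2] = { &&op_add_a, &&op_add_b };
    int flip = 0;

    /* translate: push 2; push 3; add; push 4; add; halt */
    prog[0] = &&op_push; prog[1] = (void *)2L;
    prog[2] = &&op_push; prog[3] = (void *)3L;
    prog[4] = add_replica[flip++ & 1];
    prog[5] = &&op_push; prog[6] = (void *)4L;
    prog[7] = add_replica[flip++ & 1];
    prog[8] = &&op_halt;

    goto **ip++;                 /* first dispatch */
op_push:
    *sp++ = (long)*ip++;         /* inline operand follows the opcode */
    goto **ip++;
op_add_a:                        /* replica A of ADD */
    sp[-2] += sp[-1]; --sp;
    goto **ip++;
op_add_b:                        /* replica B: identical semantics,
                                    distinct indirect branch */
    sp[-2] += sp[-1]; --sp;
    goto **ip++;
op_halt:
    return sp[-1];
}
```

With a single shared ADD body, one indirect branch would have to predict two different successors; with the replicas, each dispatch site sees a more stable target.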
Virtual Machine Support for Many-Core Architectures: Decoupling Abstract from Concrete Concurrency Models
The upcoming many-core architectures require software developers to exploit
concurrency to utilize available computational power. Today's high-level
language virtual machines (VMs), which are a cornerstone of software
development, do not provide sufficient abstraction for concurrency concepts. We
analyze concrete and abstract concurrency models and identify the challenges
they impose for VMs. To provide sufficient concurrency support in VMs, we
propose to integrate concurrency operations into VM instruction sets.
Since there will always be VMs optimized for special purposes, our goal is to
develop a methodology to design instruction sets with concurrency support.
Therefore, we also propose a list of trade-offs that have to be investigated to
guide the design of such instruction sets.
As a first experiment, we implemented one instruction set extension for
shared memory and one for non-shared memory concurrency. From our experimental
results, we derived a list of requirements for a fully fledged experimental
environment for further research.
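The idea of integrating a concurrency operation directly into a VM instruction set can be sketched as follows. This is a hypothetical illustration, not the authors' instruction set: a toy bytecode gains an OP_ATOMIC_INC instruction for shared-memory concurrency, whose implementation maps straight onto one atomic read-modify-write:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical sketch: a concurrency operation exposed as a VM
   instruction rather than a library call. Opcode names and the
   dispatch loop are illustrative assumptions. */
enum { OP_ATOMIC_INC, OP_HALT };

typedef struct {
    const int *code;        /* bytecode program */
    atomic_long *shared;    /* shared-memory cell */
} vm_thread_t;

static void *vm_run(void *arg) {
    vm_thread_t *vm = arg;
    for (const int *pc = vm->code; ; pc++) {
        switch (*pc) {
        case OP_ATOMIC_INC:            /* VM-level concurrency op */
            atomic_fetch_add(vm->shared, 1);
            break;
        case OP_HALT:
            return NULL;
        }
    }
}

/* Run the same program on two VM threads sharing one counter. */
long run_demo(void) {
    enum { N = 1000 };
    static int code[N + 1];
    for (int i = 0; i < N; i++) code[i] = OP_ATOMIC_INC;
    code[N] = OP_HALT;

    atomic_long counter = 0;
    vm_thread_t a = { code, &counter }, b = { code, &counter };
    pthread_t t1, t2;
    pthread_create(&t1, NULL, vm_run, &a);
    pthread_create(&t2, NULL, vm_run, &b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return counter;          /* 2 * N iff the instruction is atomic */
}
```

Making the operation an instruction rather than a runtime call is what lets the VM reason about, and specialize, its concurrency semantics.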
Repositioning Tiered HotSpot Execution Performance Relative to the Interpreter
Although the advantages of just-in-time compilation over traditional
interpretive execution are widely recognised, little current research has
investigated and repositioned the performance differences between these two
execution models relative to contemporary workloads. Specifically,
there is a need to examine the performance differences between Java Runtime
Environment (JRE) Java Virtual Machine (JVM) tiered execution and JRE JVM
interpretive execution relative to modern multicore architectures and modern
concurrent and parallel benchmark workloads. This article aims to fill this
research gap by presenting the results of a study that compares the performance
of these two execution models under load from the Renaissance Benchmark Suite.
This research is relevant to anyone interested in understanding the performance
differences between just-in-time compiled code and interpretive execution. It
provides a contemporary assessment of the interpretive JVM core, the entry and
starting point for bytecode execution, relative to just-in-time tiered
execution. The study considers factors such as the JRE version, the GNU GCC
version used in the JRE build toolchain, and the garbage collector algorithm
specified at runtime, and their impact on the performance difference envelope
between interpretive and tiered execution. Our findings indicate that tiered
execution is considerably more efficient than interpretive execution and that
the performance gap has widened: tiered execution now runs 4 to 37 times more
efficiently, and approximately 15 times more efficiently on average.
Additionally, the performance differences between
interpretive and tiered execution are influenced by workload category, with
narrower performance differences observed for web-based workloads and more
significant differences for functional and Scala-type workloads.
The Effect of Instruction Padding on SFI Overhead
Software-based fault isolation (SFI) is a technique to isolate a potentially
faulty or malicious software module from the rest of a system using
instruction-level rewriting. SFI implementations on CISC architectures,
including Google Native Client, use instruction padding to enforce an address
layout invariant and restrict control flow. However, this padding decreases code
density and imposes runtime overhead. We analyze this overhead and show that
it can be reduced by allowing some execution of overlapping instructions, as
long as those overlapping instructions are still safe according to the original
per-instruction policy. We implemented this change for both 32-bit and 64-bit
x86 versions of Native Client, and analyzed why the performance benefit is
higher on 32-bit. The optimization leads to a consistent decrease in the number
of instructions executed and savings averaging 8.6% in execution time (over
compatible benchmarks from SPECint2006) for x86-32. We describe how to modify
the validation algorithm to check the more permissive policy, and extend a
machine-checked Coq proof to confirm that the system's security is preserved. Comment: NDSS Workshop on Binary Analysis Research, February 201
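Where the padding cost described above comes from can be sketched with a toy layout pass. The 32-byte bundle is from the Native Client design; the function itself is an illustrative assumption, not NaCl's assembler:

```c
#include <assert.h>
#include <stddef.h>

/* Toy layout pass for NaCl-style SFI on a CISC architecture: no
   instruction may straddle a 32-byte bundle boundary, so NOP padding
   is inserted before any instruction that would. The returned size
   minus the raw instruction bytes is the code-density cost. */
#define BUNDLE 32

/* Given variable x86 instruction lengths, return total code size
   after bundle-aligning padding is inserted. */
size_t sfi_layout(const unsigned char *len, size_t n) {
    size_t addr = 0;
    for (size_t i = 0; i < n; i++) {
        size_t off = addr % BUNDLE;
        if (off + len[i] > BUNDLE)      /* would cross a boundary */
            addr += BUNDLE - off;       /* pad with NOPs to next bundle */
        addr += len[i];
    }
    return addr;
}
```

For example, three 12-byte instructions occupy 36 raw bytes, but the third would start at offset 24 and cross the boundary, so 8 NOP bytes are inserted and the padded size is 44. Relaxing the policy, as the paper does, means tolerating some boundary-crossing (overlapping) instructions when they are individually safe, recovering part of this padding.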
Optimizing Indirect Branch Prediction Accuracy in Virtual Machine Interpreters
Interpreters designed for efficiency execute a huge number of indirect branches and can spend more
than half of the execution time in indirect branch mispredictions. Branch target buffers (BTBs) are
the most widely available form of indirect branch prediction; however, their prediction accuracy for
existing interpreters is only 2%–50%. In this paper we investigate two methods for improving the
prediction accuracy of BTBs for interpreters: replicating virtual machine (VM) instructions and
combining sequences of VM instructions into superinstructions. We investigate static (interpreter
build-time) and dynamic (interpreter run-time) variants of these techniques and compare them
and several combinations of these techniques. To show their generality, we have implemented
these optimizations in VMs for both Java and Forth. These techniques can eliminate nearly all of
the dispatch branch mispredictions, and have other benefits, resulting in speedups by a factor of
up to 4.55 over efficient threaded-code interpreters, and speedups by a factor of up to 1.34 over
techniques relying on dynamic superinstructions alone.
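The superinstruction technique can be sketched as a peephole pass over toy bytecode. Opcode names and the fused pair below are illustrative assumptions, not the paper's VM:

```c
#include <assert.h>
#include <stddef.h>

/* Sketch of static superinstructions: a peephole pass rewrites the
   common sequence PUSH n; ADD into a single PUSH_ADD n opcode,
   removing one dispatch (one mispredictable indirect branch) per
   combined pair. */
enum { OP_PUSH, OP_ADD, OP_PUSH_ADD, OP_HALT };

/* Combine PUSH;ADD pairs in place; returns the new program length. */
size_t combine(int *code, size_t n) {
    size_t w = 0;
    for (size_t r = 0; r < n; ) {
        if (code[r] == OP_PUSH) {
            if (r + 2 < n && code[r + 2] == OP_ADD) {
                code[w++] = OP_PUSH_ADD;   /* fused opcode */
                code[w++] = code[r + 1];   /* keep the operand */
                r += 3;
            } else {
                code[w++] = code[r++];     /* copy PUSH and operand */
                code[w++] = code[r++];
            }
        } else {
            code[w++] = code[r++];
        }
    }
    return w;
}

long run(const int *code) {
    long stack[16], *sp = stack;
    for (size_t pc = 0; ; ) {
        switch (code[pc++]) {
        case OP_PUSH:     *sp++ = code[pc++];            break;
        case OP_ADD:      sp[-2] += sp[-1]; --sp;        break;
        case OP_PUSH_ADD: sp[-1] += code[pc++];          break;
        case OP_HALT:     return sp[-1];
        }
    }
}
```

The fused body also removes a stack push/pop round trip, which is one of the "other benefits" the abstract alludes to beyond better prediction.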