8 research outputs found
Automated detection of structured coarse-grained parallelism in sequential legacy applications
The efficient execution of sequential legacy applications on modern, parallel computer
architectures is one of today's most pressing problems. Automatic parallelization
has been investigated as a potential solution for several decades but its success
generally remains restricted to small niches of regular, array-based applications.
This thesis investigates two techniques that have the potential to overcome these
limitations.
Beginning at the lowest level of abstraction, the binary executable, it presents
a study of the limits of Dynamic Binary Parallelization (DBP), a recently proposed
technique that takes advantage of an underlying multicore host to transparently
parallelize a sequential binary executable. While still in its infancy, DBP has received
broad interest within the research community. This thesis seeks to gain an
understanding of the factors contributing to the limits of DBP and the costs and
overheads of its implementation. An extensive evaluation using a parameterizable
DBP system targeting a CMP with lightweight architectural TLS support is presented.
The results show that there is room for a significant reduction of up to 54%
in the number of instructions on the critical paths of legacy SPEC CPU2006 benchmarks,
but that it is much harder to translate these savings into actual performance
improvements, with a realistic hardware-supported implementation achieving a
speedup of 1.09 on average.
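The speculation mechanism that DBP relies on can be illustrated with a toy software model. The sketch below is a deliberate simplification, not the thesis's hardware-supported TLS: the function name `run_speculative` and the convention that each iteration returns its read and write sets are invented for illustration. Iterations execute optimistically against a snapshot of memory; a cross-iteration read-after-write conflict squashes the speculative results and falls back to sequential re-execution.

```python
def run_speculative(iterations, memory):
    # Toy thread-level-speculation scheme (illustrative only).
    # Each iteration is a function taking a memory snapshot and
    # returning (read_set, write_dict) for that iteration.
    snapshot = dict(memory)
    results = [it(dict(snapshot)) for it in iterations]  # "parallel" phase

    # Validation: a later iteration must not have read a location
    # written by an earlier one, since it saw the stale snapshot value.
    written = set()
    conflict = False
    for reads, writes in results:
        if reads & written:
            conflict = True
            break
        written |= writes.keys()

    if conflict:
        # Squash all speculative work; re-execute in program order.
        for it in iterations:
            _, writes = it(memory)
            memory.update(writes)
    else:
        # Commit speculative writes in iteration order.
        for _, writes in results:
            memory.update(writes)
    return memory, not conflict
```

Independent iterations commit speculatively; a dependent pair (the second reading what the first wrote) triggers the sequential fallback, mirroring why may-dependences cap the achievable speedup.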
While automatically parallelizing compilers have traditionally focused on data
parallelism, additional parallelism exists in a plethora of other shapes such as task
farms, divide & conquer, map/reduce and many more. These algorithmic skeletons,
i.e. high-level abstractions for commonly used patterns of parallel computation,
differ substantially from data parallel loops. Unfortunately, algorithmic skeletons
are largely informal programming abstractions and lack a formal characterization
in terms of established compiler concepts. This thesis develops compiler-friendly
characterizations of popular algorithmic skeletons using a novel notion of
commutativity based on liveness. It describes a hybrid static/dynamic analysis
framework for the context-sensitive detection of skeletons in legacy code, which
overcomes the limitations of static analysis by complementing it with profiling information.
A proof-of-concept implementation of this framework in the LLVM compiler infrastructure
is evaluated against SPEC CPU2006 benchmarks for the detection of a typical skeleton. The results illustrate that skeletons are often context-sensitive in
nature.
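As a concrete illustration of one such skeleton (example code, not taken from the thesis), consider the task farm: a sequential loop whose iterations commute because each reads only its own input and writes only its own result slot, and its parallel equivalent that fans the same work out to a pool of workers.

```python
from concurrent.futures import ThreadPoolExecutor

def sequential_farm(worker, tasks):
    # The sequential shape a skeleton detector would classify as a
    # task farm: iterations are independent, so they commute.
    results = []
    for t in tasks:
        results.append(worker(t))
    return results

def task_farm(worker, tasks, n_workers=4):
    # The equivalent parallel skeleton: fan tasks out to a worker
    # pool and collect the results in iteration order.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(worker, tasks))
```

Both forms produce identical results for a pure `worker`, which is exactly the liveness-based commutativity property a detector must establish before the transformation is safe.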
Like the two approaches presented in this thesis, many dynamic parallelization
techniques exploit the fact that some statically detected data and control
flow dependences do not manifest themselves in every possible program execution
(may-dependences) but occur only infrequently, e.g. for some corner cases, or not
at all for any legal program input. While the effectiveness of dynamic parallelization
techniques critically depends on the absence of such dependences, not much
is known about their nature. This thesis presents an empirical analysis and characterization
of the variability of both data dependences and control flow across
program runs. The cBench benchmark suite is run with 100 randomly chosen
input data sets to generate whole-program control and data flow graphs (CDFGs)
for each run, which are then compared to obtain a measure of the variance in the
observed control and data flow. The results show that, on average, the cumulative
profile information gathered with at least 55, and up to 100, different input data
sets is needed to achieve full coverage of the data flow observed across all runs.
For control flow, the figure stands at 46 and 100 data sets, respectively. This suggests
that profile-guided parallelization needs to be applied with utmost care, as
misclassification of sequential loops as parallel was observed even when up to 94
input data sets were used.
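The coverage measure used in this analysis can be made precise with a small helper (a hypothetical sketch, not the thesis's tooling): treat each run as the set of dependence or control-flow edges it exercised, and count how many runs the cumulative union needs before it equals the union over all runs.

```python
def runs_to_full_coverage(edge_sets):
    # edge_sets: one set of observed dependence (or control-flow)
    # edges per program run. Returns how many runs, taken in order,
    # are needed before the cumulative union covers every edge
    # observed across all runs.
    full = set().union(*edge_sets)
    seen = set()
    for i, edges in enumerate(edge_sets, start=1):
        seen |= edges
        if seen == full:
            return i
    return len(edge_sets)
```

A result of 55 out of 100 runs under this measure means nearly half the inputs still contributed no new edges only after most of the suite had been profiled, which is why a handful of training inputs is not a safe basis for parallelization decisions.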
Generalized just-in-time trace compilation using a parallel task farm in a dynamic binary translator
Dynamic Binary Translation (DBT) is the key technology behind cross-platform virtualization and allows software compiled for one Instruction Set Architecture (ISA) to be executed on a processor supporting a different ISA. Under the hood, DBT is typically implemented using Just-In-Time (JIT) compilation of frequently executed program regions, also called traces. The main challenge is translating frequently executed program regions as fast as possible into highly efficient native code. As time for JIT compilation adds to the overall execution time, the JIT compiler is often decoupled and operates in a separate thread independent from the main simulation loop to reduce the overhead of JIT compilation. In this paper we present two innovative contributions. The first contribution is a generalized trace compilation approach that considers all frequently executed paths in a program for JIT compilation
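The decoupling described above can be sketched as a producer/consumer arrangement. The class and threshold names below are invented for illustration, and where the paper's DBT emits native code this toy merely wraps the trace in a closure: the main loop counts trace executions and, once a trace turns hot, hands it to a farm of compiler threads so that translation overlaps with continued interpretation.

```python
import queue
import threading

HOT_THRESHOLD = 2  # hypothetical hotness threshold

def make_native(trace):
    # Stand-in for JIT code generation: "compile" a trace (a list of
    # opcodes) into a Python closure instead of native code.
    ops = list(trace)
    return lambda: ops

class DecoupledJIT:
    """Minimal sketch of a decoupled JIT with a parallel task farm:
    the main loop enqueues hot traces, and worker threads compile
    them concurrently, off the critical path."""

    def __init__(self, n_workers=2):
        self.jobs = queue.Queue()
        self.code_cache = {}   # trace key -> compiled code
        self.counts = {}       # trace key -> execution count
        for _ in range(n_workers):
            threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        # Compiler thread: drain the job queue, filling the code cache.
        while True:
            key, trace = self.jobs.get()
            self.code_cache[key] = make_native(trace)
            self.jobs.task_done()

    def record(self, key, trace):
        # Called from the main simulation loop on each trace execution;
        # hands the trace off exactly once when it becomes hot, without
        # blocking the interpreter.
        self.counts[key] = self.counts.get(key, 0) + 1
        if self.counts[key] == HOT_THRESHOLD:
            self.jobs.put((key, trace))

    def wait(self):
        # Block until all pending compilation jobs have finished.
        self.jobs.join()
```

The interpreter keeps executing the original trace until the compiled version appears in `code_cache`, so compilation latency is hidden rather than added to the simulated program's run time.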
A Parallel Dynamic Binary Translator for Efficient Multi-Core Simulation
In recent years multi-core processors have seen broad adoption in application domains ranging from embedded systems through general-purpose computing to large-scale data centres. Simulation technology for multi-core systems, however, lags behind and does not provide the simulation speed required to effectively support design space exploration and parallel software development. While state-of-the-art instruction set simulators (ISS) for single-core machines reach or exceed the performance levels of speed-optimised silicon implementations of embedded processors, the same does not hold for multi-core simulators, where large performance penalties are to be paid. In this paper we develop a fast and scalable simulation methodology for multi-core platforms based on parallel and just-in-time (JIT) dynamic binary translation (DBT). Our approach can model large-scale multi-core configurations, does not rely on prior profiling, instrumentation, or compilation, and works for all binaries targeting a state-of-the-art embedded multi-core platform implementing the ARCompact instruction set architecture (ISA). We have evaluated our parallel simulation methodology against the industry standard SPLASH-2 and EEMBC MultiBench benchmarks and demonstrate simulation speeds of up to 25,307 MIPS on a 32-core x86 host machine for as many as 2,048 target processors whilst exhibiting minimal and near constant overhead, including memory considerations.