168 research outputs found
Recommended from our members
Automatically accelerating non-numerical programs by architecture-compiler co-design
Because of the high cost of communication between processors, compilers that parallelize loops automatically have been forced to skip a large class of loops that are both critical to performance and rich in latent parallelism. HELIX-RC is a compiler/microprocessor co-design that opens those loops to parallelization by decoupling communication from thread execution in conventional multicore architecures. Simulations of HELIX-RC, applied to a processor with 16 Intel Atom-like cores, show an average of 6.85× performance speedup for six SPEC CINT2000 benchmarks.</jats:p
Mitosis based speculative multithreaded architectures
In the last decade, industry made a right-hand turn and shifted towards multi-core processor designs, also known as Chip-Multi-Processors (CMPs), in order to provide further performance improvements under a reasonable power budget, design complexity, and validation cost. Over the years, several processor vendors have come out with multi-core chips in their product lines and they have become mainstream,
with the number of cores increasing in each processor generation. Multi-core processors improve the performance of applications by exploiting Thread Level Parallelism (TLP) while the Instruction Level Parallelism (ILP) exploited by each individual core is limited. These architectures are very efficient when multiple threads are available for execution. However, single-thread sections of code (single-thread
applications and serial sections of parallel applications) pose important constraints on the benefits achieved by parallel execution, as pointed out by Amdahl’s law.
Parallel programming, even with the help of recently proposed techniques like transactional memory, has proven to be a very challenging task. On the other hand, automatically partitioning applications into threads may be a straightforward task in regular applications, but becomes much harder for irregular programs, where compilers usually fail to discover sufficient TLP. In this scenario, two main
directions have been followed in the research community to take benefit of multi-core platforms: Speculative Multithreading (SpMT) and Non-Speculative Clustered architectures. The former splits a sequential application into speculative threads, while the later partitions the instructions among the cores based on data-dependences but avoid large degree of speculation. Despite the large amount of research on
both these approaches, the proposed techniques so far have shown marginal performance improvements.
In this thesis we propose novel schemes to speed-up sequential or lightly threaded applications in multi-core processors that effectively address the main unresolved challenges of previous approaches. In particular, we propose a SpMT architecture, called Mitosis, that leverages a powerful software value prediction technique to manage inter-thread dependences, based on pre-computation slices (p-slices).
Thanks to the accuracy and low cost of this technique, Mitosis is able to effectively parallelize applications even in the presence of frequent dependences among threads. We also propose a novel architecture, called Anaphase, that combines the best of SpMT schemes and clustered architectures. Anaphase effectively exploits ILP, TLP and Memory Level Parallelism (MLP), thanks to its unique finegrain thread decomposition algorithm that adapts to the available parallelism in the application
Mitosis based speculative multithreaded architectures
In the last decade, industry made a right-hand turn and shifted towards multi-core processor designs, also known as Chip-Multi-Processors (CMPs), in order to provide further performance improvements under a reasonable power budget, design complexity, and validation cost. Over the years, several processor vendors have come out with multi-core chips in their product lines and they have become mainstream,
with the number of cores increasing in each processor generation. Multi-core processors improve the performance of applications by exploiting Thread Level Parallelism (TLP) while the Instruction Level Parallelism (ILP) exploited by each individual core is limited. These architectures are very efficient when multiple threads are available for execution. However, single-thread sections of code (single-thread
applications and serial sections of parallel applications) pose important constraints on the benefits achieved by parallel execution, as pointed out by Amdahl’s law.
Parallel programming, even with the help of recently proposed techniques like transactional memory, has proven to be a very challenging task. On the other hand, automatically partitioning applications into threads may be a straightforward task in regular applications, but becomes much harder for irregular programs, where compilers usually fail to discover sufficient TLP. In this scenario, two main
directions have been followed in the research community to take benefit of multi-core platforms: Speculative Multithreading (SpMT) and Non-Speculative Clustered architectures. The former splits a sequential application into speculative threads, while the later partitions the instructions among the cores based on data-dependences but avoid large degree of speculation. Despite the large amount of research on
both these approaches, the proposed techniques so far have shown marginal performance improvements.
In this thesis we propose novel schemes to speed-up sequential or lightly threaded applications in multi-core processors that effectively address the main unresolved challenges of previous approaches. In particular, we propose a SpMT architecture, called Mitosis, that leverages a powerful software value prediction technique to manage inter-thread dependences, based on pre-computation slices (p-slices).
Thanks to the accuracy and low cost of this technique, Mitosis is able to effectively parallelize applications even in the presence of frequent dependences among threads. We also propose a novel architecture, called Anaphase, that combines the best of SpMT schemes and clustered architectures. Anaphase effectively exploits ILP, TLP and Memory Level Parallelism (MLP), thanks to its unique finegrain thread decomposition algorithm that adapts to the available parallelism in the application.Postprint (published version
Architectural and Complier Mechanisms for Accelerating Single Thread Applications on Mulitcore Processors.
Multicore systems have become the dominant mainstream computing platform. One of the biggest challenges going forward is how to efficiently utilize the ever increasing computational power provided by multicore systems. Applications with large amounts of explicit thread-level parallelism naturally scale performance with the number of cores. However, single-thread applications realize little to no gains from multicore systems.
This work investigates architectural and compiler mechanisms to automatically accelerate single thread applications on multicore processors by efficiently exploiting three types of parallelism across multiple cores: instruction level parallelism (ILP), fine-grain thread level parallelism (TLP), and speculative loop level parallelism (LLP).
A multicore architecture called Voltron is proposed to exploit different types of parallelism. Voltron can organize the cores for execution in either coupled or decoupled mode. In coupled mode, several in-order cores are coalesced to emulate a wide-issue VLIW processor. In decoupled mode, the cores execute a set of fine-grain communicating threads extracted by the compiler. By executing fine-grain threads in parallel, Voltron provides coarse-grained out-of-order execution capability using in-order cores. Architectural mechanisms for speculative execution of loop iterations are also supported under the decoupled mode. Voltron can dynamically switch between two modes with low overhead to exploit the best form of available parallelism.
This dissertation also investigates compiler techniques to exploit different types of parallelism on the proposed architecture. First, this work proposes compiler techniques to manage multiple instruction streams to collectively function as a single logical stream on a conventional VLIW to exploit ILP. Second, this work studies compiler algorithms to extract fine-grain threads. Third, this dissertation proposes a series of systematic compiler transformations and a general code generation framework to expose hidden speculative LLP hindered by register and memory dependences in the code. These transformations collectively remove inter-iteration dependences that are caused by subsets of isolatable instructions, are unwindable, or occur infrequently.
Experimental results show that proposed mechanisms can achieve speedups of 1.33 and 1.14 on 4 core machines by exploiting ILP and TLP respectively. The proposed transformations increase the DOALL loop coverage in applications from 27% to 61%, resulting in a speedup of 1.84 on 4 core systems.Ph.D.Computer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/58419/1/hongtaoz_1.pd
Identifying, Quantifying, Extracting and Enhancing Implicit Parallelism
The shift of the microprocessor industry towards multicore architectures has
placed a huge burden on the programmers by requiring explicit parallelization
for performance. Implicit Parallelization is an alternative that could ease the
burden on programmers by parallelizing applications ???under the covers??? while
maintaining sequential semantics externally. This thesis develops a novel
approach for thinking about parallelism, by casting the problem of
parallelization in terms of instruction criticality. Using this approach,
parallelism in a program region is readily identified when certain conditions
about fetch-criticality are satisfied by the region. The thesis formalizes this
approach by developing a criticality-driven model of task-based
parallelization. The model can accurately predict the parallelism that would be
exposed by potential task choices by capturing a wide set of sources of
parallelism as well as costs to parallelization.
The criticality-driven model enables the development of two key components for
Implicit Parallelization: a task selection policy, and a bottleneck analysis
tool. The task selection policy can partition a single-threaded program into
tasks that will profitably execute concurrently on a multicore architecture in
spite of the costs associated with enforcing data-dependences and with
task-related actions. The bottleneck analysis tool gives feedback to the
programmers about data-dependences that limit parallelism. In particular, there
are several ???accidental dependences??? that can be easily removed with large
improvements in parallelism. These tools combine into a systematic methodology
for performance tuning in Implicit Parallelization. Finally, armed with the
criticality-driven model, the thesis revisits several architectural design
decisions, and finds several encouraging ways forward to increase the scope of
Implicit Parallelization.unpublishednot peer reviewe
Indexed dependence metadata and its applications in software performance optimisation
To achieve continued performance improvements, modern microprocessor design is tending to concentrate
an increasing proportion of hardware on computation units with less automatic management
of data movement and extraction of parallelism. As a result, architectures increasingly include multiple
computation cores and complicated, software-managed memory hierarchies. Compilers have
difficulty characterizing the behaviour of a kernel in a general enough manner to enable automatic
generation of efficient code in any but the most straightforward of cases.
We propose the concept of indexed dependence metadata to improve application development and
mapping onto such architectures. The metadata represent both the iteration space of a kernel and the
mapping of that iteration space from a given index to the set of data elements that iteration might
use: thus the dependence metadata is indexed by the kernel’s iteration space. This explicit mapping
allows the compiler or runtime to optimise the program more efficiently, and improves the program
structure for the developer. We argue that this form of explicit interface specification reduces the need
for premature, architecture-specific optimisation. It improves program portability, supports intercomponent
optimisation and enables generation of efficient data movement code.
We offer the following contributions: an introduction to the concept of indexed dependence metadata
as a generalisation of stream programming, a demonstration of its advantages in a component
programming system, the decoupled access/execute model for C++ programs, and how indexed dependence
metadata might be used to improve the programming model for GPU-based designs. Our
experimental results with prototype implementations show that indexed dependence metadata supports
automatic synthesis of double-buffered data movement for the Cell processor and enables aggressive
loop fusion optimisations in image processing, linear algebra and multigrid application case
studies
Efficient runtime systems for speculative parallelization
Manuelle Parallelisierung ist zeitaufwändig und fehleranfällig. Automatische Parallelisierung andererseits findet häufig nur einen Bruchteil der verfügbaren Parallelität. Mithilfe von Spekulation kann jedoch auch für komplexere Programme ein Großteil der Parallelität ausgenutzt werden. Spekulativ parallelisierte Programme benötigen zur Ausführung immer ein Laufzeitsystem, um die spekulativen Annahmen abzusichern und für den Fall des Nichtzutreffens die korrekte Ausführungssemantik sicherzustellen. Solche Laufzeitsysteme sollen die Ausführungszeit des parallelen Programms so wenig wie möglich beeinflussen. In dieser Arbeit untersuchen wir, inwiefern aktuelle Systeme, die Speicherzugriffe explizit und in Software beobachten, diese Anforderung erfüllen, und stellen Änderungen vor, die die Laufzeit massiv verbessern. Außerdem entwerfen wir zwei neue Systeme, die mithilfe von virtueller Speicherverwaltung das Programm indirekt beobachten und dadurch eine deutlich geringere Auswirkung auf die Laufzeit haben. Eines der vorgestellten Systeme ist mittels eines Moduls direkt in den Linux-Betriebssystemkern integriert und bietet so die bestmögliche Effizienz. Darüber hinaus bietet es weitreichendere Sicherheitsgarantien als alle bisherigen Techniken, indem sogar Systemaufrufe zum Beispiel zur Datei Ein- und Ausgabe in der spekulativen Isolation mit eingeschlossen sind. Wir zeigen an einer Reihe von Benchmarks die Überlegenheit unserer Spekulationssyteme über den derzeitigen Stand der Technik. Sämtliche unserer Erweiterungen und Neuentwicklungen stehen als open source zur freien Verfügung. Diese Arbeit ist in englischer Sprache verfasst.Manual parallelization is time consuming and error-prone. Automatic parallelization on the other hand is often unable to extract substantial parallelism. Using speculation, however, most of the parallelism can be exploited even of complex programs. Speculatively parallelized programs always need a runtime system during execution in order to ensure the validity of the speculative assumptions, and to ensure the correct semantics even in the case of misspeculation. These runtime systems should influence the execution time of the parallel program as little as possible. In this thesis, we investigate to which extend state-of-the-art systems which track memory accesses explicitly in software fulfill this requirement. We describe and implement changes which improve their performance substantially. We also design two new systems utilizing virtual memory abstraction to track memory changed implicitly, thus causing less overhead during execution. One of the new systems is integrated into the Linux kernel as a kernel module, providing the best possible performance. Furthermore it provides stronger soundness guarantees than any state-of-the-art system by also capturing system calls, hence including for example file I/O into speculative isolation. In a number of benchmarks we show the performance improvements of our virtual memory based systems over the state of the art. All our extensions and newly developed speculation systems are made available as open source
Automated detection of structured coarse-grained parallelism in sequential legacy applications
The efficient execution of sequential legacy applications on modern, parallel computer
architectures is one of today’s most pressing problems. Automatic parallelization
has been investigated as a potential solution for several decades but its success
generally remains restricted to small niches of regular, array-based applications.
This thesis investigates two techniques that have the potential to overcome these
limitations.
Beginning at the lowest level of abstraction, the binary executable, it presents
a study of the limits of Dynamic Binary Parallelization (Dbp), a recently proposed
technique that takes advantage of an underlying multicore host to transparently
parallelize a sequential binary executable. While still in its infancy, Dbp has received
broad interest within the research community. This thesis seeks to gain an
understanding of the factors contributing to the limits of Dbp and the costs and
overheads of its implementation. An extensive evaluation using a parameterizable
Dbp system targeting a Cmp with light-weight architectural Tls support is presented.
The results show that there is room for a significant reduction of up to 54%
in the number of instructions on the critical paths of legacy Spec Cpu2006 benchmarks,
but that it is much harder to translate these savings into actual performance
improvements, with a realistic hardware-supported implementation achieving a
speedup of 1.09 on average.
While automatically parallelizing compilers have traditionally focused on data
parallelism, additional parallelism exists in a plethora of other shapes such as task
farms, divide & conquer, map/reduce and many more. These algorithmic skeletons,
i.e. high-level abstractions for commonly used patterns of parallel computation,
differ substantially from data parallel loops. Unfortunately, algorithmic skeletons
are largely informal programming abstractions and are lacking a formal characterization
in terms of established compiler concepts. This thesis develops compiler-friendly
characterizations of popular algorithmic skeletons using a novel notion of
commutativity based on liveness. A hybrid static/dynamic analysis framework for
the context-sensitive detection of skeletons in legacy code that overcomes limitations
of static analysis by complementing it with profiling information is described.
A proof-of-concept implementation of this framework in the Llvm compiler infrastructure
is evaluated against Spec Cpu2006 benchmarks for the detection of a typical skeleton. The results illustrate that skeletons are often context-sensitive in
nature.
Like the two approaches presented in this thesis, many dynamic parallelization
techniques exploit the fact that some statically detected data and control
flow dependences do not manifest themselves in every possible program execution
(may-dependences) but occur only infrequently, e.g. for some corner cases, or not
at all for any legal program input. While the effectiveness of dynamic parallelization
techniques critically depends on the absence of such dependences, not much
is known about their nature. This thesis presents an empirical analysis and characterization
of the variability of both data dependences and control flow across
program runs. The cBench benchmark suite is run with 100 randomly chosen
input data sets to generate whole-program control and data flow graphs (Cdfgs)
for each run, which are then compared to obtain a measure of the variance in the
observed control and data flow. The results show that, on average, the cumulative
profile information gathered with at least 55, and up to 100, different input data
sets is needed to achieve full coverage of the data flow observed across all runs.
For control flow, the figure stands at 46 and 100 data sets, respectively. This suggests
that profile-guided parallelization needs to be applied with utmost care, as
misclassification of sequential loops as parallel was observed even when up to 94
input data sets are used
- …