981 research outputs found
The hArtes Tool Chain
This chapter describes the different design steps needed to go from legacy code to a transformed application that can be efficiently mapped on the hArtes platform
Interval simulation: raising the level of abstraction in architectural simulation
Detailed architectural simulators suffer from a long development cycle and extremely long evaluation times. This longstanding problem is further exacerbated in the multi-core processor era. Existing solutions address the simulation problem by either sampling the simulated instruction stream or by mapping the simulation models on FPGAs; these approaches achieve substantial simulation speedups while simulating performance in a cycle-accurate manner This paper proposes interval simulation which rakes a completely different approach: interval simulation raises the level of abstraction and replaces the core-level cycle-accurate simulation model by a mechanistic analytical model. The analytical model estimates core-level performance by analyzing intervals, or the timing between two miss events (branch mispredictions and TLB/cache misses); the miss events are determined through simulation of the memory hierarchy, cache coherence protocol, interconnection network and branch predictor By raising the level of abstraction, interval simulation reduces both development time and evaluation time. Our experimental results using the SPEC CPU2000 and PARSEC benchmark suites and the MS multi-core simulator show good accuracy up to eight cores (average error of 4.6% and max error of 11% for the multi-threaded full-system workloads), while achieving a one order of magnitude simulation speedup compared to cycle-accurate simulation. Moreover interval simulation is easy to implement: our implementation of the mechanistic analytical model incurs only one thousand lines of code. Its high accuracy, fast simulation speed and ease-of-use make interval simulation a useful complement to the architect's toolbox for exploring system-level and high-level micro-architecture trade-offs
An overview of the ciao multiparadigm language and program development environment and its design philosophy
We describe some of the novel aspects and motivations behind
the design and implementation of the Ciao multiparadigm programming system. An important aspect of Ciao is that it provides the programmer with a large number of useful features from different programming paradigms and styles, and that the use of each of these features can be turned on and off at will for each program module. Thus, a given module may be using e.g. higher order functions and constraints, while another module may be using objects, predicates, and concurrency. Furthermore, the language is designed to be extensible in a simple and modular way. Another important aspect of Ciao is its programming environment, which provides a powerful preprocessor (with an associated assertion language) capable of statically finding non-trivial bugs, verifying that programs comply with specifications, and performing many types of program optimizations. Such optimizations produce code that is highly competitive with other dynamic languages or, when the highest levéis of optimization are used, even that of static languages, all while retaining the interactive development environment of a dynamic language. The environment also includes a powerful auto-documenter. The paper provides an informal overview of the language and program development environment. It aims at illustrating the design philosophy rather than at being exhaustive, which would be impossible in the format of a paper, pointing instead to the existing literature on the system
Identifying, Quantifying, Extracting and Enhancing Implicit Parallelism
The shift of the microprocessor industry towards multicore architectures has
placed a huge burden on the programmers by requiring explicit parallelization
for performance. Implicit Parallelization is an alternative that could ease the
burden on programmers by parallelizing applications ???under the covers??? while
maintaining sequential semantics externally. This thesis develops a novel
approach for thinking about parallelism, by casting the problem of
parallelization in terms of instruction criticality. Using this approach,
parallelism in a program region is readily identified when certain conditions
about fetch-criticality are satisfied by the region. The thesis formalizes this
approach by developing a criticality-driven model of task-based
parallelization. The model can accurately predict the parallelism that would be
exposed by potential task choices by capturing a wide set of sources of
parallelism as well as costs to parallelization.
The criticality-driven model enables the development of two key components for
Implicit Parallelization: a task selection policy, and a bottleneck analysis
tool. The task selection policy can partition a single-threaded program into
tasks that will profitably execute concurrently on a multicore architecture in
spite of the costs associated with enforcing data-dependences and with
task-related actions. The bottleneck analysis tool gives feedback to the
programmers about data-dependences that limit parallelism. In particular, there
are several ???accidental dependences??? that can be easily removed with large
improvements in parallelism. These tools combine into a systematic methodology
for performance tuning in Implicit Parallelization. Finally, armed with the
criticality-driven model, the thesis revisits several architectural design
decisions, and finds several encouraging ways forward to increase the scope of
Implicit Parallelization.unpublishednot peer reviewe
Mitosis based speculative multithreaded architectures
In the last decade, industry made a right-hand turn and shifted towards multi-core processor designs, also known as Chip-Multi-Processors (CMPs), in order to provide further performance improvements under a reasonable power budget, design complexity, and validation cost. Over the years, several processor vendors have come out with multi-core chips in their product lines and they have become mainstream,
with the number of cores increasing in each processor generation. Multi-core processors improve the performance of applications by exploiting Thread Level Parallelism (TLP) while the Instruction Level Parallelism (ILP) exploited by each individual core is limited. These architectures are very efficient when multiple threads are available for execution. However, single-thread sections of code (single-thread
applications and serial sections of parallel applications) pose important constraints on the benefits achieved by parallel execution, as pointed out by Amdahl’s law.
Parallel programming, even with the help of recently proposed techniques like transactional memory, has proven to be a very challenging task. On the other hand, automatically partitioning applications into threads may be a straightforward task in regular applications, but becomes much harder for irregular programs, where compilers usually fail to discover sufficient TLP. In this scenario, two main
directions have been followed in the research community to take benefit of multi-core platforms: Speculative Multithreading (SpMT) and Non-Speculative Clustered architectures. The former splits a sequential application into speculative threads, while the later partitions the instructions among the cores based on data-dependences but avoid large degree of speculation. Despite the large amount of research on
both these approaches, the proposed techniques so far have shown marginal performance improvements.
In this thesis we propose novel schemes to speed-up sequential or lightly threaded applications in multi-core processors that effectively address the main unresolved challenges of previous approaches. In particular, we propose a SpMT architecture, called Mitosis, that leverages a powerful software value prediction technique to manage inter-thread dependences, based on pre-computation slices (p-slices).
Thanks to the accuracy and low cost of this technique, Mitosis is able to effectively parallelize applications even in the presence of frequent dependences among threads. We also propose a novel architecture, called Anaphase, that combines the best of SpMT schemes and clustered architectures. Anaphase effectively exploits ILP, TLP and Memory Level Parallelism (MLP), thanks to its unique finegrain thread decomposition algorithm that adapts to the available parallelism in the application
Mitosis based speculative multithreaded architectures
In the last decade, industry made a right-hand turn and shifted towards multi-core processor designs, also known as Chip-Multi-Processors (CMPs), in order to provide further performance improvements under a reasonable power budget, design complexity, and validation cost. Over the years, several processor vendors have come out with multi-core chips in their product lines and they have become mainstream,
with the number of cores increasing in each processor generation. Multi-core processors improve the performance of applications by exploiting Thread Level Parallelism (TLP) while the Instruction Level Parallelism (ILP) exploited by each individual core is limited. These architectures are very efficient when multiple threads are available for execution. However, single-thread sections of code (single-thread
applications and serial sections of parallel applications) pose important constraints on the benefits achieved by parallel execution, as pointed out by Amdahl’s law.
Parallel programming, even with the help of recently proposed techniques like transactional memory, has proven to be a very challenging task. On the other hand, automatically partitioning applications into threads may be a straightforward task in regular applications, but becomes much harder for irregular programs, where compilers usually fail to discover sufficient TLP. In this scenario, two main
directions have been followed in the research community to take benefit of multi-core platforms: Speculative Multithreading (SpMT) and Non-Speculative Clustered architectures. The former splits a sequential application into speculative threads, while the later partitions the instructions among the cores based on data-dependences but avoid large degree of speculation. Despite the large amount of research on
both these approaches, the proposed techniques so far have shown marginal performance improvements.
In this thesis we propose novel schemes to speed-up sequential or lightly threaded applications in multi-core processors that effectively address the main unresolved challenges of previous approaches. In particular, we propose a SpMT architecture, called Mitosis, that leverages a powerful software value prediction technique to manage inter-thread dependences, based on pre-computation slices (p-slices).
Thanks to the accuracy and low cost of this technique, Mitosis is able to effectively parallelize applications even in the presence of frequent dependences among threads. We also propose a novel architecture, called Anaphase, that combines the best of SpMT schemes and clustered architectures. Anaphase effectively exploits ILP, TLP and Memory Level Parallelism (MLP), thanks to its unique finegrain thread decomposition algorithm that adapts to the available parallelism in the application.Postprint (published version
Compiler-directed energy reduction using dynamic voltage scaling and voltage Islands for embedded systems
Cataloged from PDF version of article.Addressing power and energy consumption related issues early in the system design flow ensures good design and
minimizes iterations for faster turnaround time. In particular, optimizations at software level, e.g., those supported by compilers, are
very important for minimizing energy consumption of embedded applications. Recent research demonstrates that voltage islands
provide the flexibility to reduce power by selectively shutting down the different regions of the chip and/or running the select parts of the
chip at different voltage/frequency levels. As against most of the prior work on voltage islands that mainly focused on the architecture
design and IP placement related issues, this paper studies the necessary software compiler support for voltage islands. Specifically,
we focus on an embedded multiprocessor architecture that supports both voltage islands and control domains within these islands, and
determine how an optimizing compiler can automatically map an embedded application onto this architecture. Such an automated
support is critical since it is unrealistic to expect an application programmer to reach a good mapping correlating multiple factors such
as performance and energy at the same time. Our experiments with the proposed compiler support show that our approach is very
effective in reducing energy consumption. The experiments also show that the energy savings we achieve are consistent across a wide
range of values of our major simulation parameters
- …