Energy-Effectiveness of Pre-Execution and Energy-Aware P-Thread Selection
Pre-execution removes the microarchitectural latency of problem loads from a program’s critical path by redundantly executing copies of their computations in parallel with the main program. There have been several proposed pre-execution systems, a quantitative framework (PTHSEL) for analytical pre-execution thread (p-thread) selection, and even a research prototype. To date, however, the energy aspects of pre-execution have not been studied.
Cycle-level performance and energy simulations on SPEC2000 integer benchmarks that suffer from L2 misses show that energy-blind pre-execution naturally has a linear latency/energy trade-off, improving performance by 13.8% while increasing energy consumption by 11.9%.
To improve this trade-off, we propose two extensions to PTHSEL. First, we replace the flat cycle-for-cycle load cost model with a model based on a critical-path estimation. This extension increases p-thread efficiency in an energy-independent way. Second, we add a parameterized energy model to PTHSEL (forming PTHSEL+E) that allows it to actively select p-threads that reduce energy rather than (or in combination with) execution latency.
Experiments show that PTHSEL+E manipulates pre-execution’s latency/energy trade-off more effectively. Latency-targeted selection benefits from the improved load cost model: its performance improvement grows to an average of 16.4% while the energy cost drops to 8.7%. ED-targeted selection produces p-threads that improve performance by only 12.9%, but improve ED by 8.8%. Targeting p-thread selection for energy reduction results in energy-free pre-execution, with an average speedup of 5.4% and a small decrease in total energy consumption (0.7%).
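The selection objective can be illustrated with a minimal greedy sketch; the cost model, candidate p-threads, and all numbers below are hypothetical stand-ins, not PTHSEL+E's actual formulation:

```python
def select_pthreads(candidates, target="ED"):
    """Toy greedy selection in the spirit of PTHSEL+E: each candidate
    p-thread has an estimated cycle saving and a net energy delta
    (hypothetical numbers), and is kept only if it improves the
    chosen target metric for the whole program."""
    cycles, energy = 1000.0, 1000.0   # assumed baseline cost
    chosen = []
    for name, (saved, e_delta) in sorted(candidates.items()):
        new_c, new_e = cycles - saved, energy + e_delta
        if target == "latency":
            better = new_c < cycles
        elif target == "energy":
            better = new_e < energy
        else:                          # "ED": energy-delay product
            better = new_c * new_e < cycles * energy
        if better:
            cycles, energy = new_c, new_e
            chosen.append(name)
    return chosen, cycles, energy
```

Under the "energy" target, such a selector keeps only p-threads whose stall-energy savings exceed their own overhead, mirroring the "energy-free pre-execution" result above.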
Architectural support for probabilistic branches
A plethora of research efforts have focused on fine-tuning branch predictors to increasingly higher levels of accuracy. However, several important optimization, financial, and statistical data analysis algorithms rely on probabilistic computation. These applications draw random values from a distribution and steer control flow based on those values. Such probabilistic branches are challenging to predict because of their inherent probabilistic nature. As a result, probabilistic codes significantly suffer from branch mispredictions.
This paper proposes Probabilistic Branch Support (PBS), a hardware/software cooperative technique that leverages the observation that the outcome of probabilistic branches needs to be correct only in a statistical sense. PBS stores the outcome and the probabilistic values that lead to the outcome of the current execution to direct the next execution of the probabilistic branch, thereby completely removing the penalty for mispredicted probabilistic branches. PBS relies on marking probabilistic branches in software for hardware to exploit. Our evaluation shows that PBS reduces MPKI by 45% on average (up to 99%) and improves IPC by 6.7% (up to 17%) over the TAGE-SC-L predictor. PBS requires 193 bytes of hardware overhead and introduces statistically negligible algorithmic inaccuracy.
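The core idea — drawing the probabilistic value one execution early so the branch direction is already known when the branch is reached — can be mimicked in software. This is an illustrative analogue of the buffering scheme, not the hardware mechanism:

```python
import random

def probabilistic_walk_pbs(steps, p, seed=0):
    """Software analogue of PBS on a toy random walk: the outcome of
    the probabilistic branch for the *next* iteration is drawn one
    step early into a one-entry buffer, so the control decision is
    already known when the branch is reached (zero misprediction
    penalty in hardware terms). Statistically, the walk is identical
    to drawing the value at the branch itself."""
    rng = random.Random(seed)
    pos = 0
    next_outcome = rng.random() < p       # pre-drawn outcome buffer
    for _ in range(steps):
        take = next_outcome
        next_outcome = rng.random() < p   # refill buffer for next use
        if take:
            pos += 1
        else:
            pos -= 1
    return pos
```

Because each draw still comes from the same distribution, only its timing changes; this is why the approach is correct "in a statistical sense."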
Symbiotic Subordinate Threading (SST)
Integration of multiple processor cores on a single die, relatively constant die sizes, increasing memory latencies, and emerging new applications create new challenges and opportunities for processor architects. How can one build a multi-core processor that provides high single-thread performance while enabling high throughput through multi-programming?
Conventional approaches to high single-thread performance use a large instruction window for memory latency tolerance, which requires large and complex cores. However, to be able to integrate more cores on the same die for high throughput, cores must be simpler and smaller.
We present an architecture that obtains high performance for single-threaded applications in a multi-core environment, while using simpler cores to meet the high-throughput requirement. Our scheme, called Symbiotic Subordinate Threading (SST), achieves the benefits of a large instruction window by utilizing otherwise idle cores to run dynamically constructed subordinate threads (a.k.a. helper threads) for the individual threads running on the active cores.
In our proposed execution paradigm, the subordinate thread fetches and pre-processes instruction streams and retires processed instructions into a buffer for the main thread to consume. The subordinate thread executes a smaller version of the program executed by the main thread. As a result, it runs far ahead to warm up the data caches and fix branch mispredictions for the main thread. In-flight instructions are present in the subordinate thread, the buffer, and the main thread, forming a very large effective instruction window for single-thread out-of-order execution. Moreover, using a simple technique for identifying the subordinate thread's non-speculative results, the main thread can integrate those results directly into its state without having to execute the corresponding instructions. In this way, the main thread is sped up because it, too, effectively executes a smaller version of the program, and the total number of instructions executed is minimized, thereby achieving efficient utilization of the hardware resources. The proposed SST architecture does not require large register files, issue queues, load/store queues, or reorder buffers. In addition, it incurs only minor hardware additions/changes. Experimental results show remarkable latency-hiding capabilities of the proposed SST architecture, outperforming existing architectures that share a similar high-level microarchitecture.
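The subordinate/main interaction through the result buffer can be sketched as a simple producer/consumer model; the op tags, values, and speculative flags below are invented for illustration:

```python
from collections import deque

def subordinate(distilled_ops):
    """Runs the distilled (smaller) program ahead of the main thread,
    retiring (tag, value, speculative) entries into a result buffer.
    Each op is (tag, thunk, speculative_flag)."""
    buf = deque()
    for tag, fn, speculative in distilled_ops:
        buf.append((tag, fn(), speculative))
    return buf

def main_thread(full_ops, buf):
    """Consumes the buffer: non-speculative subordinate results are
    integrated directly into the main thread's state; speculative
    ones are recomputed by executing the instruction normally."""
    results = {}
    for tag, fn in full_ops:
        if buf and buf[0][0] == tag:
            _, val, speculative = buf.popleft()
            if not speculative:
                results[tag] = val   # integrate without re-execution
                continue
        results[tag] = fn()          # execute normally
    return results
```

The key property mirrored here is that work marked non-speculative is executed exactly once across the two threads, which is how SST minimizes the total instruction count.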
Mixed Speculative Multithreaded Execution Models
Institute for Computing Systems Architecture
The current trend toward chip multiprocessor architectures has placed great pressure
on programmers and compilers to generate thread-parallel programs. Improved execution
performance can no longer be obtained via traditional single-thread instruction
level parallelism (ILP), but, instead, via multithreaded execution. One notable technique
that facilitates the extraction of parallel threads from sequential applications is
thread-level speculation (TLS). This technique allows programmers/compilers to generate
threads without checking for inter-thread data and control dependences, which
are then transparently enforced by the hardware. Most prior work on TLS has concentrated
on thread selection and mechanisms to efficiently support the main TLS operations,
such as squashes, data versioning, and commits.
This thesis seeks to enhance TLS functionality by combining it with other speculative
multithreaded execution models. The main idea is that TLS already requires
extensive hardware support, which when slightly augmented can accommodate other
speculative multithreaded techniques. Recognizing that for different applications, or
even program phases, the application bottlenecks may be different, it is reasonable to
assume that the more versatile a system is, the more efficiently it will be able to execute
the given program.
As mentioned above, generating thread-parallel programs is hard and TLS has
been suggested as an execution model that can speculatively exploit thread-level parallelism
(TLP) even when thread independence cannot be guaranteed by the programmer/
compiler. Alternatively, the helper threads (HT) execution model has been proposed
where subordinate threads are executed in parallel with a main thread in order to
improve the execution efficiency (i.e., ILP) of the latter. Yet another execution model,
runahead execution (RA), has also been proposed where subordinate versions of the
main thread are dynamically created especially to cope with long-latency operations,
again with the aim of improving the execution efficiency of the main thread (ILP).
Each one of these multithreaded execution models works best for different applications
and application phases. We combine these three models into a single execution
model and single hardware infrastructure such that the system can dynamically adapt
to find the most appropriate multithreaded execution model. More specifically, TLS is favored whenever successful parallel execution of instructions in multiple threads
(i.e., TLP) is possible and the system can seamlessly transition at run-time to the other
models otherwise. In order to understand the tradeoffs involved, we also develop a performance
model that allows one to quantitatively attribute overall performance gains
to either TLP or ILP in such combined multithreaded execution model.
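As a minimal sketch of such an attribution, assume a multiplicative decomposition S_total = S_TLP × S_ILP (the thesis's actual model may differ); given runtimes for a baseline, a TLS-only system, and the combined system:

```python
def decompose_speedup(t_base, t_combined, t_tls_only):
    """Attribute overall speedup to TLP vs ILP under an (assumed)
    multiplicative model: S_total = S_TLP * S_ILP, where S_TLP is
    the speedup TLS alone achieves over the baseline and S_ILP is
    the residual factor contributed by the other execution models."""
    s_total = t_base / t_combined    # combined-model speedup
    s_tlp = t_base / t_tls_only      # TLS-only speedup
    s_ilp = s_total / s_tlp          # residual attributed to ILP
    return s_total, s_tlp, s_ilp
```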
Experimental results show that our combined execution model achieves speedups
of up to 41.2%, with an average of 10.2%, over an existing state-of-the-art TLS system
and speedups of up to 35.2%, with an average of 18.3%, over a flavor of runahead
execution for a subset of the SPEC2000 Integer benchmark suite.
We then investigate how a common ILP-enhancing microarchitectural feature, namely
branch prediction, interacts with TLS. We show that branch prediction for TLS is even
more important than it is for single core machines. Unfortunately, branch prediction for
TLS systems is also inherently harder. Code partitioning and re-executions of squashed
threads pollute the branch history making it harder for predictors to be accurate.
We thus propose to augment the hardware, so as to accommodate Multi-Path (MP)
execution within the existing TLS protocol. Under the MP execution model, all paths
following a number of hard-to-predict conditional branches are followed. MP execution thus removes branches that would otherwise have been mispredicted, helping the processor exploit more ILP. We show that with only minimal hardware
support, one can combine these two execution models into a unified one, which can
achieve far better performance than both TLS and MP execution.
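A software analogue of following both arms of a hard-to-predict branch can be written with threads (illustrative only; hardware MP execution forks fetch streams, not OS threads):

```python
import concurrent.futures

def multipath_branch(cond_fn, taken_fn, not_taken_fn):
    """Software analogue of multi-path execution: both arms of a
    hard-to-predict branch run concurrently, and once the condition
    resolves, the wrong-path result is simply discarded. This trades
    extra execution bandwidth for zero misprediction penalty."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as ex:
        f_taken = ex.submit(taken_fn)        # speculative: taken path
        f_not = ex.submit(not_taken_fn)      # speculative: fallthrough
        return f_taken.result() if cond_fn() else f_not.result()
```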
Experimental results show that our combined execution model achieves speedups of
up to 20.1%, with an average of 8.8%, over an existing state-of-the-art TLS system and
speedups of up to 125%, with an average of 29.0%, when compared with multi-path
execution for a subset of the SPEC2000 Integer benchmark suite.
Finally, since systems that support speculative multithreading usually treat all
threads equally, they are energy-inefficient. This inefficiency stems from the fact that
speculation occasionally fails and, thus, power is spent on threads that will have to
be discarded. We propose a profitability-based power allocation scheme, where we
“steal” power from non-profitable threads and use it to speed up more useful ones. We
evaluate our techniques for a state-of-the-art TLS system and show that, with minimal hardware support, we achieve improvements in ED of up to 25.5%, with an average of 18.9%, for a subset of the SPEC 2000 Integer benchmark suite.
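The "steal power from non-profitable threads" idea can be sketched as a profitability-proportional budget split; the scores, floor fraction, and thread names below are assumptions for illustration:

```python
def allocate_power(budget, threads):
    """Profitability-proportional power allocation (a sketch of the
    'steal from non-profitable threads' idea): each thread receives a
    share of the budget proportional to its estimated profitability
    (e.g., probability its speculation commits times its criticality),
    with a small floor so no thread is starved entirely.
    `threads` maps name -> profitability score (assumed values)."""
    floor = 0.1 * budget / len(threads)          # guaranteed minimum
    remaining = budget - floor * len(threads)
    total = sum(threads.values())
    return {name: floor + remaining * score / total
            for name, score in threads.items()}
```

A highly profitable thread thus runs at a higher power point (e.g., a higher DVFS level), while threads likely to be squashed are throttled, which is where the ED savings come from.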
Mitosis based speculative multithreaded architectures
In the last decade, industry made a right-hand turn and shifted towards multi-core processor designs, also known as Chip-Multi-Processors (CMPs), in order to provide further performance improvements under a reasonable power budget, design complexity, and validation cost. Over the years, several processor vendors have come out with multi-core chips in their product lines and they have become mainstream,
with the number of cores increasing in each processor generation. Multi-core processors improve the performance of applications by exploiting Thread Level Parallelism (TLP) while the Instruction Level Parallelism (ILP) exploited by each individual core is limited. These architectures are very efficient when multiple threads are available for execution. However, single-thread sections of code (single-thread
applications and serial sections of parallel applications) pose important constraints on the benefits achieved by parallel execution, as pointed out by Amdahl’s law.
Parallel programming, even with the help of recently proposed techniques like transactional memory, has proven to be a very challenging task. On the other hand, automatically partitioning applications into threads may be a straightforward task in regular applications, but becomes much harder for irregular programs, where compilers usually fail to discover sufficient TLP. In this scenario, two main
directions have been followed in the research community to take advantage of multi-core platforms: Speculative Multithreading (SpMT) and Non-Speculative Clustered architectures. The former splits a sequential application into speculative threads, while the latter partitions the instructions among the cores based on data dependences but avoids a large degree of speculation. Despite the large amount of research on
both these approaches, the proposed techniques so far have shown marginal performance improvements.
In this thesis we propose novel schemes to speed up sequential or lightly threaded applications in multi-core processors that effectively address the main unresolved challenges of previous approaches. In particular, we propose a SpMT architecture, called Mitosis, that leverages a powerful software value prediction technique to manage inter-thread dependences, based on pre-computation slices (p-slices).
Thanks to the accuracy and low cost of this technique, Mitosis is able to effectively parallelize applications even in the presence of frequent dependences among threads. We also propose a novel architecture, called Anaphase, that combines the best of SpMT schemes and clustered architectures. Anaphase effectively exploits ILP, TLP, and Memory Level Parallelism (MLP), thanks to its unique fine-grain thread decomposition algorithm that adapts to the available parallelism in the application.
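The p-slice idea — predicting a speculative thread's live-ins with a small distilled pre-computation instead of waiting for the main thread — can be sketched as follows (the validate-by-comparison step is a simplification of Mitosis's actual mechanism):

```python
def spawn_with_pslice(pslice, body, actual_livein):
    """Mitosis-style spawn (sketch): the speculative thread's live-in
    is produced by a pre-computation slice (p-slice) rather than by
    waiting for the main thread to reach the spawn point. The result
    commits only if the predicted live-in matches the actual one;
    otherwise the thread squashes and re-executes correctly."""
    predicted = pslice()                     # cheap distilled slice
    speculative_result = body(predicted)     # run ahead speculatively
    if predicted == actual_livein:
        return speculative_result, True      # validation ok: commit
    return body(actual_livein), False        # mispredicted: squash
```

When the p-slice is accurate, the speculative thread's work overlaps fully with the main thread, which is how Mitosis tolerates frequent inter-thread dependences.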