1,742 research outputs found
Thread-spawning schemes for speculative multithreading
Speculative multithreading has been recently proposed to boost performance by means of exploiting thread-level parallelism in applications difficult to parallelize. The performance of these processors heavily depends on the partitioning policy used to split the program into threads. Previous work uses heuristics to spawn speculative threads based on easily-detectable program constructs such as loops or subroutines. In this work we propose a profile-based mechanism to divide programs into threads by searching for those parts of the code that have certain features that could benefit from potential thread-level parallelism. Our profile-based spawning scheme is evaluated on a Clustered Speculative Multithreaded Processor and results show large performance benefits. When the proposed spawning scheme is compared with traditional heuristics, we outperform them by almost 20%. When a realistic value predictor and a 8-cycle thread initialization penalty is considered, the performance difference between them is maintained. The speed-up over a single thread execution is higher than 5x for a 16-thread-unit processor and close to 2x for a 4-thread-unit processor.Peer ReviewedPostprint (published version
Thread partitioning and value prediction for exploiting speculative thread-level parallelism
Speculative thread-level parallelism has been recently proposed as a source of parallelism to improve the performance in applications where parallel threads are hard to find. However, the efficiency of this execution model strongly depends on the performance of the control and data speculation techniques. Several hardware-based schemes for partitioning the program into speculative threads are analyzed and evaluated. In general, we find that spawning threads associated to loop iterations is the most effective technique. We also show that value prediction is critical for the performance of all of the spawning policies. Thus, a new value predictor, the increment predictor, is proposed. This predictor is specially oriented for this kind of architecture and clearly outperforms the adapted versions of conventional value predictors such as the last value, the stride, and the context-based, especially for small-sized history tables.Peer ReviewedPostprint (published version
ret2spec: Speculative Execution Using Return Stack Buffers
Speculative execution is an optimization technique that has been part of CPUs
for over a decade. It predicts the outcome and target of branch instructions to
avoid stalling the execution pipeline. However, until recently, the security
implications of speculative code execution have not been studied.
In this paper, we investigate a special type of branch predictor that is
responsible for predicting return addresses. To the best of our knowledge, we
are the first to study return address predictors and their consequences for the
security of modern software. In our work, we show how return stack buffers
(RSBs), the core unit of return address predictors, can be used to trigger
misspeculations. Based on this knowledge, we propose two new attack variants
using RSBs that give attackers similar capabilities as the documented Spectre
attacks. We show how local attackers can gain arbitrary speculative code
execution across processes, e.g., to leak passwords another user enters on a
shared system. Our evaluation showed that the recent Spectre countermeasures
deployed in operating systems can also cover such RSB-based cross-process
attacks. Yet we then demonstrate that attackers can trigger misspeculation in
JIT environments in order to leak arbitrary memory content of browser
processes. Reading outside the sandboxed memory region with JIT-compiled code
is still possible with 80\% accuracy on average.Comment: Updating to the cam-ready version and adding reference to the
original pape
Data speculative multithreaded architecture
We present a novel processor microarchitecture that relieves three of the most important bottlenecks of superscalar processors: the serialization imposed by true dependences, the relatively small window size and the instruction fetch bandwidth. The new architecture executes simultaneously multiple threads of control obtained from a single program by means of control speculation techniques that do not require any compiler/user support nor any special feature in the instruction set architecture. The multiple simultaneous threads execute different iterations of the same loop, which require the same fetch bandwidth as a single thread since they share the same code. Inter-thread dependences as well as the values that flow through them are speculated by means of data prediction techniques. The preliminary evaluation results show a significant speed-up when compared with a superscalar processor. In fact, the new processor architecture can achieve an IPC (instructions per cycle) rate even larger than the peak fetch bandwidthPeer ReviewedPostprint (published version
Control speculation in multithreaded processors through dynamic loop detection
This paper presents a mechanism to dynamically detect the loops that are executed in a program. This technique detects the beginning and the termination of the iterations and executions of the loops without compiler/user intervention. We propose to apply this dynamic loop detection to the speculation of multiple threads of control dynamically obtained from a sequential program. Based an the highly predictable behavior of the loops, the history of the past executed loops is used to speculate the future instruction sequence. The overall objective is to dynamically obtain coarse grain parallelism (at the thread level) that can be exploited by a multithreaded architecture. We show that for a 4-context multithreaded processor the speculation mechanism provides around 2.6 concurrent threads in average.Peer ReviewedPostprint (published version
Adaptive runtime-assisted block prefetching on chip-multiprocessors
Memory stalls are a significant source of performance degradation in modern processors. Data prefetching is a widely adopted and well studied technique used to alleviate this problem. Prefetching can be performed by the hardware, or be initiated and controlled by software. Among software controlled prefetching we find a wide variety of schemes, including runtime-directed prefetching and more specifically runtime-directed block prefetching. This paper proposes a hybrid prefetching mechanism that integrates a software driven block prefetcher with existing hardware prefetching techniques. Our runtime-assisted software prefetcher brings large blocks of data on-chip with the support of a low cost hardware engine, and synergizes with existing hardware prefetchers that manage locality at a finer granularity. The runtime system that drives the prefetch engine dynamically selects which cache to prefetch to. Our evaluation on a set of scientific benchmarks obtains a maximum speed up of 32 and 10 % on average compared to a baseline with hardware prefetching only. As a result, we also achieve a reduction of up to 18 and 3 % on average in energy-to-solution.Peer ReviewedPostprint (author's final draft
- …