39 research outputs found

    Trace-level speculative multithreaded architecture

    Get PDF
    This paper presents a novel microarchitecture to exploit trace-level speculation by means of two threads working cooperatively in a speculative and non-speculative way respectively. The architecture presents two main benefits: (a) no significant penalties are introduced in the presence of a misspeculation and (b) any type of trace predictor can work together with this proposal. In this way, aggressive trace predictors can be incorporated since misspeculations do not introduce significant penalties. We describe in detail TSMA (trace-level speculative multithreaded architecture) and present initial results to show the benefits of this proposal. We show how simple trace predictors achieve significant speed-up in the majority of cases. Results of a simple trace speculation mechanism show an average speed-up of 16%.Peer ReviewedPostprint (published version

    Compiler analysis for trace-level speculative multithreaded architectures

    Get PDF
    Trace-level speculative multithreaded processors exploit trace-level speculation by means of two threads working cooperatively. One thread, called the speculative thread, executes instructions ahead of the other by speculating on the result of several traces. The other thread executes speculated traces and verifies the speculation made by the first thread. In this paper, we propose a static program analysis for identifying candidate traces to be speculated. This approach identifies large regions of code whose live-output values may be successfully predicted. We present several heuristics to determine the best opportunities for dynamic speculation, based on compiler analysis and program profiling information. Simulation results show that the proposed trace recognition techniques achieve on average a speed-up close to 38% for a collection of SPEC2000 benchmarks.Peer ReviewedPostprint (published version

    Thread-spawning schemes for speculative multithreading

    Get PDF
    Speculative multithreading has been recently proposed to boost performance by means of exploiting thread-level parallelism in applications difficult to parallelize. The performance of these processors heavily depends on the partitioning policy used to split the program into threads. Previous work uses heuristics to spawn speculative threads based on easily-detectable program constructs such as loops or subroutines. In this work we propose a profile-based mechanism to divide programs into threads by searching for those parts of the code that have certain features that could benefit from potential thread-level parallelism. Our profile-based spawning scheme is evaluated on a Clustered Speculative Multithreaded Processor and results show large performance benefits. When the proposed spawning scheme is compared with traditional heuristics, we outperform them by almost 20%. When a realistic value predictor and a 8-cycle thread initialization penalty is considered, the performance difference between them is maintained. The speed-up over a single thread execution is higher than 5x for a 16-thread-unit processor and close to 2x for a 4-thread-unit processor.Peer ReviewedPostprint (published version

    Software-Controlled Instruction Prefetch Buffering for Low-End Processors

    Get PDF
    This paper proposes a method of buffering instructions by software-based prefetching. The method allows low-end processors to improve their instruction throughput with a minimum of additional logic and power consumption. Low-end embedded processors do not employ caches for mainly two reasons. The first reason is that the overhead of cache implementation in terms of energy and area is considerable. The second reason is that, because a cache's performance primarily depends on the number of hits, an increasing number of misses could cause a processor to remain in stall mode for a longer duration. As a result, a cache may become more of a liability than an advantage. In contrast, the benchmarked results for the proposed software-based prefetch buffering without a cache show a 5-10% improvement in execution time. They also show a 4% or more reduction in the energy-delay-square-product (ED2P) with a maximum reduction of 40%. The results additionally demonstrate that the performance and efficiency of the proposed architecture scales with the number of multicycle instructions. The benchmarked routines tested to arrive at these results are widely deployed components of embedded applications

    Compiler Assisted Cache Prefetch Using Procedure Call Hierarchy

    Get PDF
    Microprocessor performance has been increasing at an exponential rate while memory system performance improved at a linear rate. This widening difference in performances is increasingly rendering advances in computer architecture less useful as more instructions spend more time waiting for data to be fetched from the memory after a cache miss. Data prefetching is a technique that avoids some cache misses by bringing data into the cache before it is actually needed. Different approaches to data prefetching have been developed, however existing prefetch schemes do not eliminate all cache misses and even with smaller cache miss ratio, miss latency remains an important performance limiter. In this thesis, we propose a technique called Compiler Assisted Cache Prefetch Using Procedure Call Hierarchy (CAPPH). It is a hardware-software prefetch technique that uses a compiler to provide information pertaining to data structure layout, data-flow and procedure-call hierarchy of the program to a mechanism that prefetches linked data structures (LDS). It can prefetch data for procedures even before they are called by using this statically generated information. It is also capable of issuing prefetches for recursive functions that access LDS and arbitrary access sequences which are otherwise difficult to prefetch. The scheme is simulated using RSIML, a SPARC v8 simulator. Benchmarks em3d, health and mst from the Olden suite were used. The scheme was compared with an otherwise identical system with no prefetch and one using sequential prefetch. Simulations were performed to measure CAPPH performance and the decrease in the miss ratio of loads accessing LDS. Statistics of individual loads were collected, and accuracy, coverage and timeliness were measured against varying cache size and latency. Results from individual loads accessing linked data structures show considerable decrease in their miss ratios and average access times. CAPPH is found to be more accurate than sequential prefetch. The coverage and timeliness are lower in CAPPH than in sequential prefetch. We suggest heuristics to further enhance the effectiveness of the prefetch technique
    corecore