28 research outputs found

    Thread-spawning schemes for speculative multithreading

    Get PDF
    Speculative multithreading has recently been proposed to boost performance by exploiting thread-level parallelism in applications that are difficult to parallelize. The performance of these processors depends heavily on the partitioning policy used to split the program into threads. Previous work uses heuristics to spawn speculative threads based on easily detectable program constructs such as loops or subroutines. In this work we propose a profile-based mechanism that divides programs into threads by searching for those parts of the code whose features indicate potential thread-level parallelism. Our profile-based spawning scheme is evaluated on a Clustered Speculative Multithreaded Processor and the results show large performance benefits. When compared with traditional heuristics, the proposed spawning scheme outperforms them by almost 20%. When a realistic value predictor and an 8-cycle thread initialization penalty are considered, the performance difference between them is maintained. The speed-up over single-thread execution is higher than 5x for a 16-thread-unit processor and close to 2x for a 4-thread-unit processor.
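
    As a rough sketch of what profile-guided spawn-point selection can look like in software (the candidate metrics and thresholds below are illustrative assumptions, not the paper's actual criteria), in Python:

    # Each candidate pair (spawn point, thread start) carries profiled statistics.
    candidates = [
        # (spawn PC, start PC, prob. start is reached, avg. distance in insts, live-in count)
        ("0x4010", "0x4100", 0.95, 180, 2),
        ("0x4200", "0x4240", 0.60,  40, 1),
        ("0x5000", "0x5400", 0.90, 600, 9),
    ]

    MIN_REACH_PROB = 0.85   # spawned thread should almost always be useful work
    MIN_DISTANCE   = 100    # enough instructions to amortize the spawn overhead
    MAX_LIVE_INS   = 4      # few values to predict/communicate between threads

    spawn_pairs = [(spawn, start)
                   for spawn, start, prob, dist, live_ins in candidates
                   if prob >= MIN_REACH_PROB and dist >= MIN_DISTANCE and live_ins <= MAX_LIVE_INS]
    print(spawn_pairs)   # [('0x4010', '0x4100')]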

    Exploiting semantic commutativity in hardware speculation

    Get PDF
    Hardware speculative execution schemes such as hardware transactional memory (HTM) enjoy low run-time overheads but suffer from limited concurrency because they rely on reads and writes to detect conflicts. By contrast, software speculation schemes can exploit semantic knowledge of concurrent operations to reduce conflicts. In particular, they often exploit the fact that many operations on shared data, like insertions into sets, are semantically commutative: they produce semantically equivalent results when reordered. However, software techniques often incur unacceptable run-time overheads. To resolve this dichotomy, we present CommTM, an HTM that exploits semantic commutativity. CommTM extends the coherence protocol and conflict detection scheme to support user-defined commutative operations. Multiple cores can perform commutative operations on the same data concurrently and without conflicts. CommTM preserves transactional guarantees and can be applied to arbitrary HTMs. CommTM scales on many operations that serialize in conventional HTMs, like set insertions, reference counting, and top-K insertions, and retains the low overhead of HTMs. As a result, at 128 cores, CommTM outperforms a conventional eager-lazy HTM by up to 3.4× and reduces or eliminates aborts.
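
    To make the notion of semantic commutativity concrete (a software-level illustration only; CommTM enforces this in the coherence protocol and conflict detection hardware), two transactions built from set insertions and reference-count increments reach an equivalent state in either order:

    def run(ops, state=None):
        state = {"members": set(), "refcount": 0} if state is None else state
        for op, arg in ops:
            if op == "insert":
                state["members"].add(arg)   # set insertion: commutative
            elif op == "incref":
                state["refcount"] += arg    # reference counting: commutative
        return state

    tx_a = [("insert", 3), ("incref", 1)]
    tx_b = [("insert", 7), ("incref", 1)]

    assert run(tx_b, run(tx_a)) == run(tx_a, run(tx_b))   # either order, same semantic state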

    A scalable architecture for ordered parallelism

    Get PDF
    We present Swarm, a novel architecture that exploits ordered irregular parallelism, which is abundant but hard to mine with current software and hardware techniques. In this architecture, programs consist of short tasks with programmer-specified timestamps. Swarm executes tasks speculatively and out of order, and efficiently speculates thousands of tasks ahead of the earliest active task to uncover ordered parallelism. Swarm builds on prior TLS and HTM schemes, and contributes several new techniques that allow it to scale to large core counts and speculation windows, including a new execution model, speculation-aware hardware task management, selective aborts, and scalable ordered commits. We evaluate Swarm on graph analytics, simulation, and database benchmarks. At 64 cores, Swarm achieves 51–122× speedups over a single-core system, and outperforms software-only parallel algorithms by 3–18×.
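
    The programming model can be rendered in plain software as tasks carrying programmer-specified timestamps, drained from a priority queue in timestamp order (Swarm itself executes them speculatively and out of order in hardware; the Dijkstra-style shortest-paths loop below is just an illustrative ordered workload):

    import heapq

    graph = {0: [(1, 4), (2, 1)], 1: [(3, 1)], 2: [(1, 2), (3, 5)], 3: []}
    dist = {}
    tasks = [(0, 0)]                    # (timestamp = tentative distance, vertex)

    while tasks:
        ts, v = heapq.heappop(tasks)    # earliest-timestamp task runs next
        if v in dist:
            continue                    # vertex already settled; task is a no-op
        dist[v] = ts
        for u, w in graph[v]:
            heapq.heappush(tasks, (ts + w, u))   # spawn child task with a later timestamp

    print(dist)   # {0: 0, 2: 1, 1: 3, 3: 4}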

    Implicitly-multithreaded processors

    Get PDF
    This paper proposes the Implicitly-MultiThreaded (IMT) architecture to execute compiler-specified speculative threads on a modified Simultaneous Multithreading pipeline. IMT reduces hardware complexity by relying on the compiler to select suitable thread spawning points and to orchestrate inter-thread register communication. To enhance IMT's effectiveness, this paper proposes three novel microarchitectural mechanisms: (1) a resource- and dependence-based fetch policy to fetch and execute suitable instructions, (2) context multiplexing to improve utilization by mapping as many threads to a single context as resource availability allows, and (3) early thread invocation to hide thread start-up overhead by overlapping one thread's invocation with other threads' execution. We use SPEC2K benchmarks and cycle-accurate simulation to show that a microarchitecture-optimized IMT improves performance on average by 24% and at best by 69% over an aggressive superscalar. We also compare IMT to two prior proposals, TME and DMT, for speculative threading on an SMT using hardware-extracted threads. Our best IMT design outperforms comparable TME and DMT designs on average by 26% and 38%, respectively.
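
    As a very loose software analogy of compiler-specified speculative threads with value prediction (not the IMT hardware itself; the loop, chunking, and the stride-style live-in predictor below are invented for illustration), a Python sketch that runs loop chunks in parallel on predicted live-in values and re-executes any chunk whose prediction turns out wrong:

    from concurrent.futures import ThreadPoolExecutor

    def run_chunk(start, end, live_in):
        acc = live_in
        for i in range(start, end):
            acc += i                  # loop body; 'acc' is the cross-chunk live value
        return acc                    # live-out handed to the in-order commit pass

    def predict_live_in(start):
        # stand-in value predictor: closed form for the prefix sum up to 'start'
        return start * (start - 1) // 2

    chunks = [(i, i + 250) for i in range(0, 1000, 250)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        outs = list(pool.map(lambda c: run_chunk(c[0], c[1], predict_live_in(c[0])), chunks))

    # commit in order, squashing and re-executing any mispredicted chunk
    live = 0
    for (start, end), out in zip(chunks, outs):
        if predict_live_in(start) != live:      # misprediction detected
            out = run_chunk(start, end, live)   # re-execute with the correct live-in
        live = out
    print(live)   # 499500, identical to the sequential loop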

    Smart Memories: A Modular Reconfigurable Architecture

    Full text link

    Parallelizing sequential applications on commodity hardware using a low-cost software transactional memory

    Full text link