161 research outputs found

    ILP and TLP in Shared Memory Applications: A Limit Study

    Get PDF
    The work in this dissertation explores the limits of Chip-multiprocessors (CMPs) with respect to shared-memory, multi-threaded benchmarks, which will help aid in identifying microarchitectural bottlenecks. This, in turn, will lead to more efficient CMP design. In the first part we introduce DotSim, a trace-driven toolkit designed to explore the limits of instruction and thread-level scaling and identify microarchitectural bottlenecks in multi-threaded applications. DotSim constructs an instruction-level Data Flow Graph (DFG) from each thread in multi-threaded applications, adjusting for inter-thread dependencies. The DFGs dynamically change depending on the microarchitectural constraints applied. Exploiting these DFGs allows for the easy extraction of the performance upper bound. We perform a case study on modeling the upper-bound performance limits of a processor microarchitecture modeled off a AMD Opteron. In the second part, we conduct a limit study simultaneously analyzing the two dominant forms of parallelism exploited by modern computer architectures: Instruction Level Parallelism (ILP) and Thread Level Parallelism (TLP). This study gives insight into the upper bounds of performance that future architectures can achieve. Furthermore, it identifies the bottlenecks of emerging workloads. To the best of our knowledge, our work is the first study that combines the two forms of parallelism into one study with modern applications. We evaluate the PARSEC multithreaded benchmark suite using DotSim. We make several contributions describing the high-level behavior of next-generation applications. For example, we show that these applications contain up to a factor of 929X more ILP than what is currently being extracted from real machines. We then show the effects of breaking the application into increasing numbers of threads (exploiting TLP), instruction window size, realistic branch prediction, realistic memory latency, and thread dependencies on exploitable ILP. Our examination shows that theses benchmarks differ vastly from one another. As a result, we expect that no single, homogeneous, micro-architecture will work optimally for all, arguing for reconfigurable, heterogeneous designs. In the third part of this thesis, we use our novel simulator DotSim to study the benefits of prefetching shared memory within critical sections. In this chapter we calculate the upper bound of performance under our given constraints. Our intent is to provide motivation for new techniques to exploit the potential benefits of reducing latency of shared memory among threads. We conduct an idealized workload characterization study focusing on the data that is truly shared among threads, using a simplified memory model. We explore the degree of shared memory criticality, and characterize the benefits of being able to use latency reducing techniques to reduce execution time and increase ILP. We find that on average true sharing among benchmarks is quite low compared to overall memory accesses on the critical path and overall program. We also find that truly shared memory between threads does not affect the critical path for the majority of benchmarks, and when it does the impact is less than 1%. Therefore, we conclude that it is not worth exploring latency reducing techniques of truly shared memory within critical sections

    Mitosis based speculative multithreaded architectures

    Get PDF
    In the last decade, industry made a right-hand turn and shifted towards multi-core processor designs, also known as Chip-Multi-Processors (CMPs), in order to provide further performance improvements under a reasonable power budget, design complexity, and validation cost. Over the years, several processor vendors have come out with multi-core chips in their product lines and they have become mainstream, with the number of cores increasing in each processor generation. Multi-core processors improve the performance of applications by exploiting Thread Level Parallelism (TLP) while the Instruction Level Parallelism (ILP) exploited by each individual core is limited. These architectures are very efficient when multiple threads are available for execution. However, single-thread sections of code (single-thread applications and serial sections of parallel applications) pose important constraints on the benefits achieved by parallel execution, as pointed out by Amdahl’s law. Parallel programming, even with the help of recently proposed techniques like transactional memory, has proven to be a very challenging task. On the other hand, automatically partitioning applications into threads may be a straightforward task in regular applications, but becomes much harder for irregular programs, where compilers usually fail to discover sufficient TLP. In this scenario, two main directions have been followed in the research community to take benefit of multi-core platforms: Speculative Multithreading (SpMT) and Non-Speculative Clustered architectures. The former splits a sequential application into speculative threads, while the later partitions the instructions among the cores based on data-dependences but avoid large degree of speculation. Despite the large amount of research on both these approaches, the proposed techniques so far have shown marginal performance improvements. In this thesis we propose novel schemes to speed-up sequential or lightly threaded applications in multi-core processors that effectively address the main unresolved challenges of previous approaches. In particular, we propose a SpMT architecture, called Mitosis, that leverages a powerful software value prediction technique to manage inter-thread dependences, based on pre-computation slices (p-slices). Thanks to the accuracy and low cost of this technique, Mitosis is able to effectively parallelize applications even in the presence of frequent dependences among threads. We also propose a novel architecture, called Anaphase, that combines the best of SpMT schemes and clustered architectures. Anaphase effectively exploits ILP, TLP and Memory Level Parallelism (MLP), thanks to its unique finegrain thread decomposition algorithm that adapts to the available parallelism in the application.Postprint (published version

    Mitosis based speculative multithreaded architectures

    Get PDF
    In the last decade, industry made a right-hand turn and shifted towards multi-core processor designs, also known as Chip-Multi-Processors (CMPs), in order to provide further performance improvements under a reasonable power budget, design complexity, and validation cost. Over the years, several processor vendors have come out with multi-core chips in their product lines and they have become mainstream, with the number of cores increasing in each processor generation. Multi-core processors improve the performance of applications by exploiting Thread Level Parallelism (TLP) while the Instruction Level Parallelism (ILP) exploited by each individual core is limited. These architectures are very efficient when multiple threads are available for execution. However, single-thread sections of code (single-thread applications and serial sections of parallel applications) pose important constraints on the benefits achieved by parallel execution, as pointed out by Amdahl’s law. Parallel programming, even with the help of recently proposed techniques like transactional memory, has proven to be a very challenging task. On the other hand, automatically partitioning applications into threads may be a straightforward task in regular applications, but becomes much harder for irregular programs, where compilers usually fail to discover sufficient TLP. In this scenario, two main directions have been followed in the research community to take benefit of multi-core platforms: Speculative Multithreading (SpMT) and Non-Speculative Clustered architectures. The former splits a sequential application into speculative threads, while the later partitions the instructions among the cores based on data-dependences but avoid large degree of speculation. Despite the large amount of research on both these approaches, the proposed techniques so far have shown marginal performance improvements. In this thesis we propose novel schemes to speed-up sequential or lightly threaded applications in multi-core processors that effectively address the main unresolved challenges of previous approaches. In particular, we propose a SpMT architecture, called Mitosis, that leverages a powerful software value prediction technique to manage inter-thread dependences, based on pre-computation slices (p-slices). Thanks to the accuracy and low cost of this technique, Mitosis is able to effectively parallelize applications even in the presence of frequent dependences among threads. We also propose a novel architecture, called Anaphase, that combines the best of SpMT schemes and clustered architectures. Anaphase effectively exploits ILP, TLP and Memory Level Parallelism (MLP), thanks to its unique finegrain thread decomposition algorithm that adapts to the available parallelism in the application

    Exploiting a new level of DLP in multimedia applications

    Get PDF
    This paper proposes and evaluates MOM: a novel ISA paradigm targeted at multimedia applications. By fusing conventional vector ISA approaches together with more recent SIMD-like (Single Instruction Multiple Data) ISAs (such as MMX), we have developed a new matrix oriented ISA which efficiently deals with the small matrix structures typically found in multimedia applications. MOM exploits a level of DLP not reachable by neither conventional vector ISAs nor SIMD-like media ISA extensions. Our results show that MOM provides a factor of 1.3x to 4x performance improvement when compared with two different multimedia extensions (MMX and MDMX) on several kernels, which translates into up to a 50% of performance gain when measuring full applications (20% in average). Furthermore, the streaming nature of MOM provides additional advantages for executing multimedia applications, such as a very low fetch pressure or a high tolerance to memory latency, making MOM an ideal candidate for the embedded domain.Peer ReviewedPostprint (published version

    Software-Based Side Channel Attacks and the Future of Hardened Microarchitecture

    Get PDF
    Side channel attack vectors found in microarchitecture of computing devices expose systems to potentially system-level breaches. This thesis consists of a comprehensive report on current exploits of this nature, describing their fundamental basis and usage, paving the way to further research into hardware mitigations that may be utilized to combat these and future vulnerabilities. It will discuss several modern software-based side channel attacks, describing the mechanisms they utilize to gain access to privileged information. Attack vectors will be exemplified, along with applicability to various architectures utilized in modern computing. Finally, discussion of how future architectural changes must successfully harden chips against attacks of this type will occur, ending with a reinforced call for development of these integral architectural revisions to resolve the threat

    pocl: A Performance-Portable OpenCL Implementation

    Get PDF
    OpenCL is a standard for parallel programming of heterogeneous systems. The benefits of a common programming standard are clear; multiple vendors can provide support for application descriptions written according to the standard, thus reducing the program porting effort. While the standard brings the obvious benefits of platform portability, the performance portability aspects are largely left to the programmer. The situation is made worse due to multiple proprietary vendor implementations with different characteristics, and, thus, required optimization strategies. In this paper, we propose an OpenCL implementation that is both portable and performance portable. At its core is a kernel compiler that can be used to exploit the data parallelism of OpenCL programs on multiple platforms with different parallel hardware styles. The kernel compiler is modularized to perform target-independent parallel region formation separately from the target-specific parallel mapping of the regions to enable support for various styles of fine-grained parallel resources such as subword SIMD extensions, SIMD datapaths and static multi-issue. Unlike previous similar techniques that work on the source level, the parallel region formation retains the information of the data parallelism using the LLVM IR and its metadata infrastructure. This data can be exploited by the later generic compiler passes for efficient parallelization. The proposed open source implementation of OpenCL is also platform portable, enabling OpenCL on a wide range of architectures, both already commercialized and on those that are still under research. The paper describes how the portability of the implementation is achieved. Our results show that most of the benchmarked applications when compiled using pocl were faster or close to as fast as the best proprietary OpenCL implementation for the platform at hand.Comment: This article was published in 2015; it is now openly accessible via arxi

    A generic framework to integrate data caches in the WCET analysis of real-time systems

    Get PDF
    Worst-case execution time (WCET) analysis of systems with data caches is one of the key challenges in real-time systems. Caches exploit the inherent reuse properties of programs by temporarily storing certain memory contents near the processor, in order that further accesses to such contents do not require costly memory transfers. Current worst-case data cache analysis methods focus on specific cache organizations (set-associative LRU, locked, ACDC, etc.), most of the times adapting techniques designed to analyze instruction caches. On the other hand, there are methodologies to analyze the data reuse of a program, independently of the data cache. In this paper we propose a generic WCET analysis framework to analyze data caches taking profit of such reuse information. It includes the categorization of data references and their integration in an IPET model. We apply it to a conventional LRU cache, an ACDC, and other baseline systems, and compare them using the TACLeBench benchmark suite. Our results show that persistence-based LRU analyses dismiss essential information on data, and a reuse-based analysis improves the WCET bound around 17% in average. In general, the best WCET estimations are obtained with optimization level 2, where the ACDC cache performs 39% better than a set-associative LRU

    Exploiting compiler-generated schedules for energy savings in high-performance processors

    Get PDF
    This paper develops a technique that uniquely combines the advantages of static scheduling and dynamic scheduling to reduce the energy consumed in modern superscalar processors with out-of-order issue logic. In this Hybrid-Scheduling paradigm, regions of the application containing large amounts of parallelism visible at compile-time completely bypass the dynamic scheduling logic and execute in a low power static mode. Simulation studies using the Wattch framework on several media and scientific benchmarks demonstrate large improvements in overall energy consumption of 43 % in kernels and 25 % in full applications with only a 2.8 % performance degradation on average
    • …
    corecore