
    Pegasus: Performance Engineering for Software Applications Targeting HPC Systems

    Developing and optimizing software applications for high performance and energy efficiency is a very challenging task, even when considering a single target machine. For instance, optimizing for multicore-based computing systems requires in-depth knowledge about programming languages, application programming interfaces, compilers, performance tuning tools, and computer architecture and organization. Many of the tasks in performance engineering methodologies require manual effort and the use of different tools that are not always part of an integrated toolchain. This paper presents Pegasus, a performance engineering approach supported by a framework that consists of a source-to-source compiler, controlled and guided by strategies programmed in a Domain-Specific Language, and an autotuner. Pegasus is a holistic and versatile approach that spans the various decision layers composing the software stack and exploits the system capabilities and workloads effectively through runtime autotuning. The Pegasus approach helps developers by automating tasks involved in the efficient implementation of software applications on multicore computing systems. These tasks focus on application analysis, profiling, code transformations, and the integration of runtime autotuning. Pegasus allows developers to program their own strategies or to automatically apply existing strategies to software applications in order to ensure compliance with non-functional requirements, such as performance and energy efficiency. We show how to apply Pegasus and demonstrate its applicability and effectiveness in a complex case study, which includes tasks from a smart navigation system.
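    As a rough illustration of the runtime-autotuning idea behind such frameworks (not the Pegasus toolchain or its DSL), the following C++ sketch times a set of candidate configurations for a hypothetical kernel at runtime and keeps the fastest one; the kernel, the candidate block sizes, and the workload are all illustrative assumptions.

```cpp
// Minimal runtime-autotuning sketch (not the Pegasus framework itself):
// time each candidate configuration of a kernel once, then keep using the
// fastest one. Kernel, candidate set, and workload are hypothetical.
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <iostream>
#include <vector>

// Hypothetical tunable kernel: blocked vector scaling, parameterized by block size.
void scale_blocked(std::vector<double>& v, double a, std::size_t block) {
    for (std::size_t start = 0; start < v.size(); start += block)
        for (std::size_t i = start; i < std::min(start + block, v.size()); ++i)
            v[i] *= a;
}

int main() {
    std::vector<double> data(1 << 22, 1.0);
    const std::vector<std::size_t> candidates = {256, 1024, 4096, 16384};

    std::size_t best_block = candidates.front();
    double best_ms = 1e300;
    for (std::size_t block : candidates) {              // autotuning phase
        auto t0 = std::chrono::steady_clock::now();
        scale_blocked(data, 1.000001, block);
        auto t1 = std::chrono::steady_clock::now();
        double ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
        if (ms < best_ms) { best_ms = ms; best_block = block; }
    }
    std::cout << "selected block size: " << best_block << "\n";

    // Production phase: keep running with the configuration that won.
    for (int iter = 0; iter < 100; ++iter)
        scale_blocked(data, 1.000001, best_block);
}
```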

    GraphBLAST: A High-Performance Linear Algebra-based Graph Framework on the GPU

    High-performance implementations of graph algorithms are difficult to develop on new parallel hardware such as GPUs because of three challenges: (1) the difficulty of devising suitable graph building blocks, (2) load imbalance on parallel hardware, and (3) the low arithmetic intensity of graph problems. To address some of these challenges, GraphBLAS is an innovative, ongoing effort by the graph analytics community to propose building blocks based on sparse linear algebra, which will allow graph algorithms to be expressed in a performant, succinct, composable, and portable manner. In this paper, we examine the performance challenges of a linear-algebra-based approach to building graph frameworks and describe new design principles for overcoming these bottlenecks. Among the new design principles is exploiting input sparsity, which allows users to write graph algorithms without specifying push or pull direction. Exploiting output sparsity allows users to tell the backend which values of the output of a single vectorized computation they do not want computed. Load balancing is an important feature for distributing work evenly amongst parallel workers; we describe the load-balancing features that matter for handling graphs with different characteristics. The design principles described in this paper have been implemented in "GraphBLAST", the first high-performance linear-algebra-based graph framework on NVIDIA GPUs that is open source. The results show that on a single GPU, GraphBLAST has on average at least an order of magnitude speedup over previous GraphBLAS implementations SuiteSparse and GBTL, comparable performance to the fastest GPU hardwired primitives and shared-memory graph frameworks Ligra and Gunrock, and better performance than any other GPU graph framework, while offering a simpler and more concise programming model.
    Comment: 50 pages, 14 figures, 14 tables
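    For context, the following minimal CPU sketch shows the linear-algebra view of BFS that GraphBLAS-style frameworks build on: the frontier acts as a sparse Boolean vector, and one BFS level corresponds to a masked sparse matrix-vector product over a Boolean semiring. It is not GraphBLAST's API; the CSR layout, names, and example graph are illustrative.

```cpp
// Minimal CPU sketch of BFS expressed as repeated masked sparse matrix-vector
// products (the formulation GraphBLAS-style frameworks build on). The CSR
// layout and names are illustrative, not GraphBLAST's actual API.
#include <iostream>
#include <utility>
#include <vector>

struct CsrGraph {                 // adjacency matrix in CSR form
    std::vector<int> row_ptr;     // size n+1
    std::vector<int> col_idx;     // neighbor indices
};

std::vector<int> bfs_levels(const CsrGraph& g, int source) {
    const int n = static_cast<int>(g.row_ptr.size()) - 1;
    std::vector<int> level(n, -1);                 // -1 means unvisited
    std::vector<int> frontier = {source};          // sparse Boolean vector
    level[source] = 0;

    for (int depth = 1; !frontier.empty(); ++depth) {
        std::vector<int> next;
        // One BFS level: expand the frontier, masked by "not yet visited".
        for (int u : frontier)
            for (int e = g.row_ptr[u]; e < g.row_ptr[u + 1]; ++e) {
                int v = g.col_idx[e];
                if (level[v] == -1) {
                    level[v] = depth;
                    next.push_back(v);
                }
            }
        frontier = std::move(next);                // push direction; a backend may
                                                   // switch to pull when the frontier is dense
    }
    return level;
}

int main() {
    // Tiny example graph: edges 0-1, 0-2, 1-3, 2-3 (undirected, both directions stored).
    CsrGraph g{{0, 2, 4, 6, 8}, {1, 2, 0, 3, 0, 3, 1, 2}};
    for (int lvl : bfs_levels(g, 0)) std::cout << lvl << ' ';   // prints: 0 1 1 2
    std::cout << '\n';
}
```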

    Memory Subsystem Optimization Techniques for Modern High-Performance General-Purpose Processors

    General-purpose processors propel the advances and innovations that are the subject of humanity's many endeavors. Catering to this demand, chip multiprocessors (CMPs) and general-purpose graphics processing units (GPGPUs) have seen many high-performance innovations in their architectures. With these advances, the memory subsystem has become the performance- and energy-limiting aspect of CMPs and GPGPUs alike. This dissertation identifies and mitigates the key performance and energy-efficiency bottlenecks in the memory subsystem of general-purpose processors via novel, practical microarchitecture and system-architecture solutions. Addressing the important Last Level Cache (LLC) management problem in CMPs, I observe that LLC management decisions made in isolation, as in prior proposals, often lead to sub-optimal system performance. I demonstrate that in order to maximize system performance, it is essential to manage the LLC while being cognizant of its interaction with the system main memory. I propose ReMAP, which reduces the net memory access cost by evicting cache lines that either have no reuse or have low memory access cost. ReMAP improves the performance of the CMP system by as much as 13%, and by 6.5% on average. Rather than the LLC, the L1 data cache has a pronounced impact on GPGPU performance by acting as the bandwidth filter for the rest of the memory subsystem. Prior work has shown that the severely constrained data cache capacity in GPGPUs leads to sub-optimal performance. In this thesis, I propose two novel techniques that address the GPGPU data cache capacity problem. I propose ID-Cache, which performs effective cache bypassing and cache line size selection to improve cache capacity utilization. Next, I propose LATTE-CC, which exploits the GPU's latency tolerance and adaptively compresses the data stored in the data cache, thereby increasing its effective capacity. ID-Cache and LATTE-CC are shown to achieve 71% and 19.2% speedups, respectively, across a wide variety of GPGPU applications. Complementing the aforementioned microarchitecture techniques, I identify the need for system-architecture innovations to sustain the performance scalability of GPGPUs in the face of a slowing Moore's Law. I propose a novel GPU architecture called the Multi-Chip-Module GPU (MCM-GPU) that integrates multiple GPU modules to form a single logical GPU. With intelligent memory subsystem optimizations tailored for MCM-GPUs, it can achieve within 7% of the performance of a similar but hypothetical monolithic-die GPU. Taking a step further, I present an in-depth study of the energy-efficiency characteristics of future MCM-GPUs and demonstrate that the inherent non-uniform memory access side effects form their key energy-efficiency bottleneck. In summary, this thesis offers key insights into the performance and energy-efficiency bottlenecks of CMPs and GPGPUs, which can guide future architects towards developing high-performance and energy-efficient general-purpose processors.
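    As a toy illustration of the eviction idea described above (prefer victims with no expected reuse, otherwise the line that is cheapest to refetch from memory), the following sketch selects a victim way within one cache set. It is a simplified sketch, not ReMAP itself; in a real design the reuse and cost fields would come from hardware predictors.

```cpp
// Toy victim-selection sketch for a cost-aware eviction policy: evict lines
// with no expected reuse first, otherwise the line whose refetch from main
// memory is cheapest. This is an illustration, not ReMAP; the reuse and cost
// fields are stand-ins for predictor outputs.
#include <cstdint>
#include <iostream>
#include <vector>

struct CacheLine {
    std::uint64_t tag = 0;
    bool valid = false;
    bool predicted_reuse = false;   // e.g., from a reuse predictor (assumed)
    int  memory_cost = 0;           // e.g., estimated cost of refetching from DRAM (assumed)
};

// Pick a victim way within one set.
int choose_victim(const std::vector<CacheLine>& set) {
    // 1) Invalid or no-reuse lines are evicted first.
    for (int w = 0; w < static_cast<int>(set.size()); ++w)
        if (!set[w].valid || !set[w].predicted_reuse) return w;
    // 2) Otherwise evict the line that is cheapest to bring back from memory.
    int victim = 0;
    for (int w = 1; w < static_cast<int>(set.size()); ++w)
        if (set[w].memory_cost < set[victim].memory_cost) victim = w;
    return victim;
}

int main() {
    std::vector<CacheLine> set = {
        {0x10, true, true, 300}, {0x20, true, true, 120},
        {0x30, true, true, 450}, {0x40, true, true, 200}};
    std::cout << "victim way: " << choose_victim(set) << "\n";  // way 1 (cost 120)
}
```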

    Graph Algorithms on GPUs

    This chapter introduces the topic of graph algorithms on GPUs. It starts by presenting and comparing the most important data structures and techniques used for representing and analysing graphs on GPUs at the state of the art. It then presents the theory and an updated review of the most efficient GPU implementations of graph algorithms. In particular, the chapter focuses on graph traversal algorithms (breadth-first search), single-source shortest path (Dijkstra, Bellman-Ford, delta stepping, hybrids), and all-pairs shortest path (Floyd-Warshall). At the end of the chapter, load balancing and memory access techniques are discussed through an overview of their main issues and management techniques.
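    As a point of reference for one of the surveyed algorithms, the following sequential Bellman-Ford sketch shows the algorithmic core that GPU implementations typically parallelize by relaxing edges in parallel per round; the graph data is illustrative.

```cpp
// Sequential reference sketch of Bellman-Ford, one of the SSSP algorithms
// surveyed in the chapter. GPU versions parallelize the per-edge relaxation;
// this sketch only shows the algorithmic core. The example graph is illustrative.
#include <iostream>
#include <limits>
#include <vector>

struct Edge { int u, v; double w; };

std::vector<double> bellman_ford(int n, const std::vector<Edge>& edges, int src) {
    const double INF = std::numeric_limits<double>::infinity();
    std::vector<double> dist(n, INF);
    dist[src] = 0.0;
    // At most n-1 rounds of relaxing every edge.
    for (int round = 0; round < n - 1; ++round) {
        bool changed = false;
        for (const Edge& e : edges)
            if (dist[e.u] + e.w < dist[e.v]) { dist[e.v] = dist[e.u] + e.w; changed = true; }
        if (!changed) break;      // early exit once distances converge
    }
    return dist;
}

int main() {
    std::vector<Edge> edges = {{0, 1, 4}, {0, 2, 1}, {2, 1, 2}, {1, 3, 1}, {2, 3, 5}};
    for (double d : bellman_ford(4, edges, 0)) std::cout << d << ' ';   // prints: 0 3 1 4
    std::cout << '\n';
}
```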

    Graph analytics on modern massively parallel systems

    Graphs provide a very flexible abstraction for understanding and modeling complex systems in many fields such as physics, biology, neuroscience, engineering, and social science. Only in the last two decades, with the advent of the Big Data era, have supercomputers equipped with accelerators (i.e., Graphics Processing Units, GPUs), advanced networking, and highly parallel file systems been used to analyze graph properties such as reachability, diameter, connected components, centrality, and clustering coefficient. Today, graphs of interest may be composed of millions, sometimes billions, of nodes and edges and exhibit a highly irregular structure. As a consequence, the design of efficient and scalable graph algorithms is an extraordinary challenge due to irregular communication and memory access patterns, high synchronization costs, and lack of data locality. In the present dissertation, we start off with a brief and gentle introduction to graph analytics and massively parallel systems. In particular, we present the intersection between graph analytics and parallel architectures in the current state of the art and discuss the challenges encountered when solving such problems on large-scale graphs on these architectures (Chapter 1). In Chapter 2, some preliminary definitions and graph-theoretical notions are provided, together with a description of the synthetic graphs used in the literature to model real-world networks. In Chapters 3-5, we present and tackle three relevant problems in graph analysis: reachability (Chapter 3), Betweenness Centrality (Chapter 4), and clustering coefficient (Chapter 5). In detail, Chapter 3 tackles reachability problems by providing two scalable algorithms and implementations which efficiently solve st-connectivity problems on very large-scale graphs. Chapter 4 considers the problem of identifying the most relevant nodes in a network, which plays a crucial role in several applications, including transportation and communication networks, social network analysis, and biological networks. In particular, we focus on a well-known centrality metric, namely Betweenness Centrality (BC), and present two different distributed algorithms for BC computation on unweighted and weighted graphs. For unweighted graphs, we present a new communication-efficient algorithm based on the combination of bi-dimensional (2D) decomposition and multi-level parallelism. Furthermore, new algorithms which exploit the underlying graph topology to reduce the time and space usage of betweenness centrality computations are described as well. Concerning weighted graphs, we provide a scalable algorithm based on an algebraic formulation of the problem. Finally, through comprehensive experimental results on synthetic and real-world large-scale graphs, we show that the proposed techniques are effective in practice and achieve significant speedups over state-of-the-art solutions. Chapter 5 considers the clustering coefficient problem. Similarly to Betweenness Centrality, it is a fundamental tool in network analysis, as it specifically measures how nodes tend to cluster together in a network. In the chapter, we first extend caching techniques to Remote Memory Access (RMA) operations on distributed-memory systems. The caching layer is mainly designed to avoid inter-node communication in order to achieve, for irregular applications, benefits similar to those of communication-avoiding algorithms. We also show how cached RMA is able to improve the performance of a new distributed asynchronous algorithm for the computation of local clustering coefficients. Finally, Chapter 6 contains a brief summary of the key contributions described in the dissertation and presents potential future directions of the work.
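    For reference, the following sequential sketch fixes the definition of the local clustering coefficient (edges among a node's neighbors divided by the number of possible neighbor pairs); the distributed, RMA-cached algorithm in the dissertation parallelizes this computation, and the example graph here is illustrative.

```cpp
// Sequential reference for the local clustering coefficient: the fraction of
// possible edges among a node's neighbors that actually exist. This only
// fixes the definition; it is not the distributed RMA-based algorithm.
#include <algorithm>
#include <iostream>
#include <vector>

double local_clustering(const std::vector<std::vector<int>>& adj, int v) {
    const std::vector<int>& nbrs = adj[v];
    const int k = static_cast<int>(nbrs.size());
    if (k < 2) return 0.0;
    int links = 0;                                   // edges among the neighbors of v
    for (int i = 0; i < k; ++i)
        for (int j = i + 1; j < k; ++j) {
            const std::vector<int>& a = adj[nbrs[i]];
            if (std::find(a.begin(), a.end(), nbrs[j]) != a.end()) ++links;
        }
    return 2.0 * links / (k * (k - 1.0));
}

int main() {
    // Undirected example graph: edges 0-1, 0-2, 1-2, 0-3 (adjacency lists).
    std::vector<std::vector<int>> adj = {{1, 2, 3}, {0, 2}, {0, 1}, {0}};
    std::cout << local_clustering(adj, 0) << '\n';   // 1 triangle / 3 pairs = 0.333...
}
```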

    DAMOV: A New Methodology and Benchmark Suite for Evaluating Data Movement Bottlenecks

    Data movement between the CPU and main memory is a first-order obstacle to improving performance, scalability, and energy efficiency in modern systems. Computer systems employ a range of techniques to reduce the overheads tied to data movement, spanning from traditional mechanisms (e.g., deep multi-level cache hierarchies, aggressive hardware prefetchers) to emerging techniques such as Near-Data Processing (NDP), where some computation is moved close to memory. Our goal is to methodically identify potential sources of data movement over a broad set of applications and to comprehensively compare traditional compute-centric data movement mitigation techniques to more memory-centric techniques, thereby developing a rigorous understanding of the best techniques to mitigate each source of data movement. With this goal in mind, we perform the first large-scale characterization of a wide variety of applications, across a wide range of application domains, to identify fundamental program properties that lead to data movement to/from main memory. We develop the first systematic methodology to classify applications based on the sources contributing to data movement bottlenecks. From our large-scale characterization of 77K functions across 345 applications, we select 144 functions to form the first open-source benchmark suite (DAMOV) for main memory data movement studies. We select a diverse range of functions that (1) represent different types of data movement bottlenecks, and (2) come from a wide range of application domains. Using NDP as a case study, we identify new insights about the different data movement bottlenecks and use these insights to determine the most suitable data movement mitigation mechanism for a particular application. We open-source DAMOV and the complete source code for our new characterization methodology at https://github.com/CMU-SAFARI/DAMOV.
    Comment: Our open-source software is available at https://github.com/CMU-SAFARI/DAMOV
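    As a hypothetical illustration of the kind of metric-driven classification such a methodology performs, the following sketch tags each profiled function with a coarse bottleneck class based on its arithmetic intensity and LLC misses per kilo-instruction; the metric set and thresholds are illustrative assumptions, not DAMOV's published methodology or cut-offs.

```cpp
// Hypothetical sketch of metric-driven bottleneck classification: given
// per-function profile metrics, tag each function with a coarse data-movement
// class. Metric names and thresholds are illustrative assumptions, not
// DAMOV's published methodology or cut-offs.
#include <iostream>
#include <string>
#include <vector>

struct FunctionProfile {
    std::string name;
    double arithmetic_intensity;  // ops per byte moved to/from main memory
    double llc_mpki;              // last-level-cache misses per kilo-instruction
};

std::string classify(const FunctionProfile& f) {
    if (f.arithmetic_intensity < 0.25 && f.llc_mpki > 10.0)
        return "DRAM-bandwidth bound (candidate for near-data processing)";
    if (f.llc_mpki > 10.0)
        return "DRAM-latency sensitive";
    return "compute/cache friendly (compute-centric techniques suffice)";
}

int main() {
    std::vector<FunctionProfile> fns = {
        {"stream_copy",   0.08, 42.0},   // example numbers, not measured data
        {"pointer_chase", 0.50, 25.0},
        {"dense_gemm",    8.00,  0.4}};
    for (const auto& f : fns)
        std::cout << f.name << ": " << classify(f) << '\n';
}
```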