3,079 research outputs found

    Improving the Performance and Energy Efficiency of GPGPU Computing through Adaptive Cache and Memory Management Techniques

    Get PDF
    Department of Computer Science and EngineeringAs the performance and energy efficiency requirement of GPGPUs have risen, memory management techniques of GPGPUs have improved to meet the requirements by employing hardware caches and utilizing heterogeneous memory. These techniques can improve GPGPUs by providing lower latency and higher bandwidth of the memory. However, these methods do not always guarantee improved performance and energy efficiency due to the small cache size and heterogeneity of the memory nodes. While prior works have proposed various techniques to address this issue, relatively little work has been done to investigate holistic support for memory management techniques. In this dissertation, we analyze performance pathologies and propose various techniques to improve memory management techniques. First, we investigate the effectiveness of advanced cache indexing (ACI) for high-performance and energy-efficient GPGPU computing. Specifically, we discuss the designs of various static and adaptive cache indexing schemes and present implementation for GPGPUs. We then quantify and analyze the effectiveness of the ACI schemes based on a cycle-accurate GPGPU simulator. Our quantitative evaluation shows that ACI schemes achieve significant performance and energy-efficiency gains over baseline conventional indexing scheme. We also analyze the performance sensitivity of ACI to key architectural parameters (i.e., capacity, associativity, and ICN bandwidth) and the cache indexing latency. We also demonstrate that ACI continues to achieve high performance in various settings. Second, we propose IACM, integrated adaptive cache management for high-performance and energy-efficient GPGPU computing. Based on the performance pathology analysis of GPGPUs, we integrate state-of-the-art adaptive cache management techniques (i.e., cache indexing, bypassing, and warp limiting) in a unified architectural framework to eliminate performance pathologies. Our quantitative evaluation demonstrates that IACM significantly improves the performance and energy efficiency of various GPGPU workloads over the baseline architecture (i.e., 98.1% and 61.9% on average, respectively) and achieves considerably higher performance than the state-of-the-art technique (i.e., 361.4% at maximum and 7.7% on average). Furthermore, IACM delivers significant performance and energy efficiency gains over the baseline GPGPU architecture even when enhanced with advanced architectural technologies (e.g., higher capacity, associativity). Third, we propose bandwidth- and latency-aware page placement (BLPP) for GPGPUs with heterogeneous memory. BLPP analyzes the characteristics of a application and determines the optimal page allocation ratio between the GPU and CPU memory. Based on the optimal page allocation ratio, BLPP dynamically allocate pages across the heterogeneous memory nodes. Our experimental results show that BLPP considerably outperforms the baseline and state-of-the-art technique (i.e., 13.4% and 16.7%) and performs similar to the static-best version (i.e., 1.2% difference), which requires extensive offline profiling.clos

    Phase Transitions and Backbones of the Asymmetric Traveling Salesman Problem

    Full text link
    In recent years, there has been much interest in phase transitions of combinatorial problems. Phase transitions have been successfully used to analyze combinatorial optimization problems, characterize their typical-case features and locate the hardest problem instances. In this paper, we study phase transitions of the asymmetric Traveling Salesman Problem (ATSP), an NP-hard combinatorial optimization problem that has many real-world applications. Using random instances of up to 1,500 cities in which intercity distances are uniformly distributed, we empirically show that many properties of the problem, including the optimal tour cost and backbone size, experience sharp transitions as the precision of intercity distances increases across a critical value. Our experimental results on the costs of the ATSP tours and assignment problem agree with the theoretical result that the asymptotic cost of assignment problem is pi ^2 /6 the number of cities goes to infinity. In addition, we show that the average computational cost of the well-known branch-and-bound subtour elimination algorithm for the problem also exhibits a thrashing behavior, transitioning from easy to difficult as the distance precision increases. These results answer positively an open question regarding the existence of phase transitions in the ATSP, and provide guidance on how difficult ATSP problem instances should be generated

    A Parameterisation of Algorithms for Distributed Constraint Optimisation via Potential Games

    No full text
    This paper introduces a parameterisation of learning algorithms for distributed constraint optimisation problems (DCOPs). This parameterisation encompasses many algorithms developed in both the computer science and game theory literatures. It is built on our insight that when formulated as noncooperative games, DCOPs form a subset of the class of potential games. This result allows us to prove convergence properties of algorithms developed in the computer science literature using game theoretic methods. Furthermore, our parameterisation can assist system designers by making the pros and cons of, and the synergies between, the various DCOP algorithm components clear

    OS-Assisted Task Preemption for Hadoop

    Full text link
    This work introduces a new task preemption primitive for Hadoop, that allows tasks to be suspended and resumed exploiting existing memory management mechanisms readily available in modern operating systems. Our technique fills the gap that exists between the two extremes cases of killing tasks (which waste work) or waiting for their completion (which introduces latency): experimental results indicate superior performance and very small overheads when compared to existing alternatives

    Tuning the Level of Concurrency in Software Transactional Memory: An Overview of Recent Analytical, Machine Learning and Mixed Approaches

    Get PDF
    Synchronization transparency offered by Software Transactional Memory (STM) must not come at the expense of run-time efficiency, thus demanding from the STM-designer the inclusion of mechanisms properly oriented to performance and other quality indexes. Particularly, one core issue to cope with in STM is related to exploiting parallelism while also avoiding thrashing phenomena due to excessive transaction rollbacks, caused by excessively high levels of contention on logical resources, namely concurrently accessed data portions. A means to address run-time efficiency consists in dynamically determining the best-suited level of concurrency (number of threads) to be employed for running the application (or specific application phases) on top of the STM layer. For too low levels of concurrency, parallelism can be hampered. Conversely, over-dimensioning the concurrency level may give rise to the aforementioned thrashing phenomena caused by excessive data contention—an aspect which has reflections also on the side of reduced energy-efficiency. In this chapter we overview a set of recent techniques aimed at building “application-specific” performance models that can be exploited to dynamically tune the level of concurrency to the best-suited value. Although they share some base concepts while modeling the system performance vs the degree of concurrency, these techniques rely on disparate methods, such as machine learning or analytic methods (or combinations of the two), and achieve different tradeoffs in terms of the relation between the precision of the performance model and the latency for model instantiation. Implications of the different tradeoffs in real-life scenarios are also discussed
    • 

    corecore