27,645 research outputs found

    Performance, Power Modeling and Optimization for High-Performance Computing Systems

    Get PDF
    University of Minnesota Ph.D. dissertation.October 2016. Major: Electrical/Computer Engineering. Advisor: John Sartori. 1 computer file (PDF); xi, 154 pages.Heterogeneity abounds in modern high-performance computing systems. Applications are heterogeneous, containing time-varying unbalanced utilization for different resources, and system architectures have become heterogeneous in order to achieve higher levels of performance and energy efficiency. The most powerful, and also the most energy-efficient high-performance computing systems today consist of many-core CPUs and GPGPUs with a variety of specialize on-chip and off-chip memories. These heterogeneous systems provide a huge amount of computing resources, but it is becoming increasingly challenging to use them effectively and efficiently to maximize their potential. This becomes an even more pressing challenge as energy efficiency becomes the primary barrier to achieving higher levels of performance. This thesis addresses the challenges of performance modeling and optimization in heterogeneous high-performance computing systems. Effective system optimization requires understanding of how performance and power change in response to optimizations. Therefore, we begin by summarizing the impact of modern architectural advances on performance and power modeling for chip multiprocessors (CMPs). We present two models that estimate the performance and power in such systems. The first model, CAMP, is a fast and accurate cache-aware performance model that estimates the performance degradation due to cache contention of processes running on cache-sharing cores. We then propose a system-level power model for a multi-programmed CMP environment that accounts for cache contention. We explain how to integrate the two models to enable power-aware process assignment. Then, we propose an off-chip memory access-aware runtime DVFS control technique that minimizes energy consumption subject to a constraint on application execution time. The second part of the dissertation focuses on improving performance for GPGPUs. After a thorough analysis on CPI breakdown, we lay out all the key factors that govern GPU throughput. In order to improve overall performance for GPGPUs, we propose two approaches that address the key factors, without introducing extra congestion and degradation to the system. We first propose a new two-level priority scheduling policy to improve overall performance by optimizing effective degree of parallelism. Then, we propose ICMT, a full, detailed solution for intra-core multitasking for GPGPUs, including architectural support and a contention-aware workload scheduling algorithm that improves all the key factors in a balanced fashion. Furthermore, we propose a new contention-aware analytical performance model that provides fine-grained workload scheduling decisions for intra-core multitasking, including detailed resource allocation from co-scheduled workloads

    Fairness-aware scheduling on single-ISA heterogeneous multi-cores

    Get PDF
    Single-ISA heterogeneous multi-cores consisting of small (e.g., in-order) and big (e.g., out-of-order) cores dramatically improve energy- and power-efficiency by scheduling workloads on the most appropriate core type. A significant body of recent work has focused on improving system throughput through scheduling. However, none of the prior work has looked into fairness. Yet, guaranteeing that all threads make equal progress on heterogeneous multi-cores is of utmost importance for both multi-threaded and multi-program workloads to improve performance and quality-of-service. Furthermore, modern operating systems affinitize workloads to cores (pinned scheduling) which dramatically affects fairness on heterogeneous multi-cores. In this paper, we propose fairness-aware scheduling for single-ISA heterogeneous multi-cores, and explore two flavors for doing so. Equal-time scheduling runs each thread or workload on each core type for an equal fraction of the time, whereas equal-progress scheduling strives at getting equal amounts of work done on each core type. Our experimental results demonstrate an average 14% (and up to 25%) performance improvement over pinned scheduling through fairness-aware scheduling for homogeneous multi-threaded workloads; equal-progress scheduling improves performance by 32% on average for heterogeneous multi-threaded workloads. Further, we report dramatic improvements in fairness over prior scheduling proposals for multi-program workloads, while achieving system throughput comparable to throughput-optimized scheduling, and an average 21% improvement in throughput over pinned scheduling

    HeteroCore GPU to exploit TLP-resource diversity

    Get PDF

    POSTER: Exploiting asymmetric multi-core processors with flexible system sofware

    Get PDF
    Energy efficiency has become the main challenge for high performance computing (HPC). The use of mobile asymmetric multi-core architectures to build future multi-core systems is an approach towards energy savings while keeping high performance. However, it is not known yet whether such systems are ready to handle parallel applications. This paper fills this gap by evaluating emerging parallel applications on an asymmetric multi-core. We make use of the PARSEC benchmark suite and a processor that implements the ARM big.LITTLE architecture. We conclude that these applications are not mature enough to run on such systems, as they suffer from load imbalance. Furthermore, we explore the behaviour of dynamic scheduling solutions on either the Operating System (OS) or the runtime level. Comparing these approaches shows us that the most efficient scheduling takes place in the runtime level, influencing the future research towards such solutions.This work has been supported by the Spanish Government (SEV2015-0493), by the Spanish Ministry of Science and Innovation (contracts TIN2015-65316-P), by Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), by the RoMoL ERC Advanced Grant (GA 321253) and the European HiPEAC Network of Excellence. The Mont-Blanc project receives funding from the EU's Seventh Framework Programme (FP7/2007-2013) under grant agreement number 610402 and from the EU's H2020 Framework Programme (H2020/2014-2020) under grant agreement number 671697. M. Moretó has been partially supported by the Ministry of Economy and Competitiveness under Juan de la Cierva postdoctoral fellowship number JCI-2012-15047. M. Casas is supported by the Secretary for Universities and Research of the Ministry of Economy and Knowledge of the Government of Catalonia and the Cofund programme of the Marie Curie Actions of the 7th R&D Framework Programme of the European Union (Contract 2013 BP B 00243).Peer ReviewedPostprint (author's final draft

    COLAB:A Collaborative Multi-factor Scheduler for Asymmetric Multicore Processors

    Get PDF
    Funding: Partially funded by the UK EPSRC grants Discovery: Pattern Discovery and Program Shaping for Many-core Systems (EP/P020631/1) and ABC: Adaptive Brokerage for Cloud (EP/R010528/1); Royal Academy of Engineering under the Research Fellowship scheme.Increasingly prevalent asymmetric multicore processors (AMP) are necessary for delivering performance in the era of limited power budget and dark silicon. However, the software fails to use them efficiently. OS schedulers, in particular, handle asymmetry only under restricted scenarios. We have efficient symmetric schedulers, efficient asymmetric schedulers for single-threaded workloads, and efficient asymmetric schedulers for single program workloads. What we do not have is a scheduler that can handle all runtime factors affecting AMP for multi-threaded multi-programmed workloads. This paper introduces the first general purpose asymmetry-aware scheduler for multi-threaded multi-programmed workloads. It estimates the performance of each thread on each type of core and identifies communication patterns and bottleneck threads. The scheduler then makes coordinated core assignment and thread selection decisions that still provide each application its fair share of the processor's time. We evaluate our approach using the GEM5 simulator on four distinct big.LITTLE configurations and 26 mixed workloads composed of PARSEC and SPLASH2 benchmarks. Compared to the state-of-the art Linux CFS and AMP-aware schedulers, we demonstrate performance gains of up to 25% and 5% to 15% on average depending on the hardware setup.Postprin
    • …
    corecore