98 research outputs found

    TaskInsight: Understanding Task Schedules Effects on Memory and Performance

    Get PDF
    Recent scheduling heuristics for task-based applications have managed to improve their by taking into account memory-related properties such as data locality and cache sharing. However, there is still a general lack of tools that can provide insights into why, and where, different schedulers improve memory behavior, and how this is related to the applications' performance. To address this, we present TaskInsight, a technique to characterize the memory behavior of different task schedulers through the analysis of data reuse between tasks. TaskInsight provides high-level, quantitative information that can be correlated with tasks' performance variation over time to understand data reuse through the caches due to scheduling choices. TaskInsight is useful to diagnose and identify which scheduling decisions affected performance, when were they taken, and why the performance changed, both in single and multi-threaded executions. We demonstrate how TaskInsight can diagnose examples where poor scheduling caused over 10% difference in performance for tasks of the same type, due to changes in the tasks' data reuse through the private and shared caches, in single and multi-threaded executions of the same application. This flexible insight is key for optimization in many contexts, including data locality, throughput, memory footprint or even energy efficiency.We thank the reviewers for their feedback. This work was supported by the Swedish Research Council, the Swedish Foundation for Strategic Research project FFL12-0051 and carried out within the Linnaeus Centre of Excellence UPMARC, Uppsala Programming for Multicore Architectures Research Center. This paper was also published with the support of the HiPEAC network that received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement no. 687698.Peer ReviewedPostprint (published version

    Intelligent Scheduling and Memory Management Techniques for Modern GPU Architectures

    Get PDF
    abstract: With the massive multithreading execution feature, graphics processing units (GPUs) have been widely deployed to accelerate general-purpose parallel workloads (GPGPUs). However, using GPUs to accelerate computation does not always gain good performance improvement. This is mainly due to three inefficiencies in modern GPU and system architectures. First, not all parallel threads have a uniform amount of workload to fully utilize GPU’s computation ability, leading to a sub-optimal performance problem, called warp criticality. To mitigate the degree of warp criticality, I propose a Criticality-Aware Warp Acceleration mechanism, called CAWA. CAWA predicts and accelerates the critical warp execution by allocating larger execution time slices and additional cache resources to the critical warp. The evaluation result shows that with CAWA, GPUs can achieve an average of 1.23x speedup. Second, the shared cache storage in GPUs is often insufficient to accommodate demands of the large number of concurrent threads. As a result, cache thrashing is commonly experienced in GPU’s cache memories, particularly in the L1 data caches. To alleviate the cache contention and thrashing problem, I develop an instruction aware Control Loop Based Adaptive Bypassing algorithm, called Ctrl-C. Ctrl-C learns the cache reuse behavior and bypasses a portion of memory requests with the help of feedback control loops. The evaluation result shows that Ctrl-C can effectively improve cache utilization in GPUs and achieve an average of 1.42x speedup for cache sensitive GPGPU workloads. Finally, GPU workloads and the co-located processes running on the host chip multiprocessor (CMP) in a heterogeneous system setup can contend for memory resources in multiple levels, resulting in significant performance degradation. To maximize the system throughput and balance the performance degradation of all co-located applications, I design a scalable performance degradation predictor specifically for heterogeneous systems, called HeteroPDP. HeteroPDP predicts the application execution time and schedules OpenCL workloads to run on different devices based on the optimization goal. The evaluation result shows HeteroPDP can improve the system fairness from 24% to 65% when an OpenCL application is co-located with other processes, and gain an additional 50% speedup compared with always offloading the OpenCL workload to GPUs. In summary, this dissertation aims to provide insights for the future microarchitecture and system architecture designs by identifying, analyzing, and addressing three critical performance problems in modern GPUs.Dissertation/ThesisDoctoral Dissertation Computer Engineering 201

    Performance and Power Analysis of HPC Workloads on Heterogenous Multi-Node Clusters

    Get PDF
    Performance analysis tools allow application developers to identify and characterize the inefficiencies that cause performance degradation in their codes, allowing for application optimizations. Due to the increasing interest in the High Performance Computing (HPC) community towards energy-efficiency issues, it is of paramount importance to be able to correlate performance and power figures within the same profiling and analysis tools. For this reason, we present a performance and energy-efficiency study aimed at demonstrating how a single tool can be used to collect most of the relevant metrics. In particular, we show how the same analysis techniques can be applicable on different architectures, analyzing the same HPC application on a high-end and a low-power cluster. The former cluster embeds Intel Haswell CPUs and NVIDIA K80 GPUs, while the latter is made up of NVIDIA Jetson TX1 boards, each hosting an Arm Cortex-A57 CPU and an NVIDIA Tegra X1 Maxwell GPU.The research leading to these results has received funding from the European Community’s Seventh Framework Programme [FP7/2007-2013] and Horizon 2020 under the Mont-Blanc projects [17], grant agreements n. 288777, 610402 and 671697. E.C. was partially founded by “Contributo 5 per mille assegnato all’Università degli Studi di Ferrara-dichiarazione dei redditi dell’anno 2014”. We thank the University of Ferrara and INFN Ferrara for the access to the COKA Cluster. We warmly thank the BSC tools group, supporting us for the smooth integration and test of our setup within Extrae and Paraver.Peer ReviewedPostprint (published version

    End-to-End Application Cloning for Distributed Cloud Microservices with Ditto

    Full text link
    We present Ditto, an automated framework for cloning end-to-end cloud applications, both monolithic and microservices, which captures I/O and network activity, as well as kernel operations, in addition to application logic. Ditto takes a hierarchical approach to application cloning, starting with capturing the dependency graph across distributed services, to recreating each tier's control/data flow, and finally generating system calls and assembly that mimics the individual applications. Ditto does not reveal the logic of the original application, facilitating publicly sharing clones of production services with hardware vendors, cloud providers, and the research community. We show that across a diverse set of single- and multi-tier applications, Ditto accurately captures their CPU and memory characteristics as well as their high-level performance metrics, is portable across platforms, and facilitates a wide range of system studies

    Performance and power comparisons between Fermi and Cypress GPUs

    Get PDF
    In recent years, modern graphics processing units have been widely adopted in high performance computing areas to solve large scale computation problems. The leading GPU manufacturers Nvidia and AMD have introduced series of products to the market. While sharing many similar design concepts, GPUs from these two manufacturers differ in several aspects on processor cores and the memory subsystem. In this work, we conduct a comprehensive study to characterize and compare the architectural features of Nvidia’s Fermi and AMD’s Cypress GPUs. We first investigate the performance and power consumptions of an AMD Cypress GPU. By employing a rigorous statistical model to analyze the execution behaviors of representative general-purpose GPU (GPGPU) applications, we conduct insightful investigations on the target GPU architecture. Our results demonstrate that the GPU execution throughput and the power dissipation are dependent on different architectural variables. Furthermore, we design a set of micro-benchmarks to study the power consumption features of different function units on the GPU. Based on those results, we derive instructive principles that can guide the design of power-efficient high performance computing systems. We then make the concentration shift to the Nvidia Fermi GPU and compare it with the product from AMD. Our results indicate that these two products have diverse advantages that are reflected in their performance for different sets of applications. In addition, we also compare the energy efficiencies of these two platforms since power/energy consumption is a major concern in the high performance computing system

    A Retargetable System-Level DBT Hypervisor

    Get PDF
    System-level Dynamic Binary Translation (DBT) provides the capability to boot an Operating System (OS) and execute programs compiled for an Instruction Set Architecture (ISA) different to that of the host machine. Due to their performance critical nature, system-level DBT frameworks are typically hand-coded and heavily optimized, both for their guest and host architectures. While this results in good performance of the DBT system, engineering costs for supporting a new, or extending an existing architecture are high. In this paper we develop a novel, retargetable DBT hypervisor, which includes guest specific modules generated from high-level guest machine specifications. Our system simplifies retargeting of the DBT, but it also delivers performance levels in excess of existing manually created DBT solutions. We achieve this by combining offline and online optimizations, and exploiting the freedom of a Just-in-time (JIT) compiler operating in a bare-metal environment provided by a Virtual Machine (VM) hypervisor. We evaluate our DBT using both targeted micro-benchmarks as well as standard application benchmarks, and we demonstrate its ability to outperform the de-facto standard QEMU DBT system. Our system delivers an average speedup of 2.21× over QEMU across SPEC CPU2006 integer benchmarks running in a full-system Linux OS environment, compiled for the 64-bit ARMv8-A ISA and hosted on an x86-64 platform. For floating-point applications the speedup is even higher, reaching 6.49× on average. We demonstrate that our system-level DBT system significantly reduces the effort required to support a new ISA, while delivering outstanding performance.Publisher PD

    Mechanistic analytical modeling of superscalar in-order processor performance

    Get PDF
    Superscalar in-order processors form an interesting alternative to out-of-order processors because of their energy efficiency and lower design complexity. However, despite the reduced design complexity, it is nontrivial to get performance estimates or insight in the application--microarchitecture interaction without running slow, detailed cycle-level simulations, because performance highly depends on the order of instructions within the application’s dynamic instruction stream, as in-order processors stall on interinstruction dependences and functional unit contention. To limit the number of detailed cycle-level simulations needed during design space exploration, we propose a mechanistic analytical performance model that is built from understanding the internal mechanisms of the processor. The mechanistic performance model for superscalar in-order processors is shown to be accurate with an average performance prediction error of 3.2% compared to detailed cycle-accurate simulation using gem5. We also validate the model against hardware, using the ARM Cortex-A8 processor and show that it is accurate within 10% on average. We further demonstrate the usefulness of the model through three case studies: (1) design space exploration, identifying the optimum number of functional units for achieving a given performance target; (2) program--machine interactions, providing insight into microarchitecture bottlenecks; and (3) compiler--architecture interactions, visualizing the impact of compiler optimizations on performance

    SyRA: early system reliability analysis for cross-layer soft errors resilience in memory arrays of microprocessor systems

    Get PDF
    © 2018 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes,creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.Cross-layer reliability is becoming the preferred solution when reliability is a concern in the design of a microprocessor-based system. Nevertheless, deciding how to distribute the error management across the different layers of the system is a very complex task that requires the support of dedicated frameworks for cross-layer reliability analysis. This paper proposes SyRA, a system-level cross-layer early reliability analysis framework for radiation induced soft errors in memory arrays of microprocessor-based systems. The framework exploits a multi-level hybrid Bayesian model to describe the target system and takes advantage of Bayesian inference to estimate different reliability metrics. SyRA implements several mechanisms and features to deal with the complexity of realistic models and implements a complete tool-chain that scales efficiently with the complexity of the system. The simulation time is significantly lower than micro-architecture level or RTL fault-injection experiments with an accuracy high enough to take effective design decisions. To demonstrate the capability of SyRA, we analyzed the reliability of a set of microprocessor-based systems characterized by different microprocessor architectures (i.e., Intel x86, ARM Cortex-A15, ARM Cortex-A9) running both the Linux operating system or bare metal. Each system under analysis executes different software workloads both from benchmark suites and from real applications.Peer ReviewedPostprint (author's final draft
    • 

    corecore