19 research outputs found

    Intelligent Management of Inter-Thread Synchronization Dependencies for Concurrent Programs.

    Full text link
    Power dissipation limits and design complexity have made the microprocessor industry less successful in improving the performance of monolithic processors, even though semiconductor technology continues to scale. Consequently, chip multiprocessors (CMPs) have become a standard for all ranges of computing from cellular phones to high-performance servers. As sufficient thread level parallelism (TLP) is necessary to exploit the computational power provided by CMPs, most performance-aware programmers need to parallelize their programs. For shared memory multi-threaded programs, synchronization mechanisms such as mutexes, barriers, and condition variables, are used to enforce the threads to interact with each other in the way the programmers intended. However, employing synchronization operations in both correct and efficient way at the same time is extremely difficult, and there have been trade-offs between programmability and efficiency of using synchronizations. This thesis proposes a collection of works that increase the programmability and efficiency of concurrent programs by intelligently managing the synchronization operations. First, we focus on mutex locks and unlocks. Many concurrency bug detection tools and automated bug fixers rely on the precise identification of critical sections guarded by lock/unlock operations. We suggest a practical lock/unlock pairing mechanism that combines static analysis with dynamic instrumentation to identify critical sections in POSIX multi-threaded C/C++ programs. Second, we present Dynamic Core Boosting (DCB) to accelerate critical paths in multi-thread programs. Inter-thread dependencies through synchronizations form critical paths. These critical paths are major performance bottlenecks for concurrent programs, and they are exacerbated by workload imbalances in performance asymmetric CMPs. DCB coordinates its compiler, runtime subsystem, and architecture to mitigates such performance bottlenecks. Finally, we propose exploiting synchronization operations for better energy efficiency through dynamic power management.PhDComputer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/108886/1/netforce_1.pd

    Adaptive Knobs for Resource Efficient Computing

    Get PDF
    Performance demands of emerging domains such as artificial intelligence, machine learning and vision, Internet-of-things etc., continue to grow. Meeting such requirements on modern multi/many core systems with higher power densities, fixed power and energy budgets, and thermal constraints exacerbates the run-time management challenge. This leaves an open problem on extracting the required performance within the power and energy limits, while also ensuring thermal safety. Existing architectural solutions including asymmetric and heterogeneous cores and custom acceleration improve performance-per-watt in specific design time and static scenarios. However, satisfying applications’ performance requirements under dynamic and unknown workload scenarios subject to varying system dynamics of power, temperature and energy requires intelligent run-time management. Adaptive strategies are necessary for maximizing resource efficiency, considering i) diverse requirements and characteristics of concurrent applications, ii) dynamic workload variation, iii) core-level heterogeneity and iv) power, thermal and energy constraints. This dissertation proposes such adaptive techniques for efficient run-time resource management to maximize performance within fixed budgets under unknown and dynamic workload scenarios. Resource management strategies proposed in this dissertation comprehensively consider application and workload characteristics and variable effect of power actuation on performance for pro-active and appropriate allocation decisions. Specific contributions include i) run-time mapping approach to improve power budgets for higher throughput, ii) thermal aware performance boosting for efficient utilization of power budget and higher performance, iii) approximation as a run-time knob exploiting accuracy performance trade-offs for maximizing performance under power caps at minimal loss of accuracy and iv) co-ordinated approximation for heterogeneous systems through joint actuation of dynamic approximation and power knobs for performance guarantees with minimal power consumption. The approaches presented in this dissertation focus on adapting existing mapping techniques, performance boosting strategies, software and dynamic approximations to meet the performance requirements, simultaneously considering system constraints. The proposed strategies are compared against relevant state-of-the-art run-time management frameworks to qualitatively evaluate their efficacy

    Investigation into scalable energy and performance models for many-core systems

    Get PDF
    PhD ThesisIt is likely that many-core processor systems will continue to penetrate emerging embedded and high-performance applications. Scalable energy and performance models are two critical aspects that provide insights into the conflicting trade-offs between them with growing hardware and software complexity. Traditional performance models, such as Amdahl’s Law, Gustafson’s and Sun-Ni’s, have helped the research community and industry to better understand the system performance bounds with given processing resources, which is otherwise known as speedup. However, these models and their existing extensions have limited applicability for energy and/or performance-driven system optimization in practical systems. For instance, these are typically based on software characteristics, assuming ideal and homogeneous hardware platforms or limited forms of processor heterogeneity. In addition, the measurement of speedup and parallelization factors of an application running on a specific hardware platform require instrumenting the original software codes. Indeed, practical speedup and parallelizability models of application workloads running on modern heterogeneous hardware are critical for energy and performance models, as they can be used to inform design and control decisions with an aim to improve system throughput and energy efficiency. This thesis addresses the limitations by firstly developing novel and scalable speedup and energy consumption models based on a more general representation of heterogeneity, referred to as the normal form heterogeneity. A method is developed whereby standard performance counters found in modern many-core platforms can be used to derive speedup, and therefore the parallelizability of the software, without instrumenting applications. This extends the usability of the new models to scenarios where the parallelizability of software is unknown, leading to potentially Run-Time Management (RTM) speedup and/or energy efficiency optimization. The models and optimization methods presented in this thesis are validated through extensive experimentation, by running a number of different applications in wide-ranging concurrency scenarios on a number of different homogeneous and heterogeneous Multi/Many Core Processor (M/MCP) systems. These include homogeneous and heterogeneous architectures and viii range from existing off-the-shelf platforms to potential future system extensions. The practical use of these models and methods is demonstrated through real examples such as studying the effectiveness of the system load balancer. The models and methodologies proposed in this thesis provide guidance to a new opportunities for improving the energy efficiency of M/MCP systemsHigher Committee of Education Development (HCED) in Ira

    On Energy Efficient Computing Platforms

    Get PDF
    In accordance with the Moore's law, the increasing number of on-chip integrated transistors has enabled modern computing platforms with not only higher processing power but also more affordable prices. As a result, these platforms, including portable devices, work stations and data centres, are becoming an inevitable part of the human society. However, with the demand for portability and raising cost of power, energy efficiency has emerged to be a major concern for modern computing platforms. As the complexity of on-chip systems increases, Network-on-Chip (NoC) has been proved as an efficient communication architecture which can further improve system performances and scalability while reducing the design cost. Therefore, in this thesis, we study and propose energy optimization approaches based on NoC architecture, with special focuses on the following aspects. As the architectural trend of future computing platforms, 3D systems have many bene ts including higher integration density, smaller footprint, heterogeneous integration, etc. Moreover, 3D technology can signi cantly improve the network communication and effectively avoid long wirings, and therefore, provide higher system performance and energy efficiency. With the dynamic nature of on-chip communication in large scale NoC based systems, run-time system optimization is of crucial importance in order to achieve higher system reliability and essentially energy efficiency. In this thesis, we propose an agent based system design approach where agents are on-chip components which monitor and control system parameters such as supply voltage, operating frequency, etc. With this approach, we have analysed the implementation alternatives for dynamic voltage and frequency scaling and power gating techniques at different granularity, which reduce both dynamic and leakage energy consumption. Topologies, being one of the key factors for NoCs, are also explored for energy saving purpose. A Honeycomb NoC architecture is proposed in this thesis with turn-model based deadlock-free routing algorithms. Our analysis and simulation based evaluation show that Honeycomb NoCs outperform their Mesh based counterparts in terms of network cost, system performance as well as energy efficiency.Siirretty Doriast

    Embedded System Design

    Get PDF
    A unique feature of this open access textbook is to provide a comprehensive introduction to the fundamental knowledge in embedded systems, with applications in cyber-physical systems and the Internet of things. It starts with an introduction to the field and a survey of specification models and languages for embedded and cyber-physical systems. It provides a brief overview of hardware devices used for such systems and presents the essentials of system software for embedded systems, including real-time operating systems. The author also discusses evaluation and validation techniques for embedded systems and provides an overview of techniques for mapping applications to execution platforms, including multi-core platforms. Embedded systems have to operate under tight constraints and, hence, the book also contains a selected set of optimization techniques, including software optimization techniques. The book closes with a brief survey on testing. This fourth edition has been updated and revised to reflect new trends and technologies, such as the importance of cyber-physical systems (CPS) and the Internet of things (IoT), the evolution of single-core processors to multi-core processors, and the increased importance of energy efficiency and thermal issues

    Embedded System Design

    Get PDF
    A unique feature of this open access textbook is to provide a comprehensive introduction to the fundamental knowledge in embedded systems, with applications in cyber-physical systems and the Internet of things. It starts with an introduction to the field and a survey of specification models and languages for embedded and cyber-physical systems. It provides a brief overview of hardware devices used for such systems and presents the essentials of system software for embedded systems, including real-time operating systems. The author also discusses evaluation and validation techniques for embedded systems and provides an overview of techniques for mapping applications to execution platforms, including multi-core platforms. Embedded systems have to operate under tight constraints and, hence, the book also contains a selected set of optimization techniques, including software optimization techniques. The book closes with a brief survey on testing. This fourth edition has been updated and revised to reflect new trends and technologies, such as the importance of cyber-physical systems (CPS) and the Internet of things (IoT), the evolution of single-core processors to multi-core processors, and the increased importance of energy efficiency and thermal issues

    Methoden und Beschreibungssprachen zur Modellierung und Verifikation vonSchaltungen und Systemen: MBMV 2015 - Tagungsband, Chemnitz, 03. - 04. März 2015

    Get PDF
    Der Workshop Methoden und Beschreibungssprachen zur Modellierung und Verifikation von Schaltungen und Systemen (MBMV 2015) findet nun schon zum 18. mal statt. Ausrichter sind in diesem Jahr die Professur Schaltkreis- und Systementwurf der Technischen Universität Chemnitz und das Steinbeis-Forschungszentrum Systementwurf und Test. Der Workshop hat es sich zum Ziel gesetzt, neueste Trends, Ergebnisse und aktuelle Probleme auf dem Gebiet der Methoden zur Modellierung und Verifikation sowie der Beschreibungssprachen digitaler, analoger und Mixed-Signal-Schaltungen zu diskutieren. Er soll somit ein Forum zum Ideenaustausch sein. Weiterhin bietet der Workshop eine Plattform für den Austausch zwischen Forschung und Industrie sowie zur Pflege bestehender und zur Knüpfung neuer Kontakte. Jungen Wissenschaftlern erlaubt er, ihre Ideen und Ansätze einem breiten Publikum aus Wissenschaft und Wirtschaft zu präsentieren und im Rahmen der Veranstaltung auch fundiert zu diskutieren. Sein langjähriges Bestehen hat ihn zu einer festen Größe in vielen Veranstaltungskalendern gemacht. Traditionell sind auch die Treffen der ITGFachgruppen an den Workshop angegliedert. In diesem Jahr nutzen zwei im Rahmen der InnoProfile-Transfer-Initiative durch das Bundesministerium für Bildung und Forschung geförderte Projekte den Workshop, um in zwei eigenen Tracks ihre Forschungsergebnisse einem breiten Publikum zu präsentieren. Vertreter der Projekte Generische Plattform für Systemzuverlässigkeit und Verifikation (GPZV) und GINKO - Generische Infrastruktur zur nahtlosen energetischen Kopplung von Elektrofahrzeugen stellen Teile ihrer gegenwärtigen Arbeiten vor. Dies bereichert denWorkshop durch zusätzliche Themenschwerpunkte und bietet eine wertvolle Ergänzung zu den Beiträgen der Autoren. [... aus dem Vorwort

    GPU PERFORMANCE MODELLING AND OPTIMIZATION

    Get PDF
    Ph.DNUS-TU/E JOINT PH.D

    Energy-efficient mobile GPU systems

    Get PDF
    The design of mobile GPUs is all about saving energy. Smartphones and tablets are battery-operated and thus any type of rendering needs to use as little energy as possible. Furthermore, smartphones do not include sophisticated cooling systems due to their small size, making heat dissipation a primary concern. Improving the energy-efficiency of mobile GPUs will be absolutely necessary to achieve the performance required to satisfy consumer expectations, while maintaining operating time per battery charge and keeping the GPU in its thermal limits. The first step in optimizing energy consumption is to identify the sources of energy drain. Previous studies have demonstrated that the register file is one of the main sources of energy consumption in a GPU. As graphics workloads are highly data- and memory-parallel, GPUs rely on massive multithreading to hide the memory latency and keep the functional units busy. However, aggressive multithreading requires a huge register file to keep the registers of thousands of simultaneous threads. Such a big register file exceeds the power budget typically available for an embedded graphics processors and, hence, more energy-efficient memory latency tolerance techniques are necessary. On the other hand, prior research showed that the off-chip accesses to system memory are one of the most expensive operations in terms of energy in a mobile GPU. Therefore, optimizing memory bandwidth usage is a primary concern in mobile GPU design. Many bandwidth saving techniques, such as texture compression or ARM's transaction elimination, have been proposed in both industry and academia. The purpose of this thesis is to study the characteristics of mobile graphics processors and mobile workloads in order to propose different energy saving techniques specifically tailored for the low-power segment. Firstly, we focus on energy-efficient memory latency tolerance. We analyze several techniques such as multithreading and prefetching and conclude that they are effective but not energy-efficient. Next, we propose an architecture for the fragment processors of a mobile GPU that is based on the decoupled access/execute paradigm. The results obtained by using a cycle-accurate mobile GPU simulator and several commercial Android games show that the decoupled architecture combined with a small degree of multithreading provides the most energy efficient solution for hiding memory latency. More specifically, the decoupled access/execute-like design with just 4 SIMD threads/processor is able to achieve 97% of the performance of a larger GPU with 16 SIMD threads/processor, while providing 20.5% energy savings on average. Secondly, we focus on optimizing memory bandwidth in a mobile GPU. We analyze the bandwidth usage in a set of commercial Android games and find that most of the bandwidth is employed for fetching textures, and also that consecutive frames share most of the texture dataset as they tend to be very similar. However, the GPU cannot capture inter-frame texture re-use due to the big size of the texture dataset for one frame. Based on this analysis, we propose Parallel Frame Rendering (PFR), a technique that overlaps the processing of multiple frames in order to exploit inter-frame texture re-use and save bandwidth. By processing multiple frames in parallel textures are fetched once every two frames instead of being fetched in a frame basis as in conventional GPUs. PFR provides 23.8% memory bandwidth savings on average in our set of Android games, that result in 12% speedup and 20.1% energy savings. Finally, we improve PFR by introducing a hardware memoization system on top. We analyze the redundancy in mobile games and find that more than 38% of the Fragment Program executions are redundant on average. We thus propose a task-level hardware-based memoization system that provides 15% speedup and 12% energy savings on average over a PFR-enabled GPU.El diseño de las GPUs (Graphics Procesing Units) móviles se centra fundamentalmente en el ahorro energético. Los smartphones y las tabletas son dispositivos alimentados mediante baterías y, por lo tanto, cualquier tipo de renderizado debe utilizar la menor cantidad de energía posible. Mejorar la eficiencia energética de las GPUs móviles será absolutamente necesario para alcanzar el rendimiento requirido para satisfacer las expectativas de los usuarios, sin reducir el tiempo de vida de la batería. El primer paso para optimizar el consumo energético consiste en identificar qué componentes son los principales consumidores de la batería. Estudios anteriores han identificado al banco de registros y a los accessos a memoria principal como las mayores fuentes de consumo energético en una GPU. El propósito de esta tesis es estudiar las características de los procesadores gráficos móviles y de las aplicaciones móviles con el objetivo de proponer distintas técnicas de ahorro energético. En primer lugar, la investigación se centra en desarrollar métodos energéticamente eficientes para ocultar la latencia de la memoria principal. El resultado de la investigación es una arquitectura desacoplada para los Fragment Processors de la GPU. Los resultados experimentales utilizando un simulador de ciclo y distintos juegos de Android muestran que una arquitectura desacoplada, combinada con un nivel de multithreading moderado, proporciona la solución más eficiente desde el punto de vista energético para ocultar la latencia de la memoria prinicipal. Más específicamente, la arquitectura desacoplada con sólo 4 SIMD threads/processor es capaz de alcanzar el 97% del rendimiento de una GPU más grande con 16 SIMD threads/processor, al tiempo que se reduce el consumo energético en un 20.5%. En segundo lugar, el trabajo de investigación se centró en optimizar el ancho de banda en una GPU móvil. Se realizó un estudio del uso del ancho de banda en distintos juegos de Android y se observó que la mayor parte del ancho de banda se utiliza para leer texturas. Además, se observó que frames consecutivos comparten una gran parte de las texturas. Sin embargo, la GPU no puede capturar el reuso de texturas entre frames dado que el tamaño de las texturas utilizadas por un frame es mucho mayor que la caché de segundo nivel. Basándose en este análisis, se desarrolló Parallel Frame Rendering (PFR), una técnica que solapa el procesado de multiples frames consecutivos con el objetivo de explotar el reuso de texturas entre frames y ahorrar así ancho de bando. Al procesar múltiples frames en paralelo las texturas se leen de memoria principal una vez cada dos frames en lugar de leerse en cada frame como sucede en una GPU convencional. PFR proporciona un ahorro del 23.8% en ancho de banda en promedio para distintos juegos de Android, este ahorro de ancho de banda redunda en un incremento del rendimiento del 12% y un ahorro energético del 20.1%. Por último, se mejoró PFR introduciendo un sistema hardware capaz de evitar cómputos redundantes. Un análisis de distintos juegos de Android reveló que más de un 38% de las ejecuciones del Fragment Program eran redundantes en promedio. Así pues, se propuso un sistema hardware capaz de identificar y eliminar parte de los cómputos y accessos a memoria redundantes, dicho sistema proporciona un incremento del rendimiento del 15% y un ahorro energético del 12% en promedio con respecto a una GPU móvil basada en PFR
    corecore