
    Parallel architectures and runtime systems co-design for task-based programming models

    The increasing parallelism of modern computing systems has made a holistic vision necessary when designing multiprocessor architectures, one that takes into account the needs of the programming models and applications. Nowadays, system design consists of several layers stacked on top of each other, from the architecture up to the application software. Although this layering provides a separation of concerns, where each layer can be changed independently thanks to well-defined interfaces between them, it hampers future system design as Moore's Law comes to an end. Current performance improvements in computer architecture are driven by shrinking the transistor channel width, which allows faster and more power-efficient chips to be made. However, technology is reaching physical limits where the transistor size can no longer be reduced, which calls for a change of paradigm in system design. This thesis proposes to break this layered design and advocates a system where the architecture and the programming-model runtime system exchange information towards a common goal: improving performance and reducing power consumption. By making the architecture aware of runtime information, such as the Task Dependency Graph (TDG) in dataflow task-based programming models, power consumption can be reduced by exploiting the critical path of the graph. Moreover, the architecture can provide hardware support to build such a graph, reducing runtime overheads and making the execution of fine-grained tasks possible, which increases the available parallelism. Finally, the current status of inter-node communication primitives can be exposed to the runtime system to perform more efficient communication scheduling, creating new opportunities for computation-communication overlap that were not possible before. The thesis provides an evaluation of these proposals, along with a methodology to simulate and characterize application behavior.
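
    A minimal sketch of the critical-path idea from this abstract: given a task dependency graph with task durations, tasks off the critical path can run at a reduced frequency without lengthening execution. The task names, durations, and the two frequency levels below are illustrative assumptions, not values from the thesis.

```python
# Critical-path-aware frequency selection over a task dependency graph (TDG).
from collections import defaultdict

def critical_tasks(durations, deps):
    """Return the set of tasks lying on the critical path of the TDG.

    durations: {task: time}; deps: iterable of (predecessor, successor).
    """
    succs, preds = defaultdict(list), defaultdict(list)
    for u, v in deps:
        succs[u].append(v)
        preds[v].append(u)

    # Longest path ending at each task (earliest possible finish time).
    finish = {}
    def ef(t):
        if t not in finish:
            finish[t] = durations[t] + max((ef(p) for p in preds[t]), default=0)
        return finish[t]

    # Longest path starting at each task (tail length).
    tail = {}
    def tl(t):
        if t not in tail:
            tail[t] = durations[t] + max((tl(s) for s in succs[t]), default=0)
        return tail[t]

    cp_len = max(ef(t) for t in durations)
    # A task is critical if the longest path through it spans the whole graph.
    return {t for t in durations if ef(t) + tl(t) - durations[t] == cp_len}

def assign_frequencies(durations, deps, f_high=2.0, f_low=1.2):
    """Keep critical-path tasks at f_high (GHz) and slow the rest to f_low."""
    critical = critical_tasks(durations, deps)
    return {t: (f_high if t in critical else f_low) for t in durations}

if __name__ == "__main__":
    durations = {"a": 4, "b": 2, "c": 6, "d": 3}
    deps = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
    print(assign_frequencies(durations, deps))
    # {'a': 2.0, 'b': 1.2, 'c': 2.0, 'd': 2.0}
```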

    On-Chip Network Design: Mapping, Management, and Routing

    For decades, advances in semiconductor technology have led us to the era of many-core systems. Today's desktop computers already have multi-core processors, and chips with more than a hundred cores are commercially available. As a communication medium for such a large number of cores, the network-on-chip (NoC) has emerged and is now widely used by researchers and companies. Adopting an NoC for a many-core system raises many problems, and this thesis tries to solve some of them. The second chapter addresses mapping and scheduling of tasks on NoC-based CMP architectures. Although a number of papers on mapping for NoCs have been published, our work reveals that selecting the communication type between shared memory and message passing can help improve performance and energy efficiency. Additionally, our framework supports scheduling applications containing backward dependencies with the help of modified modulo scheduling. Evolving SoCs through 3D stacking brings a number of new problems, and the thermal problem coming from increased power density is one of them. In the third chapter, we try to mitigate the hotspot problem using DVFS techniques. Assuming that all routers as well as cores can control voltage and frequency individually, we find the voltage-frequency pairs for all cores and routers that yield the best performance within the given thermal constraint. The fourth and fifth chapters take a different angle. In 3D stacking, inter-layer interconnections are implemented using through-silicon vias (TSVs). TSVs take much more area than normal wires, and they consume silicon area as well as metal area, so designers may want to limit the number of TSVs used in their network. To limit the TSV count, there are two options: reduce the width of each vertical link, or use fewer vertical links, which results in a partially connected network. We present a routing methodology for each case. For the network with reduced-bandwidth vertical links, we propose deflection routing to mitigate the long latency of the vertical links. By balancing the vertical traffic properly, the algorithm provides improved latency, and removing the router buffers yields a large reduction in area and energy. For partially connected networks, we introduce a set of routing rules for selecting the vertical links. At the expense of sacrificing some routing freedom, the proposed algorithm removes the virtual-channel requirement for deadlock avoidance; as a result, performance can be improved or energy consumption reduced, at the designer's choice.
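
    A minimal sketch of the deflection-routing idea referenced in this abstract: flits are never buffered, and each cycle every flit is assigned some free output port, preferring a productive one, with oldest-first priority to avoid livelock. The flit and port structures are illustrative, not the thesis's router microarchitecture.

```python
# One routing cycle of a bufferless (deflection) router: every flit leaves on
# some port; contention losers are deflected instead of being stored.
def route_cycle(flits, ports=("N", "E", "S", "W", "UP", "DOWN", "LOCAL")):
    """flits: list of dicts with 'age' and 'preferred' (ordered port choices).

    Returns {flit_index: assigned_port}. Oldest flits pick first, so no flit
    starves (age-based priority). Assumes no more flits than ports, as in a
    router whose input and output port counts match.
    """
    free = set(ports)
    assignment = {}
    order = sorted(range(len(flits)), key=lambda i: -flits[i]["age"])
    for i in order:
        wanted = [p for p in flits[i]["preferred"] if p in free]
        # Take a productive port if one is free, otherwise deflect anywhere.
        port = wanted[0] if wanted else next(iter(free))
        assignment[i] = port
        free.discard(port)
    return assignment

if __name__ == "__main__":
    flits = [
        {"age": 3, "preferred": ["E", "N"]},
        {"age": 7, "preferred": ["E"]},      # older flit wins the contended port E
        {"age": 1, "preferred": ["UP", "E"]},
    ]
    print(route_cycle(flits))                # {1: 'E', 0: 'N', 2: 'UP'}
```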

    Energy-Efficient and Reliable Computing in Dark Silicon Era

    Dark silicon denotes the phenomenon that, due to thermal and power constraints, the fraction of transistors that can operate at full frequency decreases with each technology generation. Moore's law and Dennard scaling were appropriately coupled for five decades to deliver commensurate exponential performance, first via single-core and later via multi-core designs. However, recalculating Dennard scaling for recent small technology nodes shows that the ongoing multi-core growth demands exponentially increasing thermal design power to achieve a linear performance increase. This process hits a power wall, which raises the amount of dark or dim silicon on future multi-/many-core chips more and more. Furthermore, the growing number of transistors on a single chip, their susceptibility to internal defects, and aging phenomena, which are exacerbated by high chip thermal density, make it necessary to monitor and manage chip reliability both before and after its activation. The approaches and experimental investigations proposed in this thesis focus on two main tracks: 1) power awareness and 2) reliability awareness in the dark silicon era; later the two tracks are combined. In the first track, the main goal is to increase the returns in terms of the most important chip design metrics, such as performance and throughput, while honoring the maximum power limit. In fact, we show that by managing power in the presence of dark silicon, all the traditional benefits of following Moore's law can still be achieved in the dark silicon era, albeit to a lesser extent. In the reliability track, we show that dark silicon can be considered an opportunity to be exploited for several benefits, namely lifetime extension and online testing. We discuss how dark silicon can be exploited to guarantee that the system lifetime stays above a certain target value and, furthermore, how it can be exploited to apply low-cost, non-intrusive online testing to the cores. After demonstrating power and reliability awareness in the presence of dark silicon, two approaches are discussed as case studies where the two are combined. The first approach demonstrates how chip reliability can be used as a supplementary metric for power-reliability management, while the second provides a trade-off between workload performance and system reliability by simultaneously honoring the given power budget and target reliability.
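
    A minimal sketch of the power-budgeting notion behind the first track: with dark silicon, only a subset of cores can be active at once, so a runtime can pick the subset whose total power stays under the chip budget. The core counts, power numbers, and greedy performance-per-watt policy below are illustrative assumptions, not the thesis's actual approach.

```python
# Pick which cores to power on under a chip-level power cap; the rest stay dark.
def activate_cores(power_per_core, perf_per_core, budget_watts):
    """Greedily activate cores with the best performance per watt until the
    chip power budget is exhausted. Returns (active core indices, power used)."""
    order = sorted(range(len(power_per_core)),
                   key=lambda i: perf_per_core[i] / power_per_core[i],
                   reverse=True)
    active, used = [], 0.0
    for i in order:
        if used + power_per_core[i] <= budget_watts:
            active.append(i)
            used += power_per_core[i]
    return active, used

if __name__ == "__main__":
    power = [10, 10, 6, 6, 4, 4]     # watts per core if switched on (made up)
    perf  = [12, 12, 8, 8, 4, 4]     # relative throughput per core (made up)
    print(activate_cores(power, perf, budget_watts=20))
    # ([2, 3, 4, 5], 20.0) -- the two big cores stay dark under this budget
```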

    Task Packing: Efficient task scheduling in unbalanced parallel programs to maximize CPU utilization

    Load imbalance in parallel systems can be caused by factors external to the running applications, such as operating system noise, or by the underlying hardware, such as a heterogeneous cluster. HPC applications working on irregular data structures can also have difficulties balancing their computations across the parallel tasks. In this article we extend, improve, and evaluate more deeply the Task Packing mechanism proposed in a previous work. The main idea of the mechanism is to concentrate the idle cycles of unbalanced applications in such a way that one or more CPUs are freed from execution. To achieve this, CPUs are filled with just the useful work of the parallel application's tasks, provided performance is not degraded. The packing is solved by an algorithm based on the Knapsack problem, onto a minimum number of CPUs and using oversubscription. We design and implement a more efficient version of this mechanism: the Task Packing is performed "in place", taking advantage of the idle cycles generated at synchronization points of unbalanced applications. Evaluations are carried out on a heterogeneous platform using the FT and miniFE benchmarks. Results show that our proposal introduces low overhead. In addition, the number of freed CPUs is related to a load-imbalance metric, which can be used to predict it.
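
    A minimal sketch of the Task Packing idea: tasks whose CPUs are partly idle are co-scheduled (oversubscribed) onto fewer CPUs, as long as no CPU exceeds 100% useful work, so the remaining CPUs are freed. The article formulates the packing as a Knapsack problem; the sketch below uses a simpler first-fit-decreasing heuristic with made-up utilization figures, just to illustrate the packing.

```python
# Pack tasks onto as few CPUs as possible without exceeding each CPU's capacity.
def pack_tasks(cpu_busy_fraction, capacity=1.0):
    """cpu_busy_fraction: useful-work fraction of each task's CPU (0..1).

    Returns a list of bins, each a list of task indices sharing one CPU.
    """
    bins = []            # each bin is [remaining_capacity, [task indices]]
    order = sorted(range(len(cpu_busy_fraction)),
                   key=lambda i: cpu_busy_fraction[i], reverse=True)
    for i in order:
        need = cpu_busy_fraction[i]
        for b in bins:
            if b[0] >= need:     # first bin with enough idle time left
                b[0] -= need
                b[1].append(i)
                break
        else:
            bins.append([capacity - need, [i]])
    return [b[1] for b in bins]

if __name__ == "__main__":
    busy = [0.9, 0.55, 0.4, 0.35, 0.3]        # unbalanced tasks (made up)
    packing = pack_tasks(busy)
    print(packing)                            # [[0], [1, 2], [3, 4]]
    print("CPUs freed:", len(busy) - len(packing))   # 2
```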

    RUNTIME METHODS TO IMPROVE ENERGY EFFICIENCY IN SUPERCOMPUTING APPLICATIONS

    Energy efficiency in supercomputing is critical to limit operating costs and carbon footprints. While the energy efficiency of future supercomputing centers needs to improve at all levels, the energy consumed by the processing units is a large fraction of the total energy consumed by High Performance Computing (HPC) systems. HPC applications use a parallel programming paradigm like the Message Passing Interface (MPI) to coordinate computation and communication among thousands of processors. With dynamically changing factors in both hardware and software affecting the energy usage of processors, power monitoring and regulation at runtime are needed to achieve energy savings. This dissertation presents an adaptive runtime framework that enables processors with core-specific power control to dynamically adapt to workload characteristics and reduce power with little or no performance impact. Two opportunities to improve the energy efficiency of processors running MPI applications are identified: computational workload imbalance and waiting on memory. Performance monitoring and power regulation are performed by the framework transparently within the MPI runtime system, eliminating the need for code changes to MPI applications. The effect of enforcing power limits (capping) on processors is also investigated. Experiments on 32 nodes (1024 cores) show that in the presence of workload imbalance, the runtime reduces the Central Processing Unit (CPU) frequency on cores not on the critical path, thereby reducing power and hence energy usage without deteriorating performance. Using this runtime, six MPI mini-applications and a full MPI application show an overall 20% decrease in energy use with less than 1% increase in execution time. In addition, lowering the frequency of non-critical cores reduces run-to-run performance variation and improves performance. For the full application, an average speedup of 11% is seen, while power is lowered by about 31%, for energy savings of up to 42%. Another experiment on 16 nodes (256 cores) under power capping also shows performance improvement along with power reduction. Thus, energy optimization can also be a performance optimization. For applications that are limited by memory access times, the memory metrics identified make it possible to lower power by up to 32% without adversely impacting performance.
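
    A minimal sketch of the workload-imbalance case described above: ranks that finish their compute early and wait at the next MPI synchronization point can run at a lower frequency, as long as their slowed-down compute time stays below the slowest rank's. The frequency levels, timings, and the linear frequency-scaling assumption are illustrative; the actual framework measures and regulates these at runtime inside the MPI runtime system.

```python
# Choose a per-rank frequency so non-critical ranks slow down "for free".
def pick_frequencies(compute_times, freq_levels=(2.4, 2.0, 1.6, 1.2)):
    """compute_times: seconds of compute per rank at the highest frequency.

    Returns one frequency (GHz) per rank, assuming compute time scales roughly
    inversely with frequency (a simplifying assumption).
    """
    f_max = max(freq_levels)
    critical = max(compute_times)        # slowest rank sets the iteration time
    chosen = []
    for t in compute_times:
        # Lowest frequency whose predicted time still fits under the critical path.
        ok = [f for f in freq_levels if t * (f_max / f) <= critical]
        chosen.append(min(ok))
    return chosen

if __name__ == "__main__":
    times = [10.0, 7.5, 5.0, 9.8]        # per-iteration compute, seconds (made up)
    print(pick_frequencies(times))        # [2.4, 2.0, 1.2, 2.4]
```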

    A Survey of Research on Power Management Techniques for High Performance Systems

    This paper surveys research on power management techniques for high performance systems, including both commercial high performance clusters and scientific high performance computing (HPC) systems. Power consumption has rapidly risen to an intolerable scale, which results in both high operating costs and high failure rates, so it is now a major cause for concern and imposes new challenges on the development of high performance systems. In this paper, we first review the basic mechanisms that underlie power management techniques. Then we survey two fundamental techniques for power management: metrics and profiling. After that, we review the research for the two major types of high performance systems: commercial clusters and supercomputers. Based on this, we discuss the new opportunities and problems presented by the recent adoption of virtualization techniques, and present the most recent research on them. Finally, we summarise and discuss future research directions.
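
    A small illustration of the kind of metrics such surveys cover: energy and the energy-delay product (EDP) computed from a sampled power trace. The sampling interval and trace values below are made up for the example.

```python
# Energy and energy-delay product from a regularly sampled power trace.
def energy_joules(power_watts, interval_s):
    """Integrate the power trace with a simple rectangle rule."""
    return sum(power_watts) * interval_s

def energy_delay_product(power_watts, interval_s):
    """EDP = energy * execution time; lower is better."""
    runtime = len(power_watts) * interval_s
    return energy_joules(power_watts, interval_s) * runtime

if __name__ == "__main__":
    trace = [95.0, 120.0, 118.0, 90.0]       # watts, sampled once per second
    print(energy_joules(trace, 1.0))          # 423.0 J
    print(energy_delay_product(trace, 1.0))   # 1692.0 J*s
```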