
    Resource and thermal management in 3D-stacked multi-/many-core systems

    Continuous semiconductor technology scaling and the rapid increase in computational needs have stimulated the emergence of multi-/many-core processors. While up to hundreds of cores can be placed on a single chip, the performance capacity of the cores cannot be fully exploited due to high latencies of interconnects and memory, high power consumption, and low manufacturing yield in traditional (2D) chips. 3D stacking is an emerging technology that aims to overcome these limitations of 2D designs by stacking processor dies over each other and using through-silicon vias (TSVs) for on-chip communication; it thus provides a large amount of on-chip resources and shortens communication latency. These benefits, however, are limited by the challenges of high power densities and temperatures. 3D stacking also enables integrating heterogeneous technologies into a single chip. One example of heterogeneous integration is building many-core systems with a silicon-photonic network-on-chip (PNoC), which reduces on-chip communication latency significantly and provides higher bandwidth compared to electrical links. However, silicon-photonic links are vulnerable to on-chip thermal and process variations. These variations can be countered by actively tuning the temperatures of optical devices through micro-heaters, but at the cost of substantial power overhead.

    This thesis claims that unearthing the energy-efficiency potential of 3D-stacked systems requires intelligent and application-aware resource management. Specifically, the thesis improves the energy efficiency of 3D-stacked systems via three major components of computing systems: cache, memory, and on-chip communication. We analyze the characteristics of workloads in computation, memory usage, and communication, and present techniques that leverage these characteristics for energy-efficient computing. This thesis introduces 3D cache resource pooling, a cache design that allows for flexible heterogeneity in cache configuration across a 3D-stacked system and improves cache utilization and system energy efficiency. We also demonstrate the impact of resource pooling on a real prototype 3D system with scratchpad memory. At the main memory level, we claim that utilizing heterogeneous memory modules and memory-object-level management significantly helps with energy efficiency. This thesis proposes a memory management scheme at a finer granularity, the memory object level, and a page allocation policy to leverage the heterogeneity of available memory modules and cater to the diverse memory requirements of workloads. On the on-chip communication side, we introduce an approach to limit the power overhead of PNoC in (3D) many-core systems through cross-layer thermal management. Our proposed thermally-aware workload allocation policies, coupled with an adaptive thermal tuning policy, minimize the required thermal tuning power for PNoC and, in this way, help broader integration of PNoC. The thesis also introduces techniques for the placement and floorplanning of optical devices to reduce optical loss and, thus, laser source power consumption.
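
    A minimal sketch of the memory-object-level allocation idea described above, assuming objects are profiled for access intensity and placed greedily across two heterogeneous modules; the module names, metric, and greedy policy are illustrative assumptions, not the thesis's actual algorithm:

        from dataclasses import dataclass

        @dataclass
        class MemObject:
            name: str
            size_mb: float
            accesses_per_kinstr: float  # profiled access intensity

        def allocate(objects, fast_capacity_mb):
            # Greedy: hottest objects go to the fast (3D-stacked) module first;
            # whatever does not fit falls back to the slower off-chip module.
            placement, remaining = {}, fast_capacity_mb
            for obj in sorted(objects, key=lambda o: o.accesses_per_kinstr, reverse=True):
                if obj.size_mb <= remaining:
                    placement[obj.name] = "stacked-DRAM"
                    remaining -= obj.size_mb
                else:
                    placement[obj.name] = "off-chip-DRAM"
            return placement

        objs = [MemObject("hot_index", 32, 300.0),
                MemObject("hash_table", 64, 120.0),
                MemObject("log_buffer", 256, 3.5)]
        print(allocate(objs, fast_capacity_mb=128))
        # {'hot_index': 'stacked-DRAM', 'hash_table': 'stacked-DRAM', 'log_buffer': 'off-chip-DRAM'}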

    Modeling and optimization of high-performance many-core systems for energy-efficient and reliable computing

    Thesis (Ph.D.)--Boston University.
    Many-core systems, ranging from small-scale many-core processors to large-scale high-performance computing (HPC) data centers, have become the main trend in computing system design owing to their potential to deliver higher throughput per watt. However, power densities and temperatures increase following the growth in performance capacity and bring major challenges in energy efficiency, cooling costs, and reliability. These challenges require a joint assessment of performance, power, and temperature tradeoffs as well as the design of runtime optimization techniques that monitor and manage the interplay among them. This thesis proposes novel modeling and runtime management techniques that evaluate and optimize the performance, energy, and reliability of many-core systems. We first address the energy and thermal challenges in 3D-stacked many-core processors. 3D processors with stacked DRAM have the potential to dramatically improve performance owing to lower memory access latency and higher bandwidth. However, the performance increase may cause 3D systems to exceed their power budgets or create thermal hot spots. In order to provide an accurate analysis and enable the design of efficient management policies, this thesis introduces a simulation framework to jointly analyze performance, power, and temperature for 3D systems. We then propose a runtime optimization policy that maximizes system performance by characterizing application behavior and predicting the operating points that satisfy the power and thermal constraints. Our policy reduces the energy-delay product (EDP) by up to 61.9% compared to existing strategies. Performance, cooling energy, and reliability are also critical aspects of HPC data centers. In addition to causing reliability degradation, high temperatures increase the required cooling energy. Communication cost, on the other hand, has a significant impact on system performance in HPC data centers. This thesis proposes a topology-aware technique that maximizes system reliability by selecting between workload clustering and balancing. Our policy improves system reliability by up to 123.3% compared to existing temperature-balancing approaches. We also introduce a job allocation methodology to simultaneously optimize the communication cost and the cooling energy in a data center. Our policy reduces the cooling cost by 40% compared to cooling-aware and performance-aware policies, while achieving performance comparable to the performance-aware policy.
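
    For concreteness, EDP is energy multiplied by delay, so reducing it improves energy and performance jointly. The sketch below picks, among candidate operating points, the one minimizing EDP under power and temperature constraints; the exhaustive search and all numbers are illustrative assumptions, unlike the thesis's prediction-based runtime policy:

        # Each candidate: (freq_GHz, predicted_power_W, predicted_temp_C,
        #                  predicted_delay_s, predicted_energy_J) -- made-up values.
        candidates = [
            (1.0,  45.0, 68.0, 10.0, 450.0),
            (1.5,  70.0, 78.0,  7.0, 490.0),
            (2.0, 110.0, 92.0,  5.5, 605.0),
        ]
        POWER_BUDGET_W, TEMP_LIMIT_C = 100.0, 85.0

        def edp(energy_j, delay_s):
            # energy-delay product: lower is better
            return energy_j * delay_s

        feasible = [c for c in candidates
                    if c[1] <= POWER_BUDGET_W and c[2] <= TEMP_LIMIT_C]
        best = min(feasible, key=lambda c: edp(c[4], c[3]))
        print(f"chosen: {best[0]} GHz, EDP = {edp(best[4], best[3]):.0f} J*s")
        # -> chosen: 1.5 GHz, EDP = 3430 J*s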

    A Holistic Solution for Reliability of 3D Parallel Systems

    As device scaling slows down, emerging technologies such as 3D integration and carbon nanotube field-effect transistors are among the most promising solutions to increase device density and performance. These emerging technologies offer shorter interconnects, higher performance, and lower power. However, higher operating temperatures and current densities project significantly higher failure rates. Moreover, due to the infancy of the manufacturing process and its high variation and defect densities, chip designers are not encouraged to consider these emerging technologies as a stand-alone replacement for silicon-based transistors. The goal of this dissertation is to introduce new architectural and circuit techniques that can work around the high fault rates in emerging 3D technologies, achieving performance and reliability comparable to silicon. We propose a new holistic approach to the reliability problem that addresses the necessary aspects of an effective solution, such as detection, diagnosis, repair, and prevention, synergistically. By leveraging 3D fabric layouts, the dissertation proposes the underlying architecture to efficiently repair the system in the presence of faults. It presents a fault detection scheme that re-executes instructions on idle identical units, distinguishing between transient and permanent faults while localizing faults to the granularity of a pipeline stage. Furthermore, using a dynamic and adaptive reconfiguration policy based on activity factors and temperature variation, we propose a framework that delivers a significant improvement in lifetime management to prevent faults due to aging. Finally, a design framework that can be used for large-scale chip production while mitigating yield and variation failures to bring up carbon-nanotube-based technology is presented. The proposed framework is capable of efficiently supporting high-variation technologies by providing protection against manufacturing defects at different granularities: the module and pipeline-stage levels.
    Ph.D., Computer Science & Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/168118/1/javadb_1.pd
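
    A minimal sketch of the detection idea as described, re-executing an instruction on an idle identical unit and classifying a disappearing mismatch as transient and a persistent one as permanent; the unit callables and retry count are illustrative assumptions:

        import random

        def classify(execute, reexecute, instr, retries=2):
            # Compare the result from the primary unit with a re-execution on an
            # idle identical unit; a mismatch that disappears on retry is
            # transient, a persistent one points at the suspect stage.
            if execute(instr) == reexecute(instr):
                return "no fault detected"
            for _ in range(retries):
                if execute(instr) == reexecute(instr):
                    return "transient fault"
            return "permanent fault in suspect unit/pipeline stage"

        def good_alu(x):
            return x * 2

        def flaky_alu(x):  # simulated unit with occasional transient upsets
            return x * 2 + (1 if random.random() < 0.3 else 0)

        print(classify(good_alu, flaky_alu, 21))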

    Enabling multi-threaded execution and improved memory access in fine-grain near-data processing systems

    Advisor: Marco Antonio Zanata Alves. Doctoral thesis - Universidade Federal do Paraná, Setor de Ciências Exatas, Programa de Pós-Graduação em Informática. Defense: Curitiba, 08/07/2022. Includes references. Area of concentration: Computer Science.
    Abstract: Applications that deal with large amounts of data are increasingly popular. However, traditional computation-centric architectures are ill-equipped to handle such applications, as they cause much data movement across the system due to their near-constant data accesses. This leads to inefficient processing, with long execution times and high energy consumption. The issues caused by this disparity are widely known as the memory wall. Starting in the late 1990s, the idea of moving portions of the computation close to the memory when beneficial began to be considered. This concept has become known as Near-Data Processing (NDP) and gained more attention in the early 2010s with the advent of Through-Silicon Via (TSV) technology, which enabled straightforward integration of processing logic and data storage in the same chip. 3D-stacked memories, which vertically integrate storage and logic, have since become commercially available, and computer architecture researchers have reacted by proposing many designs that place processing elements on the logic layer typically found in those devices. This thesis proposes the Vector-In-Memory Architecture (VIMA), a 3D-stacked memory-based NDP architecture that implements processing in the memory by placing Functional Units (FUs) on the logic layer of those devices. Our design uses vector functional units and a cache memory for dedicated storage, and it advances the state of the art by implementing near-data precise exceptions and enabling near-data multi-threading. We simulate the execution of several common data-driven applications on our architecture, and our results show that the proposed design, with only a single processing core and VIMA, outperforms a modern 16-core traditional architecture by at least 2× when dealing with large dataset sizes. Moreover, this speedup is achieved while reducing energy consumption by at least 75% according to our estimates. Compared with its most closely related state-of-the-art work, VIMA reduces the execution time of data-streaming applications by at least 32%.
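
    A minimal sketch of the intuition behind the reported crossover, assuming a size-based choice between host execution and near-data execution; the threshold and function names are illustrative assumptions, not VIMA's actual interface:

        NDP_THRESHOLD = 1 << 20  # assumed crossover size (1M elements)

        def ndp_vector_add(a, b):
            # Stand-in for a near-data operation: real hardware would stream
            # operands from the stacked DRAM banks straight into the vector FUs.
            return [x + y for x, y in zip(a, b)]

        def vector_add(a, b):
            if len(a) >= NDP_THRESHOLD:
                return ndp_vector_add(a, b)       # large: avoid moving data to the core
            return [x + y for x, y in zip(a, b)]  # small: host core and caches win

        print(vector_add([1, 2, 3], [4, 5, 6]))  # -> [5, 7, 9]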

    Towards resource-aware computing for task-based runtimes and parallel architectures

    Current large-scale systems show increasing power demands, to the point that power has become a huge strain on facilities and budgets. The increasing restrictions on the power consumption of High Performance Computing (HPC) systems and data centers have forced hardware vendors to include power-capping capabilities in their commodity processors. Power capping opens up new opportunities for applications to directly manage their power behavior at user level. However, constraining power consumption causes the individual sockets of a parallel system to deliver different performance levels under the same power cap, even when they are identically designed, an effect caused by manufacturing variability. Modern chips suffer from heterogeneous power consumption due to manufacturing issues, a problem known as manufacturing or process variability. As a result, systems that do not account for such variability suffer performance degradation and wasted power. To avoid this negative impact, users and system administrators must actively counteract manufacturing variability.

    In this thesis we show that parallel systems benefit from taking the consequences of manufacturing variability into account, in terms of both performance and energy efficiency. To evaluate our work, we have also implemented our own task-based version of the PARSEC benchmark suite, which allows us to test our methodology using state-of-the-art parallelization techniques and real-world workloads. We present two approaches to mitigate manufacturing variability: power redistribution at the runtime level, and power- and variability-aware job scheduling at the system-wide level. A parallel runtime system can be used to effectively deal with this new kind of performance heterogeneity by compensating for the uneven effects of power capping. In the context of a NUMA node composed of several multi-core sockets, our system is able to optimize the energy and concurrency levels assigned to each socket to maximize performance. Applied transparently within the parallel runtime system, it does not require any programmer interaction such as changing the application source code or manually reconfiguring the parallel system. We compare our novel runtime analysis with an offline approach and demonstrate that it can achieve equal performance at a fraction of the cost. In the second approach presented in this thesis, we show that it is possible to predict the impact of this variability on specific applications by using variability-aware power prediction models. Based on these power models, we propose two job scheduling policies that consider the effects of manufacturing variability for each application and ensure that power consumption stays under a system-wide power budget. We evaluate our policies under different power budgets and traffic scenarios, consisting of both single- and multi-node parallel applications.
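
    A minimal sketch of runtime power redistribution under a node-wide cap, assuming per-socket efficiency can be measured online; the proportional rule is an illustrative stand-in for the thesis's runtime optimization:

        def redistribute(budget_w, sockets):
            # sockets: name -> measured throughput per watt under the cap.
            # Give efficient sockets a larger share so they are throttled less.
            total = sum(sockets.values())
            return {name: budget_w * eff / total for name, eff in sockets.items()}

        # two nominally identical sockets that differ due to process variability
        caps = redistribute(200.0, {"socket0": 1.25, "socket1": 0.95})
        print(caps)  # socket0 ~113.6 W, socket1 ~86.4 W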

    Reconfigurable Communication Architecture on Chip Multiprocessors for Equal Service Distribution and High Performance

    Doctoral dissertation -- Seoul National University Graduate School: Department of Electrical and Computer Engineering, February 2016. Advisor: Kiyoung Choi.
    The chip multiprocessor (CMP) era has long begun, due to the diminishing returns from instruction-level parallelism (ILP) harvesting techniques, the rising power and temperature from frequency scaling, and other factors. One powerful processor has been replaced by many less powerful processors forming a CMP. One of the issues arising from this paradigm shift is the management of communication among the processors. Buses, which had been a common choice for systems with one or a few processors, failed to sustain the increased communication burden of CMPs. Many bus-based improvements, including hierarchical buses and bus matrices, were proposed, but eventually network-on-chip (NoC) became the de facto standard for designing a CMP system, replacing the bus-based techniques. NoC's strength over buses mainly comes from its capability of conveying multiple transactions simultaneously between different components. The concurrent communications between the cores are conducted by distributed, yet shared, network components: routers. Routers provide cores with services such as bandwidth. One of the design issues in implementing a NoC is distributing these services evenly across all the cores requesting them. An arbiter is a component that regulates accesses to shared resources such as channels and buffers. It has a policy under which requests get services in turn from the shared resources so that the requestors do not fall into deadlock or starvation. One of the common policies for an arbiter is round-robin, where requests are granted one by one so that fairness is assured among the requestors. When applied to routers in a NoC, however, round-robin fails to provide fairness, because each request goes through multiple routers, and thus multiple round-robin arbiters, on a transaction route. The cascaded effect of round-robin arbitration is that the farther a source is from the destination, the less service it gets from the destination. The first part of this thesis addresses this issue and proposes thus far the simplest yet most effective way of providing fairness to all the nodes on a NoC (see the sketch following this abstract). It applies a weighted round-robin scheme where the weights are determined at run time depending on which cores are allocated to the applications or threads running on the CMP. RTL implementation and synthesis are done to show the simplicity of the proposed scheme. Simulations with synthetic traffic patterns and SPEC CPU2006 benchmark applications show that the proposed approach results in outstanding equality-of-service characteristics. The second part of this thesis deals with the impact of a reconfigurable communication architecture on the performance of a CMP system. One of the pitfalls of NoC is long access latency due to the increased hop count between a source and its destination. For example, a NoC with mesh topology has a hop count proportional to its size. Because of this, while being a common choice for CMPs, the mesh topology is said not to scale with the number of cores. Some alternatives to the mesh topology exist, one of them being high-radix NoCs, which replace the short, wide channels of a mesh with long, narrow ones, achieving lower hop counts. Another option is to cluster cores so that the dimension of the mesh network is reduced. The clusters are formed by grouping cores via a local communication fabric, and the clusters are interconnected by a global communication fabric, often in the shape of a mesh topology.
    Many types of local communication fabric have been explored in previous research, including other NoCs with mesh, ring, and similar topologies. However, the bus has become one of the most favorable choices for the local connection because of its simplicity, which allows local communications to be performed with high performance, low chip area, and low power consumption. One of the issues in forming core clusters in a CMP is their grain size. Tying too many cores into a cluster results in congestion on the bus, reducing the performance of local communications. On the other hand, too few cores in a cluster misses chances to improve system performance through efficient local communication over the bus. Clearly, the optimal number of cores in a cluster depends on the applications that run on the CMP. Bus reconfiguration with bus segments and switches can be a solution for varying the cluster size on a CMP. In addition to variable cluster sizes, bus reconfiguration has another advantage: processor (not process) migration. Bus reconfiguration can reconnect cores and caches so that the distance between cores and data is reduced dynamically. In this way, data copies and network transactions can be dramatically reduced to improve system performance. The second part of this thesis addresses this issue and proposes a reconfigurable bus-mesh architecture to accelerate pipelined applications. With the proposed architecture, data transfer between successive pipeline stages is done not by data copies but by processor migrations. Systematic management of bus segments and L1 data caches is required to make efficient use of the reconfigurability. The proposed architecture is compared with a baseline architecture that maintains cache coherence in hardware. A multilayer perceptron (MLP), a convolutional neural network (CNN), and a JPEG decoder are implemented as example pipelined applications using a multi-threaded programming model. An in-house full-system simulator is implemented and used to measure the performance improvement of the proposed architecture.
    The experimental results show that execution cycle reductions of 21.75%, 14.40%, and 12.74% are achieved for the MLP, CNN, and JPEG decoder, respectively.
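
    A minimal sketch of the weighted round-robin idea from the first part, assuming fixed illustrative weights rather than the thesis's run-time, allocation-dependent weight update:

        def weighted_round_robin(nodes, weights):
            # Grant each node 'weights[node]' slots per round; giving distant
            # sources larger weights offsets the cascaded round-robin bias.
            while True:
                for n in nodes:
                    for _ in range(weights[n]):
                        yield n

        # node 2 is assumed farthest from the destination, so it gets the
        # largest weight
        grants = weighted_round_robin([0, 1, 2], {0: 1, 1: 2, 2: 3})
        print([next(grants) for _ in range(12)])
        # -> [0, 1, 1, 2, 2, 2, 0, 1, 1, 2, 2, 2]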