
    Design Space Exploration and Resource Management of Multi/Many-Core Systems

    The increasing demand for processing more applications and related data on computing platforms has led to a reliance on multi-/many-core chips, as they facilitate parallel processing. However, these platforms also need to be energy-efficient and reliable, and to perform secure computations. This book provides perspectives on these aspects from leading researchers, covering state-of-the-art contributions and upcoming trends.

    Power Bounded Computing on Current & Emerging HPC Systems

    Power has become a critical constraint on the evolution of large-scale High Performance Computing (HPC) systems and commercial data centers. This constraint spans almost every level of computing technology, from IC chips all the way up to data centers, for physical, technical, and economic reasons. To cope with this reality, it is necessary to understand how available or permissible power impacts the design and performance of emerging computer systems. For this reason, we propose power bounded computing and corresponding technologies to optimize performance on HPC systems with limited power budgets. This dissertation has multiple research objectives, centered on understanding the interaction between performance, power bounds, and a hierarchical power management strategy. First, we develop heuristics and application-aware power allocation methods to improve application performance on a single node. Second, we develop algorithms to coordinate power across nodes and components on a cluster, based on application characteristics and the power budget. Third, we investigate performance interference induced by hardware and power contention, and propose contention-aware job scheduling to maximize system throughput under given power budgets on node-sharing systems. Fourth, we extend the approach to GPU-accelerated systems and workloads, developing an online dynamic performance and power approach that meets both performance requirements and power-efficiency goals. Power bounded computing improves performance scalability and power efficiency and decreases the operating costs of HPC systems and data centers. This dissertation opens up several new avenues of research in power bounded computing to address the power challenges of HPC systems. The proposed power and resource management techniques provide new directions and guidelines for green exascale computing and other computing systems.
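
    As a minimal sketch of the kind of cluster-level power coordination described above (the second objective), the following Python fragment splits a power budget greedily across nodes. The node names, power ranges, and gain-per-watt figures are invented placeholders standing in for profiled application characteristics; this is not the dissertation's actual algorithm.

        # Greedy power allocation under a cluster-wide budget: give every
        # node its minimum power first, then spend the remaining watts on
        # the nodes whose applications benefit most per extra watt.

        def allocate_power(nodes, budget):
            """nodes: dicts with name, p_min, p_max, gain_per_watt (assumed)."""
            alloc = {n["name"]: n["p_min"] for n in nodes}
            remaining = budget - sum(alloc.values())
            for n in sorted(nodes, key=lambda n: n["gain_per_watt"], reverse=True):
                extra = min(n["p_max"] - n["p_min"], remaining)
                alloc[n["name"]] += extra
                remaining -= extra
                if remaining <= 0:
                    break
            return alloc

        nodes = [
            {"name": "node0", "p_min": 60, "p_max": 150, "gain_per_watt": 0.9},  # compute-bound
            {"name": "node1", "p_min": 60, "p_max": 150, "gain_per_watt": 0.3},  # memory-bound
        ]
        print(allocate_power(nodes, budget=250))  # {'node0': 150, 'node1': 100}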

    Adaptive Distributed Architectures for Future Semiconductor Technologies.

    Year after year, semiconductor manufacturing has been able to integrate more components into a single computer chip. These improvements have been possible through systematic shrinking of the size of its basic computational element, the transistor. This trend has allowed computers to become progressively faster, more efficient, and less expensive. As it continues, experts foresee that current computer designs will face new challenges in utilizing the minuscule devices made available by future semiconductor technologies. Today's microprocessor designs are not fit to overcome these challenges, since they are constrained by their inability to handle component failures, by their lack of adaptability to a wide range of custom modules optimized for specific applications, and by their limited design modularity. The focus of this thesis is to develop original computer architectures that can not only survive these new challenges, but also leverage the vast number of transistors available to unlock better performance and efficiency. The work explores and evaluates new software and hardware techniques to enable the development of novel adaptive and modular computer designs. The thesis first explores an infrastructure to quantitatively assess the fallacies of current systems and their inadequacy to operate on unreliable silicon. In light of these findings, specific solutions are then proposed to strengthen digital system architectures, through both hardware and software techniques. The thesis culminates with the proposal of a radically new architecture design that can fully adapt dynamically to operate on the hardware resources available on chip, however limited or abundant those may be.
    Ph.D. Computer Science and Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/102405/1/apellegr_1.pd

    Circuits and Systems Advances in Near Threshold Computing

    Modern society is witnessing a sea change in ubiquitous computing, in which people have embraced computing systems as an indispensable part of day-to-day existence. The computation, storage, and communication abilities of smartphones, for example, have undergone monumental changes over the past decade. At the same time, the global emphasis on creating and sustaining green environments is driving a rapid and ongoing proliferation of edge computing systems and applications. As a broad spectrum of healthcare, home, and transport applications shifts to the edge of the network, near-threshold computing (NTC) is emerging as one of the most promising low-power computing platforms. An NTC device sets its supply voltage close to its threshold voltage, dramatically reducing energy consumption. Despite showing substantial promise in terms of energy efficiency, NTC has yet to see wide-scale commercial adoption, because circuits and systems operating near threshold suffer from several problems, including increased sensitivity to process variation, reliability problems, performance degradation, and security vulnerabilities, to name a few. To realize its potential, we need designs, techniques, and solutions that overcome the challenges associated with NTC circuits and systems. The readers of this book will be able to familiarize themselves with recent advances in electronic systems, with a focus on near-threshold computing.
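
    A back-of-the-envelope illustration of the energy argument behind NTC, using the standard dynamic-energy relation E ≈ C·V² per operation; the capacitance and voltage values below are assumptions chosen for illustration, not figures from the book.

        # Dynamic energy per operation scales roughly as C * V^2, so halving
        # the supply voltage cuts switching energy to about a quarter, at the
        # cost of lower frequency and higher sensitivity to variation.

        C = 1e-9  # effective switched capacitance per op (farads), assumed

        def dynamic_energy(vdd):
            return C * vdd ** 2  # joules per operation

        e_nominal = dynamic_energy(1.0)  # nominal supply, e.g. 1.0 V
        e_ntc = dynamic_energy(0.5)      # near-threshold supply, e.g. 0.5 V
        print(f"energy ratio: {e_ntc / e_nominal:.2f}")  # -> 0.25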

    Power-constrained aware and latency-aware microarchitectural optimizations in many-core processors

    As transistor budgets outpace the power envelope (the power-wall issue), new architectural and microarchitectural techniques are needed to improve, or at least maintain, the power efficiency of next-generation processors. Run-time adaptation, including core, cache, and DVFS adaptations, has recently emerged as a promising approach to maintaining acceptable power efficiency. However, none of the adaptation techniques proposed so far can provide good results under the stringent power budgets that will be common in the coming decades, so new techniques that attack the problem from several fronts using different specialized mechanisms are necessary. Combining different power management mechanisms, however, brings extra levels of complexity, since other factors such as workload behavior and run-time conditions must also be considered to properly allocate power among cores and threads.

    To address the power issue, this thesis first proposes Chrysso, an integrated and scalable model-driven power management scheme that quickly selects the best combination of adaptation methods out of different core and uncore microarchitecture adaptations, per-core DVFS, or any combination thereof. Chrysso can quickly search the adaptation space by making performance/power projections to identify Pareto-optimal configurations, effectively pruning the search space. Chrysso achieves 1.9x better chip performance than core-level gating for multi-programmed workloads, and 1.5x higher performance for multi-threaded workloads.

    Most existing power management schemes use a centralized approach to regulate power dissipation. Unfortunately, the complexity and overhead of centralized power management increase significantly with core count, rendering it unviable at fine-grained time slices. This work therefore leverages a two-tier hierarchical power manager. The solution is highly scalable with low overhead on a tiled many-core architecture with a shared LLC and per-tile DVFS at fine-grained time slices: the global power budget is first distributed across tiles by a global power manager (GPM), and then within each tile, in parallel across all tiles. Additionally, this work proposes DVFS- and cache-aware thread migration (DCTM) to ensure optimal per-tile co-scheduling of compatible threads at runtime on top of the two-tier hierarchical power manager. DCTM outperforms existing solutions by up to 12% on an adaptive many-core tile processor.

    With advances in core microarchitectural techniques and technology scaling, the performance gap between the computational and memory components is growing significantly (the memory-wall issue). To bridge this gap, the architecture community is moving toward multi-core architectures with on-die near-memory DRAM caches (faster than conventional DRAM). Gigascale DRAM caches pose the problem of how to manage tags efficiently. Tags-in-DRAM designs aim to co-locate tags with data efficiently, but they still suffer from high latency, especially at high associativity. The thesis finally proposes the Tag Cache mechanism, an on-chip distributed tag caching mechanism with limited space and latency overhead that bypasses the tag read operation in multi-way DRAM caches, thereby reducing hit latency. Each Tag Cache, stored in L2, holds the tag information of the most recently used DRAM cache ways.
    The Tag Cache is able to exploit the temporal locality of the DRAM cache, contributing on average 46% of the DRAM cache hits.
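
    The Pareto pruning step that Chrysso's search relies on can be sketched as a simple non-dominated filter over projected (performance, power) points. The example configurations below are invented; in Chrysso the projections come from its performance/power models.

        # Keep only configurations for which no other configuration offers
        # both higher projected performance and lower projected power.

        def pareto_front(configs):
            """configs: list of (name, performance, power) tuples."""
            front = []
            for name, perf, power in configs:
                dominated = any(
                    p2 >= perf and w2 <= power and (p2 > perf or w2 < power)
                    for _, p2, w2 in configs
                )
                if not dominated:
                    front.append((name, perf, power))
            return front

        configs = [
            ("small core, low DVFS",  1.0, 20),
            ("small core, high DVFS", 1.4, 35),
            ("big core, low DVFS",    1.3, 40),  # dominated: slower and hungrier
            ("big core, high DVFS",   2.0, 70),
        ]
        print(pareto_front(configs))  # the dominated point is pruned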

    Internet of Things and Neural Network Based Energy Optimization and Predictive Maintenance Techniques in Heterogeneous Data Centers

    The rapid growth of cloud-based systems is accelerating the growth of data centers. Private and public cloud service providers are deploying data centers all around the world, and cloud providers' need for edge locations has created large demand for leasing space and power from midsize data centers in smaller cities. Midsize data centers are typically modular and heterogeneous, demanding 100% availability along with high service level agreements. Data centers account for an increasingly troublesome share of electricity consumption, and growing energy costs and environmental responsibility have placed the data center industry, particularly midsize data centers, under increasing pressure to improve operational efficiency. Power consumption is driven mainly by servers and networking devices on the computing side and by cooling systems on the facility side. The facility-side systems have complex interactions with each other; static control logic, a high number of configurations, and nonlinear interdependencies make it challenging to understand and optimize energy efficiency. Determining the optimum configuration analytically or experimentally is very difficult; however, learning-based approaches have proven effective for optimizing such complex operations.

    In this thesis, we utilize a learning engine that learns from operationally collected data to accurately predict Power Usage Effectiveness (PUE), and we create intelligent methods to validate and test the results. We explore new techniques for designing and implementing an Internet of Things (IoT) platform to collect, store, and analyze data. First, we study using a machine learning framework to predictively detect issues in facility-side systems in a modular midsize data center, and propose ways to recognize gaps between optimal and operational values to identify potential issues. Second, we study using machine learning techniques to optimize power usage in facility-side systems, experimenting with neural network controllers to further optimize the data suite cooling system's energy consumption in real time.

    We designed, implemented, and deployed an IoT framework to collect relevant information from the facility-side infrastructure, with flexible configuration controllers to connect all facility-side infrastructure within the data center ecosystem. We addressed resiliency by creating a redundant controls network and mission-critical alerting via an edge device. The collected data was also used to enhance service processes, with high impact on operational service-level metrics: response times improved by 77% and first-time resolution went up by 32%. Further, our experimental results show that we can predictively identify issues in the cooling systems, and that anomalies can be identified 30 to 60 days ahead. We also see the potential to improve power usage efficiency in the range of 3% to 6%. In the future, more samples of issues and corrective actions can be analyzed to create a practical implementation of a neural-network-based controller for real-time optimization.
    Ph.D. Information Systems Engineering, College of Engineering and Computer Science, University of Michigan-Dearborn. http://deepblue.lib.umich.edu/bitstream/2027.42/136074/1/Final Dissertation Vishal Singh.pdf
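
    For reference, PUE itself is a simple ratio, and the thesis's gap check between predicted-optimal and operational values can be sketched in a few lines. The power readings, the predicted optimum, and the 5% alerting threshold below are illustrative assumptions, not values from this work.

        # PUE = total facility power / IT equipment power (1.0 is ideal).
        # Flag a potential facility-side issue when the observed PUE drifts
        # too far above the learning engine's predicted optimum.

        def pue(total_facility_kw, it_equipment_kw):
            return total_facility_kw / it_equipment_kw

        observed = pue(total_facility_kw=480.0, it_equipment_kw=300.0)  # 1.60
        predicted_optimal = 1.45  # stand-in for the learning engine's output

        gap = (observed - predicted_optimal) / predicted_optimal
        if gap > 0.05:  # assumed alerting threshold
            print(f"PUE {observed:.2f} is {gap:.0%} above optimum: inspect cooling")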

    Virtual Runtime Application Partitions for Resource Management in Massively Parallel Architectures

    This thesis presents a novel design paradigm, called Virtual Runtime Application Partitions (VRAP), to judiciously utilize on-chip resources. As the dark silicon era approaches, in which power considerations will allow only a fraction of the chip to be powered on, judicious resource management will become a key consideration in future designs. Most work on resource management treats only the physical components (i.e., computation, communication, and memory blocks) as resources and manipulates the component-to-application mapping to optimize various parameters (e.g., energy efficiency). To further enhance the optimization potential, we propose to manipulate abstract resources in addition to the physical ones (i.e., the voltage/frequency operating point, the fault-tolerance strength, the degree of parallelism, and the configuration architecture). The proposed framework (i.e., VRAP) encapsulates methods, algorithms, and hardware blocks to provide each application with the abstract resources tailored to its needs. To test the efficacy of this concept, we have developed three distinct self-adaptive environments: (i) a Private Operating Environment (POE), (ii) a Private Reliability Environment (PRE), and (iii) a Private Configuration Environment (PCE), which collectively ensure that each application meets its deadlines using minimal platform resources. Several novel architectural enhancements, algorithms, and policies are presented to realize virtual runtime application partitions efficiently. Considering future design trends, we have chosen Coarse-Grained Reconfigurable Architectures (CGRAs) and Networks-on-Chip (NoCs) to test the feasibility of our approach; specifically, the Dynamically Reconfigurable Resource Array (DRRA) and McNoC serve as the representative CGRA and NoC platforms. The proposed techniques are compared and evaluated using a variety of quantitative experiments. Synthesis and simulation results demonstrate that VRAP significantly enhances energy and power efficiency compared to the state of the art.
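
    The kind of decision a Private Operating Environment (POE) makes can be illustrated as picking the lowest-power voltage/frequency operating point that still meets an application's deadline. The operating points and cycle count below are invented examples, not DRRA or McNoC data.

        # Choose the cheapest operating point that meets the deadline;
        # fall back to the fastest point if none does (best effort).

        OPERATING_POINTS = [  # (frequency_mhz, power_mw), sorted by power
            (200, 40),
            (400, 95),
            (800, 260),
        ]

        def pick_point(cycles, deadline_ms):
            for freq_mhz, power_mw in OPERATING_POINTS:
                runtime_ms = cycles / (freq_mhz * 1e3)  # MHz -> cycles per ms
                if runtime_ms <= deadline_ms:
                    return freq_mhz, power_mw
            return OPERATING_POINTS[-1]

        print(pick_point(cycles=3_000_000, deadline_ms=10.0))  # -> (400, 95)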

    Domain-aware Genetic Algorithms for Hardware and Mapping Optimization for Efficient DNN Acceleration

    The proliferation of AI across a variety of domains (vision, language, speech, recommendations, games) has led to the rise of domain-specific accelerators for deep learning. At design time, these accelerators carefully architect the on-chip dataflow to maximize data reuse (over space and time) and size the hardware resources (PEs and buffers) to maximize performance and energy efficiency while meeting the chip's area and power targets. At compile time, the target Deep Neural Network (DNN) model is mapped onto the accelerator. The mapping refers to tiling the computation and data (i.e., tensors) and scheduling them over the PEs and scratchpad buffers, respectively, while honoring the microarchitectural constraints (number of PEs, buffer sizes, and dataflow). The design space of valid hardware resource assignments for a given dataflow, and of valid mappings for a given hardware configuration, is extremely large (~O(10^24) per layer for state-of-the-art DNN models today), making exhaustive search infeasible. Unfortunately, there can be orders-of-magnitude differences in performance and energy efficiency between an optimal and a sub-optimal choice, making these decisions a crucial part of the entire design process. Moreover, manual tuning by domain experts has become unprecedentedly challenging due to the increased irregularity (from neural architecture search) and sparsity of DNN models. This necessitates Map Space Exploration (MSE). In this thesis, our goal is to deliver a deep analysis of MSE for DNN accelerators, propose techniques to improve MSE, and generalize the MSE framework to a wider landscape (from mapping to hardware-mapping co-exploration, and from single-accelerator to multi-accelerator scheduling). As part of this, we discuss the correlation between hardware flexibility and the resulting map space, and formalize the map-space representation with four mapping axes: tile, order, parallelism, and shape. We then develop dedicated exploration operators for these axes and use a genetic algorithm framework to converge on a solution. Next, we develop a "sparsity-aware" technique to enable sparsity considerations in MSE and a "warm-start" technique to address the search-speed challenge common to learning-based search algorithms. Finally, we extend our MSE to support hardware and map-space co-exploration and multi-accelerator scheduling.
    Ph.D.
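
    A toy sketch of a genetic search over the four mapping axes named above (tile, order, parallelism, shape). The encoding, the mutation rule, and the stand-in cost function are invented for illustration; the thesis's framework uses dedicated operators per axis and evaluates mappings with an accelerator cost model.

        import random

        AXES = {
            "tile":        [16, 32, 64, 128],
            "order":       ["KCRS", "CKRS", "RSCK"],  # assumed loop-order labels
            "parallelism": [4, 8, 16],
            "shape":       ["square", "tall", "wide"],
        }

        def random_mapping():
            return {axis: random.choice(vals) for axis, vals in AXES.items()}

        def mutate(mapping):
            child = dict(mapping)
            axis = random.choice(list(AXES))  # pick one axis, resample its value
            child[axis] = random.choice(AXES[axis])
            return child

        def cost(m):  # stand-in for a latency/energy model (lower is better)
            return abs(m["tile"] - 64) + abs(m["parallelism"] - 16)

        population = [random_mapping() for _ in range(32)]
        for _ in range(50):  # keep the fittest half, refill with mutants
            population.sort(key=cost)
            survivors = population[:16]
            population = survivors + [mutate(random.choice(survivors)) for _ in range(16)]
        print(min(population, key=cost))  # converges toward tile=64, parallelism=16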