6 research outputs found

    Optimization of Reliability and Power Consumption in Systems on a Chip

    Get PDF
    Aggressive transistor scaling, decreased voltage margins and increased processor power and temperature, have made reliability assessment a much more significant issue in design. Although reliability of devices and interconnect has been broadly studied, here we characterize reliability at the system level. Thus we consider component-based System on Chip designs. Reliability is strongly affected by system temperature, which is in turn driven by power consumption. Thus, component reliability and their power management should be addressed jointly. We present here a joint reliability and power management optimization problem whose solution is an optimal management policy. When careful joint policy optimization is performed, we obtain a significant improvement in energy consumption (40%) in tandem with meeting reliability constraint for all operating temperatures

    An Application-Specific Design Methodology for On-chip Crossbar Generation

    Get PDF
    Designing a power-efficient interconnection architec- ture for MultiProcessor Systems-on-Chips (MPSoCs) satisfying the application performance constraints is a nontrivial task. In order to meet the tight time-to-market constraints and to effec- tively handle the design complexity, it is essential to provide a computer-aided design tool support for automating this task. In this paper, we address the issue of “application-specific design of optimal crossbar architecture” satisfying the performance re- quirements of the application and optimal binding of the cores onto the crossbar resources. We present a simulation-based design approach that is based on the analysis of the actual traffic trace of the application, considering local variations in traffic rates, temporal overlap among traffic streams, and criticality of traffic streams. Our approach is physical design aware, where the wiring complexity of the crossbar architecture is also considered during the design process. This leads to detecting timing violations on the wires early in the design cycle and to having accurate estimates of the power consumption on the wires. We apply our methodology onto several MPSoC designs, and the synthesized crossbar plat- forms are validated for performance by cycle-accurate SystemC simulation of the designs. The crossbar matrix power consumption values are based on the synthesis of the register transfer level models of the designs, obtained using industry standard tools. The experimental case studies show large reduction in communication architecture power consumption (45.3% on average) and total wirelength (38% on average) for the MPSoC designs when com- pared with traditional design approaches. The synthesized cross- bar designs also lead to large reduction in transaction latencies (up to 7×) when compared with the existing design approaches

    Application aware performance, power consumption, and reliability tradeoff

    Get PDF
    There has been an unprecedented increase in the drive for microprocessor performance. This drive is motivated by the increase in software complexity, opportunity to solve previously unattempted problems especially in scientific domain, and a need to crunch the ever growing `Big Data\u27 to enable a multitude of technological advances to benefit mankind. A consequence of these phenomena is the ever increasing transistor count in deployed computing systems. Although technology scaling leads to lower power consumption per transistor, the overall system level power consumption is on the rise. This leads to a variety of power supply related issues. As the chip die area is not increasing significantly, and the supply voltage reduction is not keeping on par with the reduction in device dimensions, an increase in power density is observed. This manifests as an increased temperature profile on the chip floorplan. A rise in temperature necessitates aggressive and costly cooling mechanisms adding to the design complexity and manufacturing efforts. It also triggers various failure mechanisms leading to reduction in the expected chip lifetime/reliability. Given the conflicting trends in Performance, Power consumption, and chip Reliability (PPR), it is imperative to balance them in a fine-grained fashion to meet system level goals and expectations. Sole dependence on the advancements in manufacturing technology is no longer sufficient. Alternate venues for PPR management are being increasingly paid attention to. On the other hand, the PPR demands are usually time dependent. For example, the constraint on power consumption in a green data center is dictated by the energy reserve. The demand on performance in a cloud based platform depends on the agreed Quality of Service (QOS) requirements. The reliability of a microprocessor is dependent on the deployment domain. The goal of our research is to address the issue of growing microprocessor power consumption subject to performance and/or reliability goals. Through our developed schemes, we tailor the execution context to match application requirements. This leads to judicious use of power while adhering to aforementioned constraints. It is to be noted that the actual demands on performance, power consumption, and reliability are highly variant, and depend upon executing applications and operating conditions. As such, we develop schemes to cater to these variant demands. To meet these demands efficiently, the solutions developed are tailored to current hardware-software interaction characteristics. Two techniques that are very relevant in this area, namely dynamic voltage and frequency scaling (DVFS) and microarchitectural adaptation, are leveraged to produce expected PPR characteristics when executing a wide variety of tasks. In this dissertation, we demonstrate how the expected chip lifetime can be augmented in a real-time setting using DVFS while paying heed to performance constraints modeled as QoS requirements. Individual tasks in a task queue are assigned specific voltage and frequency pairs to utilize for their execution. This assignment is empowered by knowledge of application-wise hardware-software interactions to reach solutions that are tailored to the current execution scenario. Our observations indicate that a 2 to 18 fold improvement in chip lifetime can be expected by the utilization of the schemes we develop in this regard. Capitalizing on the power of microarchitectural adaptation, we further improve chip lifetime expectations 2-8 times, based on the failure mechanism investigated. This increase in expected chip lifetime directly translates to reduction of both operational and replacement costs. We also provide mechanisms to co-manage performance and power consumption constraints. Comprehensive microarchitectural adaptation space is very complex and its usage thus leads to significant runtime overhead. To tackle this, we devote a fair bit of attention to its pruning so as to narrow down on and utilize only the most effective adaptations. A two stage adaptation process is provided to a) improve optimality of the solutions delivered, and b) to keep the runtime overhead in check. We observe that our schemes provide 20\% higher normalized energy efficiency compared to the state of the art techniques proposed, while using just a very small fraction of the configuration space. We also find that our schemes effectively cater to a wide variety of demands on performance and power consumption, providing the necessary hardware characteristics within 10\% bound. Since only the most useful configuration space is retained for adaptation, occurrence of a fault that prohibits the usage of a certain adaptive control can lead to the inability to satisfy a subset of hardware demands. A detailed analysis has been carried out to understand how the remaining active configurations can preserve the expected hardware behavior. To a good extent, we observe that the system behavior under a failure closely tracks (with less than 5\% tracking error) the obtainable behavior without the presence of the fault. We believe that application tailored schemes for PPR management become increasingly relevant as the microprocessor design advancements saturate in the future. They will be extremely relevant to extract every possible ounce of performance while confirming to constraints on power consumption and reliability. Given the effectiveness of our schemes, we are confident that such schemes are applicable in different markets like embedded computing, desktop computing, cloud platforms and high performance computing. Insights drawn from our research will guide chip designers in the provision of effective adaptive controls to combat increasing demands on PPR characteristics

    Resource management for extreme scale high performance computing systems in the presence of failures

    Get PDF
    2018 Summer.Includes bibliographical references.High performance computing (HPC) systems, such as data centers and supercomputers, coordinate the execution of large-scale computation of applications over tens or hundreds of thousands of multicore processors. Unfortunately, as the size of HPC systems continues to grow towards exascale complexities, these systems experience an exponential growth in the number of failures occurring in the system. These failures reduce performance and increase energy use, reducing the efficiency and effectiveness of emerging extreme-scale HPC systems. Applications executing in parallel on individual multicore processors also suffer from decreased performance and increased energy use as a result of applications being forced to share resources, in particular, the contention from multiple application threads sharing the last-level cache causes performance degradation. These challenges make it increasingly important to characterize and optimize the performance and behavior of applications that execute in these systems. To address these challenges, in this dissertation we propose a framework for intelligently characterizing and managing extreme-scale HPC system resources. We devise various techniques to mitigate the negative effects of failures and resource contention in HPC systems. In particular, we develop new HPC resource management techniques for intelligently utilizing system resources through the (a) optimal scheduling of applications to HPC nodes and (b) the optimal configuration of fault resilience protocols. These resource management techniques employ information obtained from historical analysis as well as theoretical and machine learning methods for predictions. We use these data to characterize system performance, energy use, and application behavior when operating under the uncertainty of performance degradation from both system failures and resource contention. We investigate how to better characterize and model the negative effects from system failures as well as application co-location on large-scale HPC computing systems. Our analysis of application and system behavior also investigates: the interrelated effects of network usage of applications and fault resilience protocols; checkpoint interval selection and its sensitivity to system parameters for various checkpoint-based fault resilience protocols; and performance comparisons of various promising strategies for fault resilience in exascale-sized systems

    Optimization of Reliability and Power Consumption in Systems on a Chip

    No full text
    Abstract. Aggressive transistor scaling, decreased voltage margins and increased processor power and temperature, have made reliability assessment a much more significant issue in design. Although reliability of devices and interconnect has been broadly studied, here we characterize reliability at the system level. Thus we consider component-based System on Chip designs. Reliability is strongly affected by system temperature, which is in turn driven by power consumption. Thus, component reliability and their power management should be addressed jointly. We present here a joint reliability and power management optimization problem whose solution is an optimal management policy. When careful joint policy optimization is performed, we obtain a significant improvement in energy consumption (40%) in tandem with meeting reliability constraint for all operating temperatures.
    corecore