913 research outputs found
Recommended from our members
On thermal sensor calibration and software techniques for many-core thermal management
The high power density of a many-core processor results in increased temperature which negatively impacts system reliability and performance. Dynamic thermal management applies thermal-aware techniques at run time to avoid overheating using temperature information collected from on-chip thermal sensors. Temperature sensing and thermal control schemes are two critical technologies for successfully maintaining thermal safety. In this dissertation, on-line thermal sensor calibration schemes are developed to provide accurate temperature information.
Software-based dynamic thermal management techniques are proposed using calibrated thermal sensors. Due to process variation and silicon aging, on-chip thermal sensors require periodic calibration before use in DTM. However, the calibration cost for thermal sensors can be prohibitively high as the number of on-chip sensors increases. Linear models which are suitable for on-line calculation are employed to estimate temperatures at multiple sensor locations using performance counters. The estimated temperature and the actual sensor thermal profile show a very high similarity with correlation coefficient ~0.9 for SPLASH2 and SPEC2000 benchmarks.
A calibration approach is proposed to combine potentially inaccurate temperature values obtained from two sources: thermal sensor readings and temperature estimations. A data fusion strategy based on Bayesian inference, which combines information from these two sources, is demonstrated. The result shows the strategy can effectively recalibrate sensor readings in response to inaccuracies caused by process variation and environmental noise. The average absolute error of the corrected sensor temperature readings is
A dynamic task allocation strategy is proposed to address localized overheating in many-core systems. Our approach employs reinforcement learning, a dynamic machine learning algorithm that performs task allocation based on current temperatures and a prediction regarding which assignment will minimize the peak temperature. Our results show that the proposed technique is fast (scheduling performed in \u3c1 \u3ems) and can efficiently reduce peak temperature by up to 8 degree C in a 49-core processor (6% on average) versus a leading competing task allocation approach for a series of SPLASH-2 benchmarks. Reinforcement learning has also been applied to 3D integrated circuits to allocate tasks with thermal awareness
CoMeT: An Integrated Interval Thermal Simulation Toolchain for 2D, 2.5 D, and 3D Processor-Memory Systems
Processing cores and the accompanying main memory working in tandem enable
the modern processors. Dissipating heat produced from computation, memory
access remains a significant problem for processors. Therefore, processor
thermal management continues to be an active research topic. Most thermal
management research takes place using simulations, given the challenges of
measuring temperature in real processors. Since core and memory are fabricated
on separate packages in most existing processors, with the memory having lower
power densities, thermal management research in processors has primarily
focused on the cores.
Memory bandwidth limitations associated with 2D processors lead to
high-density 2.5D and 3D packaging technology. 2.5D packaging places cores and
memory on the same package. 3D packaging technology takes it further by
stacking layers of memory on the top of cores themselves. Such packagings
significantly increase the power density, making processors prone to heating.
Therefore, mitigating thermal issues in high-density processors (packaged with
stacked memory) becomes an even more pressing problem. However, given the lack
of thermal modeling for memories in existing interval thermal simulation
toolchains, they are unsuitable for studying thermal management for
high-density processors.
To address this issue, we present CoMeT, the first integrated Core and Memory
interval Thermal simulation toolchain. CoMeT comprehensively supports thermal
simulation of high- and low-density processors corresponding to four different
core-memory configurations - off-chip DDR memory, off-chip 3D memory, 2.5D, and
3D. CoMeT supports several novel features that facilitate overlying system
research. Compared to an equivalent state-of-the-art core-only toolchain, CoMeT
adds only a ~5% simulation-time overhead. The source code of CoMeT has been
made open for public use under the MIT license.Comment: https://github.com/marg-tools/CoMe
A survey on scheduling and mapping techniques in 3D Network-on-chip
Network-on-Chips (NoCs) have been widely employed in the design of
multiprocessor system-on-chips (MPSoCs) as a scalable communication solution.
NoCs enable communications between on-chip Intellectual Property (IP) cores and
allow those cores to achieve higher performance by outsourcing their
communication tasks. Mapping and Scheduling methodologies are key elements in
assigning application tasks, allocating the tasks to the IPs, and organising
communication among them to achieve some specified objectives. The goal of this
paper is to present a detailed state-of-the-art of research in the field of
mapping and scheduling of applications on 3D NoC, classifying the works based
on several dimensions and giving some potential research directions
Investigation into yield and reliability enhancement of TSV-based three-dimensional integration circuits
Three dimensional integrated circuits (3D ICs) have been acknowledged as a promising technology to overcome the interconnect delay bottleneck brought by continuous CMOS scaling. Recent research shows that through-silicon-vias (TSVs), which act as vertical links between layers, pose yield and reliability challenges for 3D design. This thesis presents three original contributions.The first contribution presents a grouping-based technique to improve the yield of 3D ICs under manufacturing TSV defects, where regular and redundant TSVs are partitioned into groups. In each group, signals can select good TSVs using rerouting multiplexers avoiding defective TSVs. Grouping ratio (regular to redundant TSVs in one group) has an impact on yield and hardware overhead. Mathematical probabilistic models are presented for yield analysis under the influence of independent and clustering defect distributions. Simulation results using MATLAB show that for a given number of TSVs and TSV failure rate, careful selection of grouping ratio results in achieving 100% yield at minimal hardware cost (number of multiplexers and redundant TSVs) in comparison to a design that does not exploit TSV grouping ratios. The second contribution presents an efficient online fault tolerance technique based on redundant TSVs, to detect TSV manufacturing defects and address thermal-induced reliability issue. The proposed technique accounts for both fault detection and recovery in the presence of three TSV defects: voids, delamination between TSV and landing pad, and TSV short-to-substrate. Simulations using HSPICE and ModelSim are carried out to validate fault detection and recovery. Results show that regular and redundant TSVs can be divided into groups to minimise area overhead without affecting the fault tolerance capability of the technique. Synthesis results using 130-nm design library show that 100% repair capability can be achieved with low area overhead (4% for the best case). The last contribution proposes a technique with joint consideration of temperature mitigation and fault tolerance without introducing additional redundant TSVs. This is achieved by reusing spare TSVs that are frequently deployed for improving yield and reliability in 3D ICs. The proposed technique consists of two steps: TSV determination step, which is for achieving optimal partition between regular and spare TSVs into groups; The second step is TSV placement, where temperature mitigation is targeted while optimizing total wirelength and routing difference. Simulation results show that using the proposed technique, 100% repair capability is achieved across all (five) benchmarks with an average temperature reduction of 75.2? (34.1%) (best case is 99.8? (58.5%)), while increasing wirelength by a small amount
Temperature-Aware Design and Management for 3D Multi-Core Architectures
Vertically-integrated 3D multiprocessors systems-on-chip (3D MPSoCs) provide the means to continue integrating more functionality within a unit area while enhancing manufacturing yields and runtime performance. However, 3D MPSoCs incur amplified thermal challenges that undermine the corresponding reliability. To address these issues, several advanced cooling technologies, alongside temperature-aware design-time optimizations and run-time management schemes have been proposed. In this monograph, we provide an overall survey on the recent advances in temperature-aware 3D MPSoC considerations. We explore the recent advanced cooling strategies, thermal modeling frameworks, design-time optimizations and run-time thermal management schemes that are primarily targeted for 3D MPSoCs. Our aim of proposing this survey is to provide a global perspective, highlighting the advancements and drawbacks on the recent state-of-the-ar
Improving processor efficiency through thermal modeling and runtime management of hybrid cooling strategies
One of the main challenges in building future high performance systems is the ability to maintain safe on-chip temperatures in presence of high power densities. Handling such high power densities necessitates novel cooling solutions that are significantly more efficient than their existing counterparts. A number of advanced cooling methods have been proposed to address the temperature problem in processors. However, tradeoffs exist between performance, cost, and efficiency of those cooling methods, and these tradeoffs depend on the target system properties. Hence, a single cooling solution satisfying optimum conditions for any arbitrary system does not exist.
This thesis claims that in order to reach exascale computing, a dramatic improvement in energy efficiency is needed, and achieving this improvement requires a temperature-centric co-design of the cooling and computing subsystems. Such co-design requires detailed system-level thermal modeling, design-time optimization, and runtime management techniques that are aware of the underlying processor architecture and application requirements. To this end, this thesis first proposes compact thermal modeling methods to characterize the complex thermal behavior of cutting-edge cooling solutions, mainly Phase Change Material (PCM)-based cooling, liquid cooling, and thermoelectric cooling (TEC), as well as hybrid designs involving a combination of these. The proposed models are modular and they enable fast and accurate exploration of a large design space. Comparisons against multi-physics simulations and measurements on testbeds validate the accuracy of our models (resulting in less than 1C error on average) and demonstrate significant reductions in simulation time (up to four orders of magnitude shorter simulation times).
This thesis then introduces temperature-aware optimization techniques to maximize energy efficiency of a given system as a whole (including computing and cooling energy). The proposed optimization techniques approach the temperature problem from various angles, tackling major sources of inefficiency. One important angle is to understand the application power and performance characteristics and to design management techniques to match them. For workloads that require short bursts of intense parallel computation, we propose using PCM-based cooling in cooperation with a novel Adaptive Sprinting technique. By tracking the PCM state and incorporating this information during runtime decisions, Adaptive Sprinting utilizes the PCM heat storage capability more efficiently, achieving 29\% performance improvement compared to existing sprinting policies. In addition to the application characteristics, high heterogeneity in on-chip heat distribution is an important factor affecting efficiency. Hot spots occur on different locations of the chip with varying intensities; thus, designing a uniform cooling solution to handle worst-case hot spots significantly reduces the cooling efficiency. The hybrid cooling techniques proposed as part of this thesis address this issue by combining the strengths of different cooling methods and localizing the cooling effort over hot spots. Specifically, the thesis introduces LoCool, a cooling system optimizer that minimizes cooling power under temperature constraints for hybrid-cooled systems using TECs and liquid cooling. Finally, the scope of this work is not limited to existing advanced cooling solutions, but it also extends to emerging technologies and their potential benefits and tradeoffs. One such technology is integrated flow cell array, where fuel cells are pumped through microchannels, providing both cooling and on-chip power generation. This thesis explores a broad range of design parameters including maximum chip temperature, leakage power, and generated power for flow cell arrays in order to maximize the benefits of integrating this technology with computing systems. Through thermal modeling and runtime management techniques, and by exploring the design space of emerging cooling solutions, this thesis provides significant improvements in processor energy efficiency.2018-07-09T00:00:00
- …