397 research outputs found

    DeSyRe: on-Demand System Reliability

    No full text
    The DeSyRe project builds on-demand adaptive and reliable Systems-on-Chips (SoCs). As fabrication technology scales down, chips are becoming less reliable, thereby incurring increased power and performance costs for fault tolerance. To make matters worse, power density is becoming a significant limiting factor in SoC design, in general. In the face of such changes in the technological landscape, current solutions for fault tolerance are expected to introduce excessive overheads in future systems. Moreover, attempting to design and manufacture a totally defect and fault-free system, would impact heavily, even prohibitively, the design, manufacturing, and testing costs, as well as the system performance and power consumption. In this context, DeSyRe delivers a new generation of systems that are reliable by design at well-balanced power, performance, and design costs. In our attempt to reduce the overheads of fault-tolerance, only a small fraction of the chip is built to be fault-free. This fault-free part is then employed to manage the remaining fault-prone resources of the SoC. The DeSyRe framework is applied to two medical systems with high safety requirements (measured using the IEC 61508 functional safety standard) and tight power and performance constraints

    ECO-CHIP: Estimation of Carbon Footprint of Chiplet-based Architectures for Sustainable VLSI

    Full text link
    Decades of progress in energy-efficient and low-power design have successfully reduced the operational carbon footprint in the semiconductor industry. However, this has led to an increase in embodied emissions, encompassing carbon emissions arising from design, manufacturing, packaging, and other infrastructural activities. While existing research has developed tools to analyze embodied carbon at the computer architecture level for traditional monolithic systems, these tools do not apply to near-mainstream heterogeneous integration (HI) technologies. HI systems offer significant potential for sustainable computing by minimizing carbon emissions through two key strategies: ``reducing" computation by reusing pre-designed chiplet IP blocks and adopting hierarchical approaches to system design. The reuse of chiplets across multiple designs, even spanning multiple generations of integrated circuits (ICs), can substantially reduce embodied carbon emissions throughout the operational lifespan. This paper introduces a carbon analysis tool specifically designed to assess the potential of HI systems in facilitating greener VLSI system design and manufacturing approaches. The tool takes into account scaling, chiplet and packaging yields, design complexity, and even carbon overheads associated with advanced packaging techniques employed in heterogeneous systems. Experimental results demonstrate that HI can achieve a reduction of embodied carbon emissions up to 70\% compared to traditional large monolithic systems. These findings suggest that HI can pave the way for sustainable computing practices, contributing to a more environmentally conscious semiconductor industry.Comment: Under review at HPCA2

    An Efficient Cache Organization for On-Chip Multiprocessor Networks

    Get PDF
    To meet the growing computation-intensive applications and the needs of low-power, high-performance systems, the number of computing resources in single-chip has enormously increased. By adding many computing resources to build a system in System-on-Chip, its interconnection between each other becomes another challenging issue. In most System-on-Chip applications, a shared bus interconnection which needs an arbitration logic to serialize several bus access requests, is adopted to communicate with each integrated processing unit because of its low-cost and simple control characteristics. This paper focuses on the interconnection design issues of area, power and performance of chip multi-processors with shared cache memory. It shows that having shared cache memory contributes to the performance improvement, however, typical interconnection between cores and the shared cache using crossbar occupies most of the chip area, consumes a lot of power and does not scale efficiently with increased number of cores. New interconnection mechanisms are needed to address these issues. This paper proposes an architectural paradigm in an attempt to gain the advantages of having shared cache with the avoidance of penalty imposed by the crossbar interconnect. The proposed architecture achieves smaller area occupation allowing more space to add additional cache memory. It also reduces power consumption compared to the existing crossbar architecture. Furthermore, the paper presents a modified cache coherence algorithm called Tuned-MESI. It is based on the typical MESI cache coherence algorithm however it is tuned and tailored for the suggested architecture. The achieved results of the conducted simulated experiments show that the developed architecture produces less broadcast operations compared to the typical algorithm

    Power Reductions with Energy Recovery Using Resonant Topologies

    Get PDF
    The problem of power densities in system-on-chips (SoCs) and processors has become more exacerbated recently, resulting in high cooling costs and reliability issues. One of the largest components of power consumption is the low skew clock distribution network (CDN), driving large load capacitance. This can consume as much as 70% of the total dynamic power that is lost as heat, needing elaborate sensing and cooling mechanisms. To mitigate this, resonant clocking has been utilized in several applications over the past decade. An improved energy recovering reconfigurable generalized series resonance (GSR) solution with all the critical support circuitry is developed in this work. This LC resonant clock driver is shown to save about 50% driver power (\u3e40% overall), on a 22nm process node and has 50% less skew than a non-resonant driver at 2GHz. It can operate down to 0.2GHz to support other energy savings techniques like dynamic voltage and frequency scaling (DVFS). As an example, GSR can be configured for the simpler pulse series resonance (PSR) operation to enable further power saving for double data rate (DDR) applications, by using de-skewing latches instead of flip-flop banks. A PSR based subsystem for 40% savings in clocking power with 40% driver active area reduction xii is demonstrated. This new resonant driver generates tracking pulses at each transition of clock for dual edge operation across DVFS. PSR clocking is designed to drive explicit-pulsed latches with negative setup time. Simulations using 45nm IBM/PTM device and interconnect technology models, clocking 1024 flip-flops show the reductions, compared to non-resonant clocking. DVFS range from 2GHz/1.3V to 200MHz/0.5V is obtained. The PSR frequency is set \u3e3Ă— the clock rate, needing only 1/10th the inductance of prior-art LC resonance schemes. The skew reductions are achieved without needing to increase the interconnect widths owing to negative set-up times. Applications in data circuits are shown as well with a 90nm example. Parallel resonant and split-driver non-resonant configurations as well are derived from GSR. Tradeoffs in timing performance versus power, based on theoretical analysis, are compared for the first time and verified. This enables synthesis of an optimal topology for a given application from the GSR

    A C-DAG task model for scheduling complex real-time tasks on heterogeneous platforms: preemption matters

    Full text link
    Recent commercial hardware platforms for embedded real-time systems feature heterogeneous processing units and computing accelerators on the same System-on-Chip. When designing complex real-time application for such architectures, the designer needs to make a number of difficult choices: on which processor should a certain task be implemented? Should a component be implemented in parallel or sequentially? These choices may have a great impact on feasibility, as the difference in the processor internal architectures impact on the tasks' execution time and preemption cost. To help the designer explore the wide space of design choices and tune the scheduling parameters, in this paper we propose a novel real-time application model, called C-DAG, specifically conceived for heterogeneous platforms. A C-DAG allows to specify alternative implementations of the same component of an application for different processing engines to be selected off-line, as well as conditional branches to model if-then-else statements to be selected at run-time. We also propose a schedulability analysis for the C-DAG model and a heuristic allocation algorithm so that all deadlines are respected. Our analysis takes into account the cost of preempting a task, which can be non-negligible on certain processors. We demonstrate the effectiveness of our approach on a large set of synthetic experiments by comparing with state of the art algorithms in the literature

    Datacenter Design for Future Cloud Radio Access Network.

    Full text link
    Cloud radio access network (C-RAN), an emerging cloud service that combines the traditional radio access network (RAN) with cloud computing technology, has been proposed as a solution to handle the growing energy consumption and cost of the traditional RAN. Through aggregating baseband units (BBUs) in a centralized cloud datacenter, C-RAN reduces energy and cost, and improves wireless throughput and quality of service. However, designing a datacenter for C-RAN has not yet been studied. In this dissertation, I investigate how a datacenter for C-RAN BBUs should be built on commodity servers. I first design WiBench, an open-source benchmark suite containing the key signal processing kernels of many mainstream wireless protocols, and study its characteristics. The characterization study shows that there is abundant data level parallelism (DLP) and thread level parallelism (TLP). Based on this result, I then develop high performance software implementations of C-RAN BBU kernels in C++ and CUDA for both CPUs and GPUs. In addition, I generalize the GPU parallelization techniques of the Turbo decoder to the trellis algorithms, an important family of algorithms that are widely used in data compression and channel coding. Then I evaluate the performance of commodity CPU servers and GPU servers. The study shows that the datacenter with GPU servers can meet the LTE standard throughput with 4× to 16× fewer machines than with CPU servers. A further energy and cost analysis show that GPU servers can save on average 13× more energy and 6× more cost. Thus, I propose the C-RAN datacenter be built using GPUs as a server platform. Next I study resource management techniques to handle the temporal and spatial traffic imbalance in a C-RAN datacenter. I propose a “hill-climbing” power management that combines powering-off GPUs and DVFS to match the temporal C-RAN traffic pattern. Under a practical traffic model, this technique saves 40% of the BBU energy in a GPU-based C-RAN datacenter. For spatial traffic imbalance, I propose three workload distribution techniques to improve load balance and throughput. Among all three techniques, pipelining packets has the most throughput improvement at 10% and 16% for balanced and unbalanced loads, respectively.PhDComputer Science and EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/120825/1/qizheng_1.pd
    • …
    corecore