
    Cost-Effective Design of Mesh-of-Tree Interconnect for Multi-Core Clusters with 3-D Stacked L2 Scratchpad Memory

    3-D integrated circuits (3-D ICs) offer a promising solution to overcome the scaling limitations of 2-D ICs. However, using too many through-silicon vias (TSVs) has a negative impact on 3-D ICs because of the large overhead of each TSV (e.g., large footprint and low yield). In this paper, we propose a new TSV sharing method for a circuit-switched 3-D mesh-of-tree (MoT) interconnect, which supports high-throughput and low-latency communication between processing cores and 3-D stacked multibanked L2 scratchpad memory. The proposed method supports traffic balancing and TSV-failure-tolerant routing, and advocates a modular design strategy that allows stacking multiple identical memory dies without the need for different masks for dies at different levels of the memory stack. We also investigate various parameters of 3-D memory stacking (e.g., fabrication technology, TSV bonding technique, number of memory tiers, and TSV sharing scheme) that affect interconnect latency, system performance, and fabrication cost. Compared to a conventional MoT interconnect that is straightforwardly adapted to 3-D integration, the proposed method yields up to 2.11x and 1.11x improvements in cost efficiency (i.e., performance/cost) for microbump TSV bonding and direct Cu–Cu TSV bonding techniques, respectively.
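    The figure of merit quoted above is simply performance divided by fabrication cost. A minimal sketch of how such a ratio and a relative improvement could be computed follows; the numbers are hypothetical placeholders, not values from the paper.

```python
# Minimal sketch of the performance/cost figure of merit mentioned above.
# All numbers below are hypothetical placeholders, not data from the paper.

def cost_efficiency(performance_mips: float, fabrication_cost_usd: float) -> float:
    """Cost efficiency = delivered performance per unit of fabrication cost."""
    return performance_mips / fabrication_cost_usd

baseline = cost_efficiency(performance_mips=1000.0, fabrication_cost_usd=10.0)
proposed = cost_efficiency(performance_mips=1050.0, fabrication_cost_usd=5.0)

print(f"improvement: x{proposed / baseline:.2f}")  # x2.10 with these placeholder values
```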

    Mr.Wolf: An Energy-Precision Scalable Parallel Ultra Low Power SoC for IoT Edge Processing

    This paper presents Mr. Wolf, a parallel ultra-low-power (PULP) system on chip (SoC) featuring a hierarchical architecture with a small (12 kgates) microcontroller (MCU)-class RISC-V core augmented with an autonomous I/O subsystem for efficient data transfer from a wide set of peripherals. The small core can offload compute-intensive kernels to an eight-core, floating-point-capable processing engine available on demand. The proposed SoC, implemented in a 40-nm LP CMOS technology, features a 108-µW fully retentive memory (512 kB). The I/O subsystem is capable of transferring up to 1.6 Gbit/s from external devices to the memory while consuming less than 2.5 mW. The eight-core compute cluster achieves a peak performance of 850 million 32-bit integer multiply-accumulate operations per second (MMAC/s) and 500 million 32-bit floating-point multiply-accumulate operations per second (MFMAC/s), i.e., 1 GFLOP/s, with an energy efficiency of up to 15 MMAC/s/mW and 9 MFMAC/s/mW. These building blocks are supported by aggressive on-chip power conversion and management, enabling energy-proportional heterogeneous computing for always-on IoT end nodes and improving performance by several orders of magnitude with respect to traditional single-core MCUs within a power envelope of 153 mW. We demonstrate the capabilities of the proposed SoC on a wide set of near-sensor processing kernels, showing that Mr. Wolf can deliver performance of up to 16.4 GOp/s with energy efficiency of up to 274 MOp/s/mW on real-life applications, paving the way for always-on data analytics on high-bandwidth sensors at the edge of the Internet of Things.
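    As a quick sanity check on the figures above: one multiply-accumulate counts as two floating-point operations, so 500 MFMAC/s corresponds to 1 GFLOP/s, and dividing peak throughput by the quoted efficiency gives a rough estimate of cluster power. The sketch below is my own back-of-the-envelope arithmetic using only the numbers quoted in the abstract, not additional data from the paper.

```python
# Back-of-the-envelope check of the Mr. Wolf figures quoted in the abstract above.

peak_mmac_s = 850.0    # million 32-bit integer MACs per second
peak_mfmac_s = 500.0   # million 32-bit floating-point MACs per second
eff_mmac_s_mw = 15.0   # MMAC/s per mW
eff_mfmac_s_mw = 9.0   # MFMAC/s per mW

# One multiply-accumulate counts as two floating-point operations.
gflops = peak_mfmac_s * 2 / 1000.0
print(f"peak FP throughput: {gflops:.1f} GFLOP/s")  # 1.0 GFLOP/s

# Peak throughput divided by efficiency gives a rough estimate of cluster power.
print(f"integer cluster power ~ {peak_mmac_s / eff_mmac_s_mw:.0f} mW")    # ~57 mW
print(f"FP cluster power      ~ {peak_mfmac_s / eff_mfmac_s_mw:.0f} mW")  # ~56 mW
```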

    A Power-Efficient 3-D On-Chip Interconnect for Multi-Core Accelerators with Stacked L2 Cache

    The use of multi-core clusters is a promising option for data-intensive embedded applications such as multi-modal sensor fusion, image understanding, and mobile augmented reality. In this paper, we propose a power-efficient 3-D on-chip interconnect for multi-core clusters with stacked L2 cache memory. A new switch design makes a circuit-switched mesh-of-tree (MoT) interconnect reconfigurable so that it supports power-gating of processing cores, memory blocks, and unnecessary interconnect resources (routing switches, arbitration switches, and the inverters placed along the on-chip wires). The proposed 3-D MoT improves power efficiency by up to 77% in terms of energy-delay product (EDP), as illustrated by the sketch below.
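    EDP, the metric quoted above, is the energy consumed by a workload multiplied by its execution time. A minimal sketch of how an EDP reduction would be computed from measured energy and delay follows; the measurements are hypothetical placeholders, not results from the paper.

```python
# Minimal sketch of the energy-delay product (EDP) metric quoted above.
# The measurements below are hypothetical placeholders, not from the paper.

def edp(energy_mj: float, delay_ms: float) -> float:
    """Energy-delay product: energy consumed times execution time."""
    return energy_mj * delay_ms

baseline    = edp(energy_mj=10.0, delay_ms=4.00)  # e.g. a conventional 3-D MoT
power_gated = edp(energy_mj=2.2,  delay_ms=4.18)  # e.g. with power-gated resources

reduction = 1.0 - power_gated / baseline
print(f"EDP reduction: {reduction:.0%}")  # ~77% with these placeholder values
```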

    Memory Hierarchy Design for Next Generation Scalable Many-core Platforms

    Performance and energy consumption in modern computing platforms are largely dominated by the memory hierarchy. The increasing computational power of multiprocessors and accelerators, and the emergence of data-intensive workloads (e.g., large-scale graph traversal and scientific algorithms) that require fast transfer of large volumes of data, are two main trends that intensify this problem by putting even higher pressure on the memory hierarchy. This widening gap between computation speed and data-transfer speed is commonly referred to as the “memory wall” problem. With the emergence of heterogeneous three-dimensional (3D) integration based on through-silicon vias (TSVs), this situation has started to improve in recent years. On one hand, it is now possible to improve memory access bandwidth and/or latency either by stacking memories directly on top of processors or through abstracted memory interfaces such as Micron’s Hybrid Memory Cube (HMC). On the other hand, near-memory computation has become worth revisiting due to the cost-effective integration of logic and memory in 3D stacks. These two directions bring about several interesting opportunities, including performance improvement, energy and cost reduction, product miniaturization, and modular design for improved time to market. In this research, we study the effectiveness of 3D integration technology and the optimization opportunities it can provide in the different layers of the memory hierarchy in cluster-based many-core platforms, ranging from intra-cluster L1 to inter-cluster L2 scratchpad memories (SPMs), as well as the main memory. In addition, by moving part of the computation to where the data resides, in the 3D-stacked memory context, we demonstrate further energy and performance improvement opportunities.

    Rapid SoC Design: On Architectures, Methodologies and Frameworks

    Modern applications like machine learning, autonomous vehicles, and 5G networking require an order-of-magnitude boost in processing capability. For several decades, chip designers have relied on Moore’s Law, the doubling of transistor count every two years, to deliver improved performance, higher energy efficiency, and an increase in transistor density. With the end of Dennard scaling and a slowdown in Moore’s Law, system architects have developed several techniques to deliver on the traditional performance and power improvements we have come to expect. More recently, chip designers have turned towards heterogeneous systems composed of more specialized processing units to buttress the traditional processing units. These specialized units improve the overall performance, power, and area (PPA) metrics across a wide variety of workloads and applications. While the GPU serves as a classical example, accelerators for machine learning, approximate computing, graph processing, and database applications have become commonplace. This has led to an exponential growth in the variety (and count) of these compute units found in modern embedded and high-performance computing platforms. The various techniques adopted to combat the slowing of Moore’s Law directly translate to an increase in complexity for modern systems-on-chip (SoCs). This increase in complexity in turn leads to an increase in design effort and validation time for hardware and the accompanying software stacks. This is further aggravated by fabrication challenges (photolithography, tooling, and yield) faced at advanced technology nodes (below 28 nm). The inherent complexity in modern SoCs translates into increased costs and time-to-market delays. This holds true across the spectrum, from mobile/handheld processors to high-performance data-center appliances. This dissertation presents several techniques to address the challenges of rapidly birthing complex SoCs. The first part of this dissertation focuses on foundations and architectures that aid in rapid SoC design. It presents a variety of architectural techniques that were developed and leveraged to rapidly construct complex SoCs at advanced process nodes. The next part of the dissertation focuses on the gap between a completed design model (in RTL form) and its physical manifestation (a GDS file that will be sent to the foundry for fabrication). It presents methodologies and a workflow for rapidly walking a design through to completion at arbitrary technology nodes. It also presents progress on creating tools and a flow that is entirely dependent on open-source tools. The last part presents a framework that not only speeds up the integration of a hardware accelerator into an SoC ecosystem, but also emphasizes software adoption and usability.
    PhD dissertation, Electrical and Computer Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/168119/1/ajayi_1.pd

    High-Speed Performance, Power and Thermal Co-simulation For SoC Design

    This dissertation presents a multi-faceted effort at developing standard System Design Language-based tools that allow designers to model the power and thermal behavior of SoCs, including heterogeneous SoCs that contain non-digital components. The research contributions made in this dissertation include:
    • SystemC-based power/performance co-simulation for the Intel XScale microprocessor. We performed detailed characterization of the power dissipation patterns of a variety of system components and used these results to build detailed power models, including a highly accurate, validated instruction-level power model of the XScale processor. We also proposed a scalable, efficient, and validated methodology for incorporating fast, accurate power-modeling capabilities into system description languages such as SystemC. This was validated against physical measurements of hardware power dissipation.
    • Modeling the behavior of non-digital SoC components within standard System Design Languages. We presented an approach for modeling the functionality, performance, power, and thermal behavior of a complex class of non-digital components, MEMS microhotplate-based gas sensors, within a SystemC design framework. The components modeled include both digital components (such as microprocessors, busses, and memory) and the MEMS devices comprising a gas-sensor SoC. The first SystemC models of a MEMS-based SoC and the first SystemC models of MEMS thermal behavior were described. Techniques for significantly improving simulation speed were proposed, and their impact quantified.
    • Vertically integrated, execution-driven power, performance, and thermal co-simulation for SoCs. We adapted the above techniques and used numerical methods to model the system of differential equations that governs on-chip thermal diffusion. This allows a single high-speed simulation to span performance, power, and thermal modeling of a design. It also allows feedback behaviors, such as the impact of temperature on power dissipation or performance, to be modeled seamlessly. We validated the thermal equation-solving engine on test layouts against detailed low-level tools, and illustrated the power of such a strategy by demonstrating a series of studies that designers can perform using such tools. We also assessed how simulation speed and accuracy are impacted by the spatial and temporal resolution used for thermal modeling.
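    The thermal engine described above numerically solves the differential equations governing on-chip heat diffusion. The sketch below shows the general idea with a textbook explicit finite-difference step for a 2-D heat equation with a power-density source term; it is my own illustration under assumed placeholder material constants and grid sizes, not the dissertation’s solver.

```python
# Illustrative explicit finite-difference step for 2-D on-chip thermal diffusion:
#   dT/dt = alpha * (d2T/dx2 + d2T/dy2) + p / (rho * c)
# Generic textbook scheme, not the solver from the dissertation; all material
# constants, grid sizes, and the heat source below are placeholder values.
import numpy as np

NX, NY = 64, 64   # thermal grid cells
DX = 1e-4         # cell size [m]
DT = 1e-6         # time step [s]
ALPHA = 1e-4      # thermal diffusivity [m^2/s] (placeholder)
RHO_C = 1.6e6     # volumetric heat capacity [J/(m^3 K)] (placeholder)

T = np.full((NX, NY), 300.0)        # temperature field [K]
power_density = np.zeros((NX, NY))  # dissipated power density [W/m^3]
power_density[24:40, 24:40] = 1e8   # a hot block, e.g. an active core (placeholder)

def step(T, power_density):
    """Advance the temperature field by one explicit Euler time step.
    Edge cells are held at 300 K, acting as an ideal heat sink boundary."""
    lap = np.zeros_like(T)
    lap[1:-1, 1:-1] = (
        T[2:, 1:-1] + T[:-2, 1:-1] + T[1:-1, 2:] + T[1:-1, :-2]
        - 4.0 * T[1:-1, 1:-1]
    ) / DX**2
    return T + DT * (ALPHA * lap + power_density / RHO_C)

for _ in range(1000):
    T = step(T, power_density)
print(f"peak on-chip temperature: {T.max():.1f} K")
```

    Coupling such a solver to a performance/power simulation amounts to feeding per-block power numbers into power_density each step and reading temperatures back to model temperature-dependent leakage or throttling, which is the kind of feedback behavior the contribution above refers to.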