
    Cross-Layer Design of Highly Scalable and Energy-Efficient AI Accelerator Systems Using Photonic Integrated Circuits

    Artificial Intelligence (AI) has experienced remarkable success in recent years, solving complex computational problems across various domains, including computer vision, natural language processing, and pattern recognition. Much of this success can be attributed to advancements in deep learning algorithms and models, particularly Artificial Neural Networks (ANNs). In recent times, deep ANNs have achieved unprecedented levels of accuracy, surpassing human capabilities in some cases. However, these deep ANN models come at a significant computational cost, with billions to trillions of parameters. Recent trends indicate that the number of parameters per ANN model will continue to grow exponentially in the foreseeable future. To meet the escalating computational demands of ANN models, the hardware accelerators used for processing ANNs must offer lower latency and higher energy efficiency. Unfortunately, traditional electronic implementations of ANN hardware accelerators, including CPUs, Graphics Processing Units (GPUs), Application-Specific Integrated Circuits (ASICs), and Field Programmable Gate Arrays (FPGAs), have fallen short of meeting the latency and energy efficiency requirements for processing deep ANN models. Furthermore, the interconnection network subsystems in these electronic accelerator systems, designed to facilitate large-scale data transfers between processing cores and memory/control units, have become bottlenecks that hinder the throughput, latency, and energy efficiency of deep ANN model processing. Fortunately, accelerator systems based on Photonic Integrated Circuits (PICs), featuring photonic network subsystems, are promising alternatives to conventional electronic accelerators. PIC-based accelerator systems operate in the optical domain, delivering processing at the speed of light with ultra-low latency, minimal dynamic energy consumption, and high throughput. These advantages stem from the wavelength division multiplexing capabilities and the absence of distance-dependent impedance in PICs. These characteristics also enable the implementation of high-performance photonic network subsystems within PIC-based accelerator systems. Additionally, PIC-based accelerator systems offer inherent optical nonlinearities. Despite these numerous advantages over electronic accelerators, PIC-based systems still encounter several challenges stemming from the limited optical power budget, susceptibility to crosstalk and other sources of noise caused by analog operation, high area consumption, and restricted functional flexibility of PICs. These challenges manifest in several ways: (i) a significant trade-off between the achievable processing core size and the supported bit precision impedes the scalability of processing cores; (ii) limited reconfigurability, in terms of supported computing size and precision, makes these systems less adaptable to modern ANN models with diverse computational and precision demands; and (iii) the reliance on electronic adder networks for accumulation diminishes the latency and energy benefits of PIC-based accelerator systems, due to the frequent analog-to-digital conversions and memory accesses involved in accumulation. My research has contributed several solutions that overcome many of these challenges and improve the throughput, energy efficiency, and flexibility of PIC-based AI accelerator systems.
I identified and analyzed factors that affect the scalability and reconfigurability of PIC-based AI accelerator systems. I proposed several novel PIC-based accelerator architectures with enhancements at the circuit level, architecture level, and system level to improve scalability, reconfigurability, and functional flexibility. At the circuit level, these enhancements serve to decrease optical signal losses, reduce control complexity, enable adaptability for various ANN processing tasks, and lower power and area consumption. The architecture-level improvements mitigate crosstalk noise, facilitate functional reconfigurability, enable in-situ and flexible spatio-temporal accumulation, and provide flexible support for different dataflows. The system-level enhancements involve the integration of stochastic computing with PIC-based accelerators to break the inherent trade-off between scalability and supported bit precision. Additionally, applying stochastic computing enhances the flexibility of PIC-based accelerators, allowing them to support mixed-precision ANN models. These cross-layer enhancements collectively contribute to the design of PIC-based AI accelerator systems, resulting in improved throughput, energy efficiency, scalability, and reconfigurability.
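
To make the stochastic-computing idea above concrete, the short Python sketch below (my own illustration, not the dissertation's architecture; all names are hypothetical) encodes two operands as unipolar bitstreams and multiplies them with a bitwise AND. Effective precision then comes from the bitstream length rather than from the analog resolution of each photonic multiply, which is the trade-off the system-level enhancement exploits.

```python
import random

def to_bitstream(x, length, rng):
    """Unipolar stochastic encoding: each bit is 1 with probability x, x in [0, 1]."""
    return [1 if rng.random() < x else 0 for _ in range(length)]

def sc_multiply(x, y, length=1024, seed=0):
    """Estimate x * y by ANDing two independent bitstreams.

    For uncorrelated streams, P(a AND b) = P(a) * P(b), so the mean of the
    ANDed stream converges to the product; longer streams give more precision."""
    rng = random.Random(seed)
    a = to_bitstream(x, length, rng)
    b = to_bitstream(y, length, rng)
    return sum(ai & bi for ai, bi in zip(a, b)) / length

if __name__ == "__main__":
    exact = 0.8 * 0.4
    for n in (64, 1024, 16384):
        print(f"stream length {n:6d}: estimate {sc_multiply(0.8, 0.4, n):.4f} vs exact {exact:.4f}")
```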

    Multiple Independent Gate FETs: How Many Gates Do We Need?

    Multiple Independent Gate Field Effect Transistors (MIGFETs) are expected to push FET technology further along the semiconductor roadmap. In a MIGFET, supplementary gates either provide (i) enhanced conduction properties or (ii) more intelligent switching functions. In general, each additional gate also introduces an implementation cost. To enable more efficient digital systems, MIGFETs must leverage their expressive power to realize complex logic circuits with few physical resources. Researchers then face the question: how many gates do we need? In this paper, we address the logic side of this question. We determine whether or not an increasing number of gates leads to more compact logic implementations. For this purpose, we develop a logic synthesis flow that intrinsically exploits a MIGFET's switching function. Using simplified design assumptions and device/interconnect models, we synthesize MCNC benchmarks on 5 promising MIGFET devices, with the number of gates ranging from 1 to 7. Experimental results reveal nontrivial area/delay/energy minima, located between 1 and 4 gates, depending on the MIGFET's switching function and the device/interconnect technology.
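
As a rough, logic-level illustration of what "expressive power" means here, the Python sketch below (my own toy model under simplified assumptions, not the paper's synthesis flow) treats a three-gate ambipolar-style device as a switch whose conduction is a Boolean function of its gates; reprogramming the polarity gate makes the same physical device realize a different switching function of its control gates.

```python
from itertools import product

def conducts(pg, cg1, cg2):
    """Toy switching model for a hypothetical three-gate device.

    The polarity gate (pg) selects n-type or p-type behaviour; the device
    conducts when both control gates match that polarity. This is an
    illustrative model only, not a calibrated device equation."""
    if pg:
        return cg1 and cg2                 # n-type: conducts on high inputs
    return (not cg1) and (not cg2)         # p-type: conducts on low inputs

# One device, two switching functions (AND vs NOR of the control gates),
# selected purely by the polarity-gate value.
for pg in (0, 1):
    on_set = [(c1, c2) for c1, c2 in product((0, 1), repeat=2) if conducts(pg, c1, c2)]
    print(f"pg={pg}: conducting control-gate patterns = {on_set}")
```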

    Semiconductor Memory Applications in Radiation Environment, Hardware Security and Machine Learning System

    Semiconductor memory is a key component of computing systems. Beyond conventional memory and data storage applications, this dissertation explores both mainstream and emerging non-volatile memory (eNVM) technologies for radiation environments, hardware security systems, and machine learning applications. In radiation environments, e.g., aerospace, memory devices are exposed to various energetic particles. These particles can generate electron-hole pairs (directly or indirectly) as they pass through a semiconductor device, producing a photo-induced current that may change the memory state. First, the trend of radiation effects in mainstream memory technologies with technology node scaling is reviewed. Then, single-event effects of oxide-based resistive switching random access memory (RRAM), one of the eNVM technologies, are investigated from the circuit level to the system level. The Physical Unclonable Function (PUF) has been widely investigated as a promising hardware security primitive, as it employs the inherent randomness in a physical system (e.g., intrinsic semiconductor manufacturing variability). In this dissertation, two RRAM-based PUF implementations are proposed, for cryptographic key generation (weak PUF) and device authentication (strong PUF), respectively. The performance of the RRAM PUFs is evaluated with experiments and simulation. The impact of non-ideal circuit effects on the performance of the PUFs is also investigated, and optimization strategies are proposed to mitigate these effects. In addition, resistance against modeling and machine learning attacks is analyzed. Deep neural networks (DNNs) have shown remarkable improvements in various intelligent applications such as image classification, speech classification, and object localization and detection, and increasing effort has been devoted to developing hardware accelerators for them. In this dissertation, two types of compute-in-memory (CIM) based hardware accelerator designs with SRAM and eNVM technologies are proposed for two binary neural networks, i.e., hybrid BNN (HBNN) and XNOR-BNN, respectively, which are explored for hardware-resource-limited platforms, e.g., edge devices. These designs feature high throughput, scalability, low latency, and high energy efficiency. Finally, the proposed designs have been successfully taped out and validated with SRAM technology in TSMC 65 nm. Overall, this dissertation paves the path for new applications of memory technologies toward secure and energy-efficient artificial intelligence systems.
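
As a rough software illustration of the weak-PUF concept mentioned above, the sketch below (assumed variability model and hypothetical names, not the dissertation's circuits) derives key bits by comparing the resistances of neighbouring RRAM cells and checks uniqueness with the inter-chip Hamming distance.

```python
import random

def rram_resistances(n_cells, mean=10e3, sigma=0.15, rng=None):
    """Model post-forming RRAM resistances with a simple random spread.

    The spread stands in for intrinsic process/cycling variability;
    the numbers are placeholders, not measured device data."""
    rng = rng or random.Random()
    return [mean * (1 + rng.gauss(0, sigma)) for _ in range(n_cells)]

def weak_puf_key(resistances):
    """Derive one key bit per cell pair by comparing neighbouring resistances."""
    return [int(resistances[i] > resistances[i + 1])
            for i in range(0, len(resistances) - 1, 2)]

def hamming_distance(a, b):
    return sum(x != y for x, y in zip(a, b))

if __name__ == "__main__":
    chip_a = weak_puf_key(rram_resistances(256, rng=random.Random(1)))
    chip_b = weak_puf_key(rram_resistances(256, rng=random.Random(2)))
    # Uniqueness: keys from different chips should differ in roughly 50% of bits.
    hd = hamming_distance(chip_a, chip_b)
    print(f"key length = {len(chip_a)} bits, inter-chip HD = {hd} "
          f"({100 * hd / len(chip_a):.1f}%)")
```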

    An Efficient Gate Library for Ambipolar CNTFET Logic

    Recently, several emerging technologies have been reported as potential candidates for controllable ambipolar devices. Controllable ambipolarity is a desirable property that enables on-line configurability of n-type and p-type device polarity. In this paper, we introduce a new design methodology for logic gates based on controllable ambipolar devices, with an emphasis on carbon nanotubes as the candidate technology. Our technique results in ambipolar gates with a higher expressive power than conventional complementary metal-oxide-semiconductor (CMOS) libraries. We propose a library of static ambipolar carbon nanotube field effect transistor (CNTFET) gates based on generalized NOR-NAND-AOI-OAI primitives, which efficiently implements XOR-based functions. Technology mapping of several multi-level logic benchmarks that extensively use the XOR function, including multipliers, adders, and linear circuits, with ambipolar CNTFET logic gates indicates that on average it is possible to reduce the number of logic levels by 42%, the delay by 26%, and the power consumption by 32%, resulting in an energy-delay-product (EDP) reduction of 59% over the same circuits mapped with unipolar CNTFET logic gates. Based on the projections in [1], which state that defect-free CNTFETs will provide a 5× performance improvement over metal-oxide-semiconductor field effect transistors, the ambipolar library provides a performance improvement of 7×, a 57% reduction in power consumption, and a 20× improvement in EDP over the CMOS library.
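
The advantage on XOR-like functions can be illustrated at the switch level. In a common simplified view of ambipolar devices, a single transistor conducts when its control gate matches its polarity gate, which is an XNOR condition realized by one device, whereas unipolar switches need the two-product cover ab + a'b'. The Python check below is my own switch-level sketch under those assumptions, not the paper's gate library or its cost model.

```python
from itertools import product

def xnor(a, b):
    return int(a == b)

# Ambipolar view: one device whose conduction condition is
# "control gate equals polarity gate" realizes XNOR directly.
ambipolar_switches = 1

# Unipolar view: the same function needs the cover a*b + a'*b',
# i.e. two series branches of two single-gate switches each.
unipolar_branches = [lambda a, b: a and b,
                     lambda a, b: (not a) and (not b)]

# Verify the unipolar cover equals XNOR on every input combination.
for a, b in product((0, 1), repeat=2):
    assert xnor(a, b) == int(any(branch(a, b) for branch in unipolar_branches))

print(f"ambipolar switches needed: {ambipolar_switches}")
print(f"unipolar switches needed : {2 * len(unipolar_branches)}")
```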

    Multifunctional Spin Logic Operations in Graphene Spin Circuits

    Spin-based computing, combining logic and nonvolatile magnetic memory, is promising for emerging information technologies. However, the realization of a universal spin logic operation, representing a reconfigurable building block with all-electrical spin-current communication, has so far remained challenging. Here, we experimentally demonstrate reprogrammable all-electrical multifunctional spin logic operations in a nanoelectronic device architecture, utilizing graphene buses for spin communication and mixing and nanomagnets for writing and reading information at room temperature. This device realizes a multistate spin-majority logic operation, which is reconfigured to achieve (N)AND, (N)OR, and XNOR Boolean operations, depending on the magnetization of the inputs. The results are in good agreement with the predictions from a spin-circuit model, providing an experimental demonstration of a spin-based logic unit that takes advantage of the vector nature of spin, as opposed to conventional scalar charge-based devices. These spin logic operations in large-area graphene are fully compatible with industrial fabrication processes and represent a promising platform for a scalable all-electric spin-based logic-in-memory computing architecture.
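
The reconfigurability of a majority primitive is easy to verify at the Boolean level: fixing one input of a three-input majority function turns it into AND or OR, which is the logic-level counterpart of reprogramming the device through its input magnetizations (the inverting and XNOR modes reported above rely on the device's multistate readout and are not modeled here). The check below is a plain Python illustration of this well-known identity, not the paper's spin-circuit model.

```python
from itertools import product

def maj3(a, b, c):
    """Three-input Boolean majority: output is 1 when at least two inputs are 1."""
    return int(a + b + c >= 2)

# Fixing one input reprograms the same majority primitive into AND or OR.
for a, b in product((0, 1), repeat=2):
    assert maj3(a, b, 0) == (a & b)   # control input = 0  ->  AND
    assert maj3(a, b, 1) == (a | b)   # control input = 1  ->  OR

print("MAJ(a, b, 0) == AND and MAJ(a, b, 1) == OR verified for all inputs")
```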

    Techniques of Energy-Efficient VLSI Chip Design for High-Performance Computing

    Implementing high-quality computing within a limited power budget is the key factor driving very large scale integration (VLSI) chip design forward. This work introduces several low-power VLSI design techniques used in state-of-the-art computing. From the viewpoint of power supply, conventional in-chip voltage regulators based on analog blocks impose large power and area overheads on computational chips. Motivated by this, a digital switchable-pin method that dynamically regulates power at low circuit cost is proposed so that computation can execute with a stable voltage supply. For the multiplier, one of the most widely used and time-consuming arithmetic units, operation in the logarithmic domain offers advantages over the binary domain in terms of computation latency, power, and area. However, the introduced conversion error reduces the reliability of subsequent computation (e.g., multiplication and division). In this work, a fast calibration method that suppresses the conversion error, together with its VLSI implementation, is proposed. The proposed logarithmic converter can be supplied by DC power to achieve fast conversion or by clocked power to reduce the power dissipated during conversion. Moving beyond traditional computation methods and widely used static logic, a neuron-like cell is also studied in this work. Using multiple-input floating-gate (MIFG) metal-oxide-semiconductor field-effect transistor (MOSFET) based logic, a 32-bit, 16-operation arithmetic logic unit (ALU) with zipped decoding and a feedback loop is designed. The proposed ALU reduces switching power and has strong drive capability, owing to its coupling capacitors, compared to a static-logic-based ALU. In addition, recent neural computations pose serious challenges to digital VLSI implementation due to heavy matrix multiplications and non-linear functions. An analog VLSI design that is compatible with an external digital environment is proposed for long short-term memory (LSTM) networks. The entire analog network computes much faster and has higher energy efficiency than its digital counterpart.
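
To show where the conversion error of a logarithmic multiplier comes from, the sketch below implements a common baseline, Mitchell's piecewise-linear approximation (log2(1+f) ≈ f), which may differ from the converter actually designed in this work; the error printed at the end is what a calibration stage would aim to suppress. All names are illustrative.

```python
import math

def mitchell_log2(x):
    """Mitchell's approximation: log2(2^k * (1+f)) ~= k + f, with 0 <= f < 1.

    The dropped term log2(1+f) - f is the conversion error a calibration
    method would target."""
    k = math.floor(math.log2(x))
    f = x / (1 << k) - 1.0
    return k + f

def mitchell_antilog2(y):
    """Inverse approximation: 2^(k+f) ~= 2^k * (1+f)."""
    k = math.floor(y)
    f = y - k
    return (1 << k) * (1.0 + f)

def mitchell_multiply(a, b):
    """Approximate a*b by adding approximate logarithms, then converting back."""
    return mitchell_antilog2(mitchell_log2(a) + mitchell_log2(b))

if __name__ == "__main__":
    for a, b in [(13, 27), (100, 7), (255, 255)]:
        approx, exact = mitchell_multiply(a, b), a * b
        err = 100 * (exact - approx) / exact
        print(f"{a} * {b}: exact={exact}, approx={approx:.1f}, error={err:+.2f}%")
```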

    COMPUTE-IN-MEMORY WITH EMERGING NON-VOLATILE MEMORIES FOR ACCELERATING DEEP NEURAL NETWORKS

    The objective of this research is to accelerate deep neural networks (DNNs) with an emerging non-volatile memory (eNVM) based compute-in-memory (CIM) architecture. The research first focuses on inference acceleration and proposes a resistive random access memory (RRAM) based CIM architecture. Two generations of RRAM testchips, which monolithically integrate the RRAM memory array and CMOS peripheral circuits, are designed and fabricated using Winbond 90 nm and TSMC 40 nm commercial embedded RRAM processes, respectively. The first-generation testchip, named XNOR-RRAM, is dedicated to binary neural networks (BNNs), and the second generation, named Flex-RRAM, features 1-bit to 8-bit run-time configurable precision and leverages the input sparsity of the DNN model to improve throughput and energy efficiency. However, the non-ideal characteristics of eNVM devices, especially when utilized as multi-level analog synaptic weights, may incur notable accuracy degradation for both training and inference. This research develops a PyTorch-based framework that incorporates the device characteristics into the DNN model to evaluate the impact of eNVM non-idealities on training/inference accuracy. The results suggest that it is challenging to directly use eNVMs for in-situ training, and that resistance drift remains a critical challenge for maintaining high inference accuracy. Furthermore, to overcome the challenges posed by the asymmetric conductance tuning behavior of typical eNVMs, found to be the most critical non-ideality preventing the model from achieving software-equivalent training accuracy, this research proposes a novel 2-transistor-1-FeFET (ferroelectric field effect transistor) based synaptic weight cell that exploits hybrid precision for in-situ training and inference, achieving near-software classification accuracy on the MNIST and CIFAR-10 datasets.
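
The arithmetic that an XNOR-BNN CIM macro accelerates reduces to an XNOR followed by a popcount. The sketch below shows the standard identity in plain Python (with the popcount done in software rather than by a column of bitcells): for inputs and weights encoded as bits standing for ±1, the bipolar dot product equals 2·popcount(XNOR(x, w)) − n.

```python
def xnor_popcount_dot(x_bits, w_bits):
    """Binary-neural-network dot product in the {-1, +1} domain.

    With activations/weights encoded as bits (1 -> +1, 0 -> -1), the bipolar
    dot product equals 2 * popcount(XNOR(x, w)) - n; in a CIM macro that
    popcount is what the column current or charge would accumulate."""
    assert len(x_bits) == len(w_bits)
    n = len(x_bits)
    matches = sum(1 for x, w in zip(x_bits, w_bits) if x == w)  # XNOR then popcount
    return 2 * matches - n

if __name__ == "__main__":
    x = [1, 0, 1, 1, 0, 0, 1, 0]
    w = [1, 1, 1, 0, 0, 1, 1, 0]
    # Reference computation in the +/-1 domain for comparison.
    to_bipolar = lambda bits: [1 if b else -1 for b in bits]
    ref = sum(a * b for a, b in zip(to_bipolar(x), to_bipolar(w)))
    print(xnor_popcount_dot(x, w), ref)  # both should print the same value
```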

    Reinventing Integrated Photonic Devices and Circuits for High Performance Communication and Computing Applications

    The long-standing technological pillars of computing systems evolution, namely Moore's law and the Von Neumann architecture, are breaking down under the pressure of meeting the capacity and energy efficiency demands of computing and communication architectures designed to process modern data-centric applications related to Artificial Intelligence (AI), Big Data, and the Internet-of-Things (IoT). In response, both industry and academia have turned to ‘more-than-Moore’ technologies for realizing hardware architectures for communication and computing. Fortunately, Silicon Photonics (SiPh) has emerged as one highly promising ‘more-than-Moore’ technology. Recent progress has enabled SiPh-based interconnects to outperform traditional electrical interconnects, offering advantages like high bandwidth density, near-light-speed data transfer, distance-independent bitrate, and low energy consumption. Furthermore, SiPh-based electro-optic (E-O) computing circuits have exhibited up to two orders of magnitude improvements in performance and energy efficiency compared to their electronic counterparts. Thus, SiPh stands out as a compelling solution for creating high-performance and energy-efficient hardware for communication and computing applications. Despite their advantages, SiPh-based interconnects face various design challenges that hamper their reliability, scalability, performance, and energy efficiency. These include limited optical power budget (OPB), high static power dissipation, crosstalk noise, fabrication and on-chip temperature variations, and limited spectral bandwidth for multiplexing. Similarly, SiPh-based E-O computing circuits also face several challenges. First, E-O circuits for simple logic functions lack all-electrical input handling, which raises hardware area and complexity. Second, E-O arithmetic circuits occupy vast areas (at least 100x that of CMOS implementations) while rarely achieving more than 60% hardware utilization, leading to high idle times and non-amortizable area and static power overheads. Third, the high area overhead of E-O circuits hinders high spatial parallelism on-chip, because it limits the number of E-O circuits that can be implemented on a reticle-size-limited chip. My research offers significant contributions to address the aforementioned challenges. For SiPh-based interconnects, my contributions focus on enhancing OPB by mitigating crosstalk noise, addressing optical non-linearity-related issues through the development of Silicon-on-Sapphire-based photonic interconnects, exploring multi-level signaling, and evaluating various device-level design pathways. This enables the design of high-throughput (>1 Tbps) and energy-efficient (<1 pJ/bit) SiPh interconnects. In the context of SiPh-based E-O circuits, my contributions include the design of a microring-based polymorphic E-O logic gate, a hybrid time-amplitude analog optical modulator, an indium tin oxide-based silicon nitride microring modulator, and a weight bank for neural network computations. These designs significantly reduce the area overhead of current E-O computing circuits while enhancing energy efficiency and hardware utilization.
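
A microring weight bank of the broadcast-and-weight style can be pictured as a per-wavelength multiplier followed by a photodetector that sums the weighted powers. The sketch below is a purely linear toy model of that dot product, with signed weights handled by splitting onto two detectors (balanced detection); the names and the ideal linear behaviour are my assumptions, not the fabricated ITO/SiN devices described above.

```python
def microring_weight_bank(inputs, weights):
    """Toy broadcast-and-weight dot product.

    Each input rides on its own wavelength; a tuned microring sets a
    transmission in [0, 1] for that wavelength (the 'weight'); a
    photodetector sums the weighted optical powers into one current.
    Signed weights are handled by splitting positive and negative parts
    onto two detectors and subtracting (balanced detection)."""
    pos = sum(x * w for x, w in zip(inputs, weights) if w > 0)
    neg = sum(x * -w for x, w in zip(inputs, weights) if w < 0)
    return pos - neg

if __name__ == "__main__":
    x = [0.2, 0.9, 0.5, 0.1]      # per-wavelength input powers (normalized)
    w = [0.8, -0.3, 0.5, -0.9]    # per-ring weights in [-1, 1]
    # The toy model should match a plain dot product.
    print(microring_weight_bank(x, w), sum(a * b for a, b in zip(x, w)))
```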