189 research outputs found

    Reconfigurable Architectures and Systems for IoT Applications

    Get PDF
    abstract: The Internet of Things (IoT) has become a popular topic in industry in recent years; it describes an ecosystem of internet-connected devices, or things, that enriches everyday life by improving our productivity and efficiency. The primary components of the IoT ecosystem are hardware, software, and services. While the software and services of an IoT system focus on collecting and processing data to make decisions, the underlying hardware is responsible for sensing information, preprocessing it, and transmitting it to the servers. Since the IoT ecosystem is still in its infancy, there is a great need for rapid prototyping platforms that help accelerate the hardware design process. Depending on the target IoT application, different sensors are required to sense signals such as heart rate, temperature, pressure, and acceleration, so there is also a great need for reconfigurable platforms that can prototype different sensor-interfacing circuits. This thesis focuses on two important hardware aspects of an IoT system: (a) an FPAA-based reconfigurable sensing front-end system and (b) an FPGA-based reconfigurable processing system. To enable reconfiguration for any sensor type, the Programmable ANalog Device Array (PANDA), a transistor-level analog reconfigurable platform, is proposed, along with the CAD tools required to implement front-end circuits on it. To demonstrate the platform on silicon, a small-scale array of 24×25 PANDA cells is fabricated in 65 nm technology. Several analog circuit building blocks, including amplifiers, bias circuits, and filters, are prototyped on the platform, demonstrating its effectiveness for rapidly prototyping IoT sensor interfaces. IoT systems typically use machine learning algorithms that run on servers to process the data and make decisions.
Recently, embedded processors have been used to preprocess the data at the energy-constrained sensor node or at the IoT gateway, which saves considerable transmission energy and bandwidth. Conventional CPU-based systems are not energy-efficient for implementing machine learning algorithms. Hence, an FPGA-based hardware accelerator is proposed, together with an optimization methodology that maximizes the throughput of any convolutional neural network (CNN) based machine learning algorithm on a resource-constrained FPGA.
Dissertation/Thesis Doctoral Dissertation Electrical Engineering 201
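The throughput optimization such a methodology performs can be pictured with a roofline-style estimate: each CNN layer is either compute-bound (MAC units) or memory-bound (weight bandwidth). The sketch below is a hypothetical model with made-up resource numbers, not the dissertation's actual formulation.

```python
# Roofline-style latency estimate for one CNN convolution layer on an FPGA.
# All parameters are illustrative assumptions, not values from the thesis.

def conv_layer_latency(out_h, out_w, out_c, in_c, k,
                       n_mac_units, clock_hz,
                       weight_bytes, bandwidth_bytes_s):
    """Return estimated latency (s) as the max of compute- and memory-bound time."""
    macs = out_h * out_w * out_c * in_c * k * k      # total multiply-accumulates
    t_compute = macs / (n_mac_units * clock_hz)      # all MAC units busy every cycle
    t_memory = weight_bytes / bandwidth_bytes_s      # time to stream the weights
    return max(t_compute, t_memory)

# Example: a 3x3 conv with a 56x56x128 output, 128 input channels,
# 512 DSP-based MAC units at 200 MHz, weights in 16-bit fixed point.
w_bytes = 128 * 128 * 3 * 3 * 2
lat = conv_layer_latency(56, 56, 128, 128, 3, 512, 200e6, w_bytes, 4e9)
print(f"estimated latency: {lat * 1e3:.2f} ms")
```

An optimizer in this style would sweep loop tiling and unrolling factors so that, for every layer, the two bounds are as balanced as the FPGA's DSP and BRAM budget allows.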

    Rapid SoC Design: On Architectures, Methodologies and Frameworks

    Full text link
    Modern applications like machine learning, autonomous vehicles, and 5G networking require an order-of-magnitude boost in processing capability. For several decades, chip designers have relied on Moore's Law, the doubling of transistor count every two years, to deliver improved performance, higher energy efficiency, and increased transistor density. With the end of Dennard scaling and a slowdown in Moore's Law, system architects have developed several techniques to deliver the traditional performance and power improvements we have come to expect. More recently, chip designers have turned toward heterogeneous systems composed of more specialized processing units that buttress the traditional ones. These specialized units improve the overall performance, power, and area (PPA) metrics across a wide variety of workloads and applications. While the GPU is the classical example, accelerators for machine learning, approximate computing, graph processing, and database applications have become commonplace. This has led to exponential growth in the variety (and count) of compute units found in modern embedded and high-performance computing platforms. The techniques adopted to combat the slowing of Moore's Law translate directly into increased complexity for modern system-on-chips (SoCs), which in turn increases design effort and validation time for hardware and the accompanying software stacks. This is further aggravated by fabrication challenges (photolithography, tooling, and yield) at advanced technology nodes (below 28 nm). The inherent complexity of modern SoCs translates into increased costs and time-to-market delays, across the spectrum from mobile/handheld processors to high-performance data-center appliances. This dissertation presents several techniques to address the challenges of rapidly bringing up complex SoCs.
The first part of this dissertation focuses on foundations and architectures that aid rapid SoC design, presenting a variety of architectural techniques developed and leveraged to rapidly construct complex SoCs at advanced process nodes. The next part addresses the gap between a completed design model (in RTL form) and its physical manifestation (a GDS file sent to the foundry for fabrication), presenting methodologies and a workflow for rapidly walking a design through to completion at arbitrary technology nodes, along with progress toward tools and a flow built entirely on open-source software. The last part presents a framework that not only speeds up the integration of a hardware accelerator into an SoC ecosystem, but also emphasizes software adoption and usability.
PHD Electrical and Computer Engineering University of Michigan, Horace H. Rackham School of Graduate Studies http://deepblue.lib.umich.edu/bitstream/2027.42/168119/1/ajayi_1.pd

    NeuroSim Simulator for Compute-in-Memory Hardware Accelerator: Validation and Benchmark

    Get PDF
    Compute-in-memory (CIM) is an attractive solution for processing the extensive multiply-and-accumulate (MAC) workloads in deep neural network (DNN) hardware accelerators. A simulator offering various mainstream and emerging memory technologies, architectures, and networks is a great convenience for fast early-stage design space exploration of CIM hardware accelerators. DNN+NeuroSim is an integrated benchmark framework supporting flexible and hierarchical CIM array design options from the device level, through the circuit level, up to the algorithm level. In this study, we validate and calibrate NeuroSim's predictions against post-layout simulations of a 40-nm RRAM-based CIM macro. First, the parameters of the memory device and CMOS transistors are extracted from the foundry's process design kit (PDK) and used in the NeuroSim settings; the peripheral modules and operating dataflow are also configured to match the actual chip implementation. Next, the area, critical path, and energy consumption values from module-level SPICE simulations are compared with those from NeuroSim. Adjustment factors are introduced to account for transistor sizing and wiring area in the layout, gate switching activity, post-layout performance drop, and other effects. We show that NeuroSim's prediction is precise, with chip-level error under 1% after calibration. Finally, a system-level performance benchmark is conducted with various device technologies and compared with the results from before the validation. The general conclusions stay the same after the validation, but performance degrades slightly due to the post-layout calibration.
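The calibration step can be illustrated with per-module adjustment factors: ratios fitted against SPICE references on one design, then applied to the raw estimates for another. All module names and numbers below are invented for illustration; they are not NeuroSim's or the chip's actual values.

```python
# Toy calibration of simulator module estimates against SPICE references.
# Factors are fit on one (hypothetical) calibration design ...
cal_spice = {"adc": 1.92, "array": 3.10, "mux": 0.45}   # post-layout energy (pJ)
cal_pred  = {"adc": 1.60, "array": 3.00, "mux": 0.50}   # raw simulator estimates
factors = {m: cal_spice[m] / cal_pred[m] for m in cal_spice}

# ... then applied to a second design's raw estimates, analogous to factors
# for wiring area, gate switching activity, post-layout drop, etc.
new_pred  = {"adc": 3.25, "array": 6.10, "mux": 0.91}
new_spice = {"adc": 3.90, "array": 6.20, "mux": 0.83}   # hypothetical references
calibrated = {m: new_pred[m] * factors[m] for m in new_pred}

# Chip-level relative error after calibration
err = abs(sum(calibrated.values()) - sum(new_spice.values())) / sum(new_spice.values())
print(f"chip-level relative error: {err:.2%}")
```

The point of the exercise is the one the abstract makes: once the factors absorb systematic layout effects, the aggregate chip-level error can fall below 1% even though individual modules still deviate.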

    Semiconductor Memory Applications in Radiation Environment, Hardware Security and Machine Learning System

    Get PDF
    abstract: Semiconductor memory is a key component of computing systems. Beyond conventional memory and data storage applications, this dissertation explores both mainstream and emerging non-volatile memory (eNVM) technologies for radiation environments, hardware security systems, and machine learning applications. In a radiation environment, e.g. aerospace, memory devices face various energetic particles. A particle strike can generate electron-hole pairs (directly or indirectly) as it passes through the semiconductor device, resulting in a photo-induced current that may change the memory state. First, the trend of radiation effects in mainstream memory technologies with technology-node scaling is reviewed. Then, single-event effects of oxide-based resistive random access memory (RRAM), one of the eNVM technologies, are investigated from the circuit level to the system level. The Physical Unclonable Function (PUF) has been widely investigated as a promising hardware security primitive; it employs the inherent randomness in a physical system (e.g. intrinsic semiconductor manufacturing variability). In this dissertation, two RRAM-based PUF implementations are proposed, for cryptographic key generation (weak PUF) and device authentication (strong PUF), respectively. The performance of the RRAM PUFs is evaluated through experiment and simulation. The impact of non-ideal circuit effects on PUF performance is also investigated, and optimization strategies are proposed to mitigate them. In addition, resistance against modeling and machine learning attacks is analyzed. Deep neural networks (DNNs) have shown remarkable improvements in various intelligent applications such as image classification, speech classification, and object localization and detection, and increasing effort has been devoted to developing hardware accelerators for them.
In this dissertation, two types of compute-in-memory (CIM) hardware accelerator designs, based on SRAM and eNVM technologies, are proposed for two binary neural networks, the hybrid BNN (HBNN) and the XNOR-BNN, respectively, targeting hardware-resource-limited platforms such as edge devices. These designs feature high throughput, scalability, low latency, and high energy efficiency. Finally, we have successfully taped out and validated the proposed SRAM-based designs in TSMC 65 nm technology. Overall, this dissertation paves the way for new applications of memory technologies in secure and energy-efficient artificial intelligence systems.
Dissertation/Thesis Doctoral Dissertation Electrical Engineering 201
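The XNOR-BNN's core arithmetic is worth making concrete: a dot product of ±1-valued weight and activation vectors reduces to a bitwise XNOR followed by a popcount, which is exactly what a CIM bitline can evaluate in parallel. A minimal software sketch of that identity (not the chip's circuit):

```python
# Binary (+1/-1) dot product via XNOR + popcount, the core operation an
# XNOR-BNN compute-in-memory array evaluates along its bitlines.
import random

def xnor_popcount_dot(w_bits, x_bits, n):
    """w_bits, x_bits: n-bit integers encoding +1 as bit 1 and -1 as bit 0."""
    matches = ~(w_bits ^ x_bits) & ((1 << n) - 1)   # XNOR: 1 where signs agree
    return 2 * bin(matches).count("1") - n           # popcount -> signed sum

# Check against the arithmetic definition on random ±1 vectors.
n = 16
w = [random.choice([-1, 1]) for _ in range(n)]
x = [random.choice([-1, 1]) for _ in range(n)]
w_bits = sum((1 << i) for i, v in enumerate(w) if v == 1)
x_bits = sum((1 << i) for i, v in enumerate(x) if v == 1)
assert xnor_popcount_dot(w_bits, x_bits, n) == sum(wi * xi for wi, xi in zip(w, x))
```

Because each output is just a popcount of matching bits, the analog array can accumulate it as a summed bitline current in one access, which is where the throughput and energy-efficiency claims come from.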

    AI/ML Algorithms and Applications in VLSI Design and Technology

    Full text link
    An evident challenge ahead for the integrated circuit (IC) industry in the nanometer regime is the investigation and development of methods that can reduce the design complexity ensuing from growing process variations and curtail the turnaround time of chip manufacturing. Conventional methodologies for such tasks are largely manual, and thus time-consuming and resource-intensive. In contrast, the unique learning strategies of artificial intelligence (AI) provide numerous exciting automated approaches for handling complex and data-intensive tasks in very-large-scale integration (VLSI) design and testing. Employing AI and machine learning (ML) algorithms in VLSI design and manufacturing reduces the time and effort needed to understand and process data within and across abstraction levels via automated learning algorithms, which in turn improves IC yield and reduces manufacturing turnaround time. This paper thoroughly reviews the AI/ML automated approaches introduced to date for VLSI design and manufacturing. Moreover, we discuss the future scope of AI/ML applications at various abstraction levels to revolutionize the field of VLSI design, aiming for high-speed, highly intelligent, and efficient implementations.

    Timing Analysis of Interconnects and Prediction of Design Rule Violations for Deep Sub-micron Circuit Design

    Get PDF
    Dissertation (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, February 2021. Advisor: Taewhan Kim. Timing analysis and clearing design rule violations are essential steps before taping out a chip. However, they keep getting harder in deep sub-micron circuits because the variations of transistors and interconnects are increasing and design rules are becoming more complex. This dissertation addresses two problems in timing analysis and design rule violations for synthesizing deep sub-micron circuits. Firstly, timing analysis at process corners cannot capture post-silicon performance accurately, because the slowest path at a process corner is not always the slowest one in the post-silicon instances. In addition, the proportion of interconnect delay in the critical path of a chip is increasing and exceeds 20% in sub-10nm technologies; to capture post-silicon performance accurately, the representative critical path circuit should therefore reflect not only FEOL (front-end-of-line) but also BEOL (back-end-of-line) variations. Since the number of BEOL metal layers exceeds ten, and each layer has resistance and capacitance variation intermixed with resistance variation on the vias between layers, synthesizing a representative critical path circuit that provides an accurate performance prediction requires exploring a very high-dimensional design space. To cope with this, I propose a BEOL-aware methodology for synthesizing a representative critical path circuit, which incrementally explores routing patterns (i.e., BEOL reconfiguring) as well as gate resizing, starting from an initial path circuit on the post-silicon target circuit.
Precisely, the synthesis framework for the critical path circuit integrates a set of novel techniques: (1) extracting and classifying BEOL configurations to reduce design space complexity, (2) formulating BEOL random variables for fast and accurate timing analysis, and (3) exploring alternative (ring oscillator) circuit structures to extend the applicability of this work. Secondly, the complexity of design rules has been increasing, resulting in more design rule violations during routing; at the same time, standard cell sizes keep decreasing, which makes routing harder. In the conventional P&R flow, the routability of a pre-routed layout is predicted from the routing congestion obtained by global routing, and placement is then optimized not to cause design rule violations. But this turns out to be inaccurate at advanced technology nodes, so routability must be predicted with more features. I propose a methodology for predicting the hotspots of design rule violations (DRVs) using machine learning on placement-related features together with conventional routing congestion, and for perturbing placed cells to reduce the number of DRVs. Precisely, the hotspots are predicted by a pre-trained binary classification model, and placement perturbation is performed by global optimization methods (e.g., metaheuristic or Bayesian optimization) to minimize the number of DRVs predicted by a pre-trained regression model. To do this, the framework is composed of three techniques: (1) dividing the circuit layout into multiple rectangular grids and, in the placement phase, extracting features such as pin density, cell density, and global routing results (demand, capacity, and overflow), (2) predicting whether each grid has DRVs using a binary classification model, and (3) perturbing the placed standard cells in the hotspots to minimize the number of DRVs predicted by a regression model.
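The predict-then-perturb flow described above can be sketched end to end in miniature: per-grid features feed a binary classifier that flags DRV hotspots. The features, labels, and hand-rolled logistic model below are synthetic stand-ins for the thesis's real placement data and pre-trained models.

```python
# Toy version of the DRV-hotspot flow: per-grid features -> binary classifier.
# Feature names and the labeling rule are illustrative assumptions.
import math
import random

random.seed(0)

def make_grids(n):
    """Synthetic grids: (pin_density, cell_density, routing_overflow) + label."""
    data = []
    for _ in range(n):
        pin, cell, ovf = random.random(), random.random(), random.random()
        # assumption: congested, overflowing grids tend to produce DRVs
        label = 1 if pin + cell + 2 * ovf > 2.0 else 0
        data.append(((pin, cell, ovf), label))
    return data

def train_logreg(data, lr=0.5, epochs=400):
    """Per-sample gradient descent on the logistic log-loss."""
    w, b = [0.0, 0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            p = 1 / (1 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
            g = p - y                                    # log-loss gradient
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(model, x):
    w, b = model
    return 1 / (1 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b))) > 0.5

model = train_logreg(make_grids(500))
test = make_grids(200)
acc = sum(predict(model, x) == y for x, y in test) / len(test)
print(f"hotspot classification accuracy: {acc:.0%}")
```

In the full flow, the grids flagged here would then have their cells moved by a global optimizer that queries a regression model for the predicted DRV count at each candidate placement.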

    Design of Resistive Synaptic Devices and Array Architectures for Neuromorphic Computing

    Get PDF
    abstract: Over the past few decades, silicon complementary metal-oxide-semiconductor (CMOS) technology has been greatly scaled down to achieve higher performance, higher density, and lower power consumption. As device dimensions approach their fundamental physical limit, there is an increasing demand for exploring emerging devices with operating principles distinct from conventional CMOS. In recent years, much effort has been devoted to research on next-generation emerging non-volatile memory (eNVM) technologies, such as resistive random access memory (RRAM) and phase change memory (PCM), to replace conventional digital memories (e.g. SRAM) as synapses in large-scale neuromorphic computing systems. Being compact and essentially "analog", these eNVM devices in a crossbar array can compute vector-matrix multiplication in parallel, significantly speeding up machine/deep learning algorithms. However, non-ideal eNVM device and array properties may hamper the learning accuracy. To quantify their impact, the sparse coding algorithm was used as a starting point; strategies to remedy the accuracy loss were proposed, and circuit-level design trade-offs were analyzed. At the architecture level, a parallel "pseudo-crossbar" array that prevents the write-disturbance issue was presented, and the peripheral circuits supporting various parallel array architectures were designed. One key component is the read circuit, which employs the integrate-and-fire neuron model to convert the analog column current to a digital output. Because this read circuit is not area-efficient, it was proposed to replace it with a compact two-terminal oscillation neuron device that exhibits the metal-insulator transition phenomenon. To facilitate design exploration, a circuit-level macro simulator, "NeuroSim", was developed in C++ to estimate the area, latency, energy, and leakage power of various neuromorphic architectures.
NeuroSim provides a wide variety of design options at the circuit and device levels and can be used alone or as a supporting module that provides circuit-level performance estimates for neural network algorithms. A two-layer multilayer perceptron (MLP) simulator integrated with NeuroSim was demonstrated to evaluate both learning accuracy and circuit-level performance metrics for online learning and offline classification, and to study the impact of eNVM reliability issues, such as data retention and write endurance, on learning performance.
Dissertation/Thesis Doctoral Dissertation Electrical Engineering 201
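The parallel vector-matrix multiplication that makes these crossbars attractive is just Ohm's law plus Kirchhoff's current law: each column current is I_j = Σ_i V_i · G_ij, with weights stored as conductances and the input applied as word-line voltages. A toy numerical sketch, with an assumed Gaussian multiplicative conductance variation standing in for the non-ideal device properties the abstract discusses:

```python
# Ideal and variation-afflicted crossbar vector-matrix multiplication.
# The 5% variation figure and all values below are illustrative assumptions.
import random

random.seed(1)

def crossbar_vmm(voltages, conductances, sigma=0.0):
    """Column currents I_j = sum_i V_i * G_ij; sigma is the fractional
    std dev of a Gaussian multiplicative device-to-device variation."""
    cols = len(conductances[0])
    out = []
    for j in range(cols):
        i_col = 0.0
        for i, v in enumerate(voltages):
            g = conductances[i][j] * (1 + random.gauss(0, sigma))
            i_col += v * g                  # Ohm's law + Kirchhoff current sum
        out.append(i_col)
    return out

V = [0.2, 0.0, 0.2]                              # read voltages (V)
G = [[1e-6, 2e-6], [3e-6, 1e-6], [2e-6, 2e-6]]   # weights as conductances (S)

ideal = crossbar_vmm(V, G, sigma=0.0)
noisy = crossbar_vmm(V, G, sigma=0.05)           # 5% conductance variation
print("ideal currents (A):", ideal)
print("with variation  (A):", noisy)
```

The gap between `ideal` and `noisy` is exactly the kind of accuracy-degrading non-ideality the dissertation quantifies at the algorithm level.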

    CoFHEE: A Co-processor for Fully Homomorphic Encryption Execution

    Full text link
    The migration of computation to the cloud has raised privacy concerns, as sensitive data becomes vulnerable to attacks when it must be decrypted for processing. Fully Homomorphic Encryption (FHE) mitigates this issue by enabling meaningful computations to be performed directly on encrypted data. Nevertheless, FHE is orders of magnitude slower than unencrypted computation, which hinders its practicality and adoption; improving FHE performance is therefore essential for real-world deployment. In this paper, we present a year-long effort to design, implement, fabricate, and post-silicon validate a hardware accelerator for Fully Homomorphic Encryption, dubbed CoFHEE. With a design area of 12 mm^2, CoFHEE aims to improve the performance of ciphertext multiplication, the most demanding arithmetic FHE operation, by accelerating several primitive operations on polynomials, such as polynomial addition and subtraction, the Hadamard product, and the Number Theoretic Transform. CoFHEE supports polynomial degrees of up to n = 2^14 with a maximum coefficient size of 128 bits, and it can perform ciphertext multiplication entirely on chip for n ≤ 2^13. CoFHEE is fabricated in 55 nm CMOS technology and achieves 250 MHz with our custom-built low-power digital PLL design. In addition, our chip includes two communication interfaces to the host machine: UART and SPI. This manuscript presents all steps and design techniques in the ASIC development process, from RTL design to fabrication and validation. We evaluate our chip with performance and power experiments and compare it against state-of-the-art software implementations and other ASIC designs. The developed RTL files are available in an open-source repository.
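The Number Theoretic Transform that CoFHEE accelerates is a DFT over Z_q: pointwise products in the transform domain give a cyclic convolution of polynomial coefficients (FHE schemes typically use the negacyclic variant mod x^n + 1, which adds a pre/post twist). A tiny illustrative version with q = 17 and n = 4, far below the chip's n = 2^14, using an O(n²) transform instead of a hardware butterfly network:

```python
# Number Theoretic Transform over Z_q and NTT-based polynomial multiplication.
# Parameters are toy values chosen so that a primitive N-th root of unity
# exists mod Q (13^4 = 1 and 13^2 = -1 mod 17).

Q, N = 17, 4
W = 13          # primitive N-th root of unity mod Q

def ntt(a, w):
    """O(n^2) forward transform; real designs use an O(n log n) butterfly."""
    return [sum(a[i] * pow(w, i * k, Q) for i in range(N)) % Q for k in range(N)]

def intt(a):
    """Inverse transform: forward NTT with w^-1, scaled by N^-1 mod Q."""
    n_inv = pow(N, Q - 2, Q)                 # N^-1 via Fermat's little theorem
    return [(x * n_inv) % Q for x in ntt(a, pow(W, Q - 2, Q))]

def poly_mul(a, b):
    """Cyclic convolution a*b mod (x^N - 1, Q) via pointwise NTT products."""
    fa, fb = ntt(a, W), ntt(b, W)
    return intt([x * y % Q for x, y in zip(fa, fb)])

a, b = [1, 2, 3, 4], [2, 0, 1, 0]
print(poly_mul(a, b))
```

Replacing the quadratic schoolbook convolution with this transform-multiply-invert pipeline is what makes on-chip ciphertext multiplication tractable at n = 2^13.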