189 research outputs found

    Reconfigurable Architectures and Systems for IoT Applications

    Get PDF
    abstract: The Internet of Things (IoT) has become a popular topic in industry in recent years; it describes an ecosystem of internet-connected devices, or things, that enriches everyday life by improving our productivity and efficiency. The primary components of the IoT ecosystem are hardware, software, and services. While the software and services of an IoT system focus on collecting and processing data to make decisions, the underlying hardware is responsible for sensing information, preprocessing it, and transmitting it to the servers. Since the IoT ecosystem is still in its infancy, there is a great need for rapid prototyping platforms that help accelerate the hardware design process. Depending on the target IoT application, different sensors are required to sense signals such as heart rate, temperature, pressure, and acceleration, so there is also a great need for reconfigurable platforms that can prototype different sensor-interfacing circuits. This thesis focuses on two important hardware aspects of an IoT system: (a) an FPAA-based reconfigurable sensing front-end system and (b) an FPGA-based reconfigurable processing system. To enable reconfiguration for any sensor type, the Programmable ANalog Device Array (PANDA), a transistor-level analog reconfigurable platform, is proposed, along with the CAD tools required to implement front-end circuits on it. To demonstrate the platform on silicon, a small-scale array of 24×25 PANDA cells is fabricated in 65 nm technology. Several analog circuit building blocks, including amplifiers, bias circuits, and filters, are prototyped on the platform, demonstrating its effectiveness for rapidly prototyping IoT sensor interfaces. IoT systems typically use machine learning algorithms that run on servers to process the data and make decisions.
Recently, embedded processors have been used to preprocess the data at the energy-constrained sensor node or at the IoT gateway, which saves considerable transmission energy and bandwidth. Conventional CPU-based systems are not energy-efficient for implementing machine learning algorithms. Hence, an FPGA-based hardware accelerator is proposed, together with an optimization methodology that maximizes the throughput of any convolutional neural network (CNN) based machine learning algorithm on a resource-constrained FPGA.
Dissertation/Thesis Doctoral Dissertation Electrical Engineering 201
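The throughput optimization such a methodology performs can be pictured with a roofline-style estimate: each CNN layer is either compute-bound (MAC units) or memory-bound (weight bandwidth). The sketch below is a hypothetical model with made-up resource numbers, not the dissertation's actual formulation.

```python
# Roofline-style latency estimate for one CNN convolution layer on an FPGA.
# All parameters are illustrative assumptions, not values from the thesis.

def conv_layer_latency(out_h, out_w, out_c, in_c, k,
                       n_mac_units, clock_hz,
                       weight_bytes, bandwidth_bytes_s):
    """Return estimated latency (s) as the max of compute- and memory-bound time."""
    macs = out_h * out_w * out_c * in_c * k * k      # total multiply-accumulates
    t_compute = macs / (n_mac_units * clock_hz)      # all MAC units busy every cycle
    t_memory = weight_bytes / bandwidth_bytes_s      # time to stream the weights
    return max(t_compute, t_memory)

# Example: a 3x3 conv with a 56x56x128 output, 128 input channels,
# 512 DSP-based MAC units at 200 MHz, weights in 16-bit fixed point.
w_bytes = 128 * 128 * 3 * 3 * 2
lat = conv_layer_latency(56, 56, 128, 128, 3, 512, 200e6, w_bytes, 4e9)
print(f"estimated latency: {lat * 1e3:.2f} ms")
```

An optimizer in this style would sweep loop tiling and unrolling factors so that, for every layer, the two bounds are as balanced as the FPGA's DSP and BRAM budget allows.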

    Rapid SoC Design: On Architectures, Methodologies and Frameworks

    Full text link
    Modern applications like machine learning, autonomous vehicles, and 5G networking require an order-of-magnitude boost in processing capability. For several decades, chip designers have relied on Moore's Law, the doubling of transistor count every two years, to deliver improved performance, higher energy efficiency, and increased transistor density. With the end of Dennard scaling and a slowdown in Moore's Law, system architects have developed several techniques to deliver the traditional performance and power improvements we have come to expect. More recently, chip designers have turned toward heterogeneous systems composed of more specialized processing units that buttress the traditional ones. These specialized units improve the overall performance, power, and area (PPA) metrics across a wide variety of workloads and applications. While the GPU is the classical example, accelerators for machine learning, approximate computing, graph processing, and database applications have become commonplace. This has led to exponential growth in the variety (and count) of compute units found in modern embedded and high-performance computing platforms. The techniques adopted to combat the slowing of Moore's Law translate directly into increased complexity for modern system-on-chips (SoCs), which in turn increases design effort and validation time for hardware and the accompanying software stacks. This is further aggravated by fabrication challenges (photolithography, tooling, and yield) at advanced technology nodes (below 28 nm). The inherent complexity of modern SoCs translates into increased costs and time-to-market delays, across the spectrum from mobile/handheld processors to high-performance data-center appliances. This dissertation presents several techniques to address the challenges of rapidly bringing up complex SoCs.
The first part of this dissertation focuses on foundations and architectures that aid rapid SoC design, presenting a variety of architectural techniques developed and leveraged to rapidly construct complex SoCs at advanced process nodes. The next part addresses the gap between a completed design model (in RTL form) and its physical manifestation (a GDS file sent to the foundry for fabrication), presenting methodologies and a workflow for rapidly walking a design through to completion at arbitrary technology nodes, along with progress toward tools and a flow built entirely on open-source software. The last part presents a framework that not only speeds up the integration of a hardware accelerator into an SoC ecosystem, but also emphasizes software adoption and usability.
PHD Electrical and Computer Engineering University of Michigan, Horace H. Rackham School of Graduate Studies http://deepblue.lib.umich.edu/bitstream/2027.42/168119/1/ajayi_1.pd

    NeuroSim Simulator for Compute-in-Memory Hardware Accelerator: Validation and Benchmark

    Get PDF
    Compute-in-memory (CIM) is an attractive solution for processing the extensive multiply-and-accumulate (MAC) workloads in deep neural network (DNN) hardware accelerators. A simulator offering various mainstream and emerging memory technologies, architectures, and networks is a great convenience for fast early-stage design space exploration of CIM hardware accelerators. DNN+NeuroSim is an integrated benchmark framework supporting flexible and hierarchical CIM array design options from the device level, through the circuit level, up to the algorithm level. In this study, we validate and calibrate NeuroSim's predictions against post-layout simulations of a 40-nm RRAM-based CIM macro. First, the parameters of the memory device and CMOS transistors are extracted from the foundry's process design kit (PDK) and used in the NeuroSim settings; the peripheral modules and operating dataflow are also configured to match the actual chip implementation. Next, the area, critical path, and energy consumption values from module-level SPICE simulations are compared with those from NeuroSim. Adjustment factors are introduced to account for transistor sizing and wiring area in the layout, gate switching activity, post-layout performance drop, and other effects. We show that NeuroSim's prediction is precise, with chip-level error under 1% after calibration. Finally, a system-level performance benchmark is conducted with various device technologies and compared with the results from before the validation. The general conclusions stay the same after the validation, but performance degrades slightly due to the post-layout calibration.
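The calibration step can be illustrated with per-module adjustment factors: ratios fitted against SPICE references on one design, then applied to the raw estimates for another. All module names and numbers below are invented for illustration; they are not NeuroSim's or the chip's actual values.

```python
# Toy calibration of simulator module estimates against SPICE references.
# Factors are fit on one (hypothetical) calibration design ...
cal_spice = {"adc": 1.92, "array": 3.10, "mux": 0.45}   # post-layout energy (pJ)
cal_pred  = {"adc": 1.60, "array": 3.00, "mux": 0.50}   # raw simulator estimates
factors = {m: cal_spice[m] / cal_pred[m] for m in cal_spice}

# ... then applied to a second design's raw estimates, analogous to factors
# for wiring area, gate switching activity, post-layout drop, etc.
new_pred  = {"adc": 3.25, "array": 6.10, "mux": 0.91}
new_spice = {"adc": 3.90, "array": 6.20, "mux": 0.83}   # hypothetical references
calibrated = {m: new_pred[m] * factors[m] for m in new_pred}

# Chip-level relative error after calibration
err = abs(sum(calibrated.values()) - sum(new_spice.values())) / sum(new_spice.values())
print(f"chip-level relative error: {err:.2%}")
```

The point of the exercise is the one the abstract makes: once the factors absorb systematic layout effects, the aggregate chip-level error can fall below 1% even though individual modules still deviate.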

    Semiconductor Memory Applications in Radiation Environment, Hardware Security and Machine Learning System

    Get PDF
    abstract: Semiconductor memory is a key component of computing systems. Beyond conventional memory and data storage applications, this dissertation explores both mainstream and emerging non-volatile memory (eNVM) technologies for radiation environments, hardware security systems, and machine learning applications. In a radiation environment, e.g. aerospace, memory devices face various energetic particles. A particle strike can generate electron-hole pairs (directly or indirectly) as it passes through the semiconductor device, resulting in a photo-induced current that may change the memory state. First, the trend of radiation effects in mainstream memory technologies with technology-node scaling is reviewed. Then, single-event effects of oxide-based resistive random access memory (RRAM), one of the eNVM technologies, are investigated from the circuit level to the system level. The Physical Unclonable Function (PUF) has been widely investigated as a promising hardware security primitive; it employs the inherent randomness in a physical system (e.g. intrinsic semiconductor manufacturing variability). In this dissertation, two RRAM-based PUF implementations are proposed, for cryptographic key generation (weak PUF) and device authentication (strong PUF), respectively. The performance of the RRAM PUFs is evaluated through experiment and simulation. The impact of non-ideal circuit effects on PUF performance is also investigated, and optimization strategies are proposed to mitigate them. In addition, resistance against modeling and machine learning attacks is analyzed. Deep neural networks (DNNs) have shown remarkable improvements in various intelligent applications such as image classification, speech classification, and object localization and detection, and increasing effort has been devoted to developing hardware accelerators for them.
In this dissertation, two types of compute-in-memory (CIM) hardware accelerator designs, based on SRAM and eNVM technologies, are proposed for two binary neural networks, the hybrid BNN (HBNN) and the XNOR-BNN, respectively, targeting hardware-resource-limited platforms such as edge devices. These designs feature high throughput, scalability, low latency, and high energy efficiency. Finally, we have successfully taped out and validated the proposed SRAM-based designs in TSMC 65 nm technology. Overall, this dissertation paves the way for new applications of memory technologies in secure and energy-efficient artificial intelligence systems.
Dissertation/Thesis Doctoral Dissertation Electrical Engineering 201
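The XNOR-BNN's core arithmetic is worth making concrete: a dot product of ±1-valued weight and activation vectors reduces to a bitwise XNOR followed by a popcount, which is exactly what a CIM bitline can evaluate in parallel. A minimal software sketch of that identity (not the chip's circuit):

```python
# Binary (+1/-1) dot product via XNOR + popcount, the core operation an
# XNOR-BNN compute-in-memory array evaluates along its bitlines.
import random

def xnor_popcount_dot(w_bits, x_bits, n):
    """w_bits, x_bits: n-bit integers encoding +1 as bit 1 and -1 as bit 0."""
    matches = ~(w_bits ^ x_bits) & ((1 << n) - 1)   # XNOR: 1 where signs agree
    return 2 * bin(matches).count("1") - n           # popcount -> signed sum

# Check against the arithmetic definition on random ±1 vectors.
n = 16
w = [random.choice([-1, 1]) for _ in range(n)]
x = [random.choice([-1, 1]) for _ in range(n)]
w_bits = sum((1 << i) for i, v in enumerate(w) if v == 1)
x_bits = sum((1 << i) for i, v in enumerate(x) if v == 1)
assert xnor_popcount_dot(w_bits, x_bits, n) == sum(wi * xi for wi, xi in zip(w, x))
```

Because each output is just a popcount of matching bits, the analog array can accumulate it as a summed bitline current in one access, which is where the throughput and energy-efficiency claims come from.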

    AI/ML Algorithms and Applications in VLSI Design and Technology

    Full text link
    An evident challenge ahead for the integrated circuit (IC) industry in the nanometer regime is the investigation and development of methods that can reduce the design complexity ensuing from growing process variations and curtail the turnaround time of chip manufacturing. Conventional methodologies for such tasks are largely manual, and thus time-consuming and resource-intensive. In contrast, the unique learning strategies of artificial intelligence (AI) provide numerous exciting automated approaches for handling complex and data-intensive tasks in very-large-scale integration (VLSI) design and testing. Employing AI and machine learning (ML) algorithms in VLSI design and manufacturing reduces the time and effort needed to understand and process data within and across abstraction levels via automated learning algorithms, which in turn improves IC yield and reduces manufacturing turnaround time. This paper thoroughly reviews the AI/ML automated approaches introduced to date for VLSI design and manufacturing. Moreover, we discuss the future scope of AI/ML applications at various abstraction levels to revolutionize the field of VLSI design, aiming for high-speed, highly intelligent, and efficient implementations.

    Timing Analysis of Interconnects and Prediction of Design Rule Violations for Deep Sub-micron Circuit Design

    Get PDF
    Dissertation (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, February 2021. Advisor: Taewhan Kim. Timing analysis and clearing design rule violations are essential steps before taping out a chip. However, they keep getting harder in deep sub-micron circuits because the variations of transistors and interconnects are increasing and design rules are becoming more complex. This dissertation addresses two problems in timing analysis and design rule violations for synthesizing deep sub-micron circuits. Firstly, timing analysis at process corners cannot capture post-silicon performance accurately, because the slowest path at a process corner is not always the slowest one in the post-silicon instances. In addition, the proportion of interconnect delay in the critical path of a chip is increasing and exceeds 20% in sub-10nm technologies; to capture post-silicon performance accurately, the representative critical path circuit should therefore reflect not only FEOL (front-end-of-line) but also BEOL (back-end-of-line) variations. Since the number of BEOL metal layers exceeds ten, and each layer has resistance and capacitance variation intermixed with resistance variation on the vias between layers, synthesizing a representative critical path circuit that provides an accurate performance prediction requires exploring a very high-dimensional design space. To cope with this, I propose a BEOL-aware methodology for synthesizing a representative critical path circuit, which incrementally explores routing patterns (i.e., BEOL reconfiguring) as well as gate resizing, starting from an initial path circuit on the post-silicon target circuit.
Precisely, the synthesis framework for the critical path circuit integrates a set of novel techniques: (1) extracting and classifying BEOL configurations to reduce design space complexity, (2) formulating BEOL random variables for fast and accurate timing analysis, and (3) exploring alternative (ring oscillator) circuit structures to extend the applicability of this work. Secondly, the complexity of design rules has been increasing, resulting in more design rule violations during routing; at the same time, standard cell sizes keep decreasing, which makes routing harder. In the conventional P&R flow, the routability of a pre-routed layout is predicted from the routing congestion obtained by global routing, and placement is then optimized not to cause design rule violations. But this turns out to be inaccurate at advanced technology nodes, so routability must be predicted with more features. I propose a methodology for predicting the hotspots of design rule violations (DRVs) using machine learning on placement-related features together with conventional routing congestion, and for perturbing placed cells to reduce the number of DRVs. Precisely, the hotspots are predicted by a pre-trained binary classification model, and placement perturbation is performed by global optimization methods (e.g., metaheuristic or Bayesian optimization) to minimize the number of DRVs predicted by a pre-trained regression model. To do this, the framework is composed of three techniques: (1) dividing the circuit layout into multiple rectangular grids and, in the placement phase, extracting features such as pin density, cell density, and global routing results (demand, capacity, and overflow), (2) predicting whether each grid has DRVs using a binary classification model, and (3) perturbing the placed standard cells in the hotspots to minimize the number of DRVs predicted by a regression model.
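The predict-then-perturb flow described above can be sketched end to end in miniature: per-grid features feed a binary classifier that flags DRV hotspots. The features, labels, and hand-rolled logistic model below are synthetic stand-ins for the thesis's real placement data and pre-trained models.

```python
# Toy version of the DRV-hotspot flow: per-grid features -> binary classifier.
# Feature names and the labeling rule are illustrative assumptions.
import math
import random

random.seed(0)

def make_grids(n):
    """Synthetic grids: (pin_density, cell_density, routing_overflow) + label."""
    data = []
    for _ in range(n):
        pin, cell, ovf = random.random(), random.random(), random.random()
        # assumption: congested, overflowing grids tend to produce DRVs
        label = 1 if pin + cell + 2 * ovf > 2.0 else 0
        data.append(((pin, cell, ovf), label))
    return data

def train_logreg(data, lr=0.5, epochs=400):
    """Per-sample gradient descent on the logistic log-loss."""
    w, b = [0.0, 0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, y in data:
            p = 1 / (1 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
            g = p - y                                    # log-loss gradient
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(model, x):
    w, b = model
    return 1 / (1 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b))) > 0.5

model = train_logreg(make_grids(500))
test = make_grids(200)
acc = sum(predict(model, x) == y for x, y in test) / len(test)
print(f"hotspot classification accuracy: {acc:.0%}")
```

In the full flow, the grids flagged here would then have their cells moved by a global optimizer that queries a regression model for the predicted DRV count at each candidate placement.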

    Design of Resistive Synaptic Devices and Array Architectures for Neuromorphic Computing

    Get PDF
    abstract: Over the past few decades, silicon complementary metal-oxide-semiconductor (CMOS) technology has been greatly scaled down to achieve higher performance, higher density, and lower power consumption. As device dimensions approach their fundamental physical limit, there is an increasing demand for exploring emerging devices with operating principles distinct from conventional CMOS. In recent years, much effort has been devoted to research on next-generation emerging non-volatile memory (eNVM) technologies, such as resistive random access memory (RRAM) and phase change memory (PCM), to replace conventional digital memories (e.g. SRAM) as synapses in large-scale neuromorphic computing systems. Being compact and essentially "analog", these eNVM devices in a crossbar array can compute vector-matrix multiplication in parallel, significantly speeding up machine/deep learning algorithms. However, non-ideal eNVM device and array properties may hamper the learning accuracy. To quantify their impact, the sparse coding algorithm was used as a starting point; strategies to remedy the accuracy loss were proposed, and circuit-level design trade-offs were analyzed. At the architecture level, a parallel "pseudo-crossbar" array that prevents the write-disturbance issue was presented, and the peripheral circuits supporting various parallel array architectures were designed. One key component is the read circuit, which employs the integrate-and-fire neuron model to convert the analog column current to a digital output. Because this read circuit is not area-efficient, it was proposed to replace it with a compact two-terminal oscillation neuron device that exhibits the metal-insulator transition phenomenon. To facilitate design exploration, a circuit-level macro simulator, "NeuroSim", was developed in C++ to estimate the area, latency, energy, and leakage power of various neuromorphic architectures.
NeuroSim provides a wide variety of design options at the circuit and device levels and can be used alone or as a supporting module that provides circuit-level performance estimates for neural network algorithms. A two-layer multilayer perceptron (MLP) simulator integrated with NeuroSim was demonstrated to evaluate both learning accuracy and circuit-level performance metrics for online learning and offline classification, and to study the impact of eNVM reliability issues, such as data retention and write endurance, on learning performance.
Dissertation/Thesis Doctoral Dissertation Electrical Engineering 201
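The parallel vector-matrix multiplication that makes these crossbars attractive is just Ohm's law plus Kirchhoff's current law: each column current is I_j = Σ_i V_i · G_ij, with weights stored as conductances and the input applied as word-line voltages. A toy numerical sketch, with an assumed Gaussian multiplicative conductance variation standing in for the non-ideal device properties the abstract discusses:

```python
# Ideal and variation-afflicted crossbar vector-matrix multiplication.
# The 5% variation figure and all values below are illustrative assumptions.
import random

random.seed(1)

def crossbar_vmm(voltages, conductances, sigma=0.0):
    """Column currents I_j = sum_i V_i * G_ij; sigma is the fractional
    std dev of a Gaussian multiplicative device-to-device variation."""
    cols = len(conductances[0])
    out = []
    for j in range(cols):
        i_col = 0.0
        for i, v in enumerate(voltages):
            g = conductances[i][j] * (1 + random.gauss(0, sigma))
            i_col += v * g                  # Ohm's law + Kirchhoff current sum
        out.append(i_col)
    return out

V = [0.2, 0.0, 0.2]                              # read voltages (V)
G = [[1e-6, 2e-6], [3e-6, 1e-6], [2e-6, 2e-6]]   # weights as conductances (S)

ideal = crossbar_vmm(V, G, sigma=0.0)
noisy = crossbar_vmm(V, G, sigma=0.05)           # 5% conductance variation
print("ideal currents (A):", ideal)
print("with variation  (A):", noisy)
```

The gap between `ideal` and `noisy` is exactly the kind of accuracy-degrading non-ideality the dissertation quantifies at the algorithm level.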

    CoFHEE: A Co-processor for Fully Homomorphic Encryption Execution

    Full text link
    The migration of computation to the cloud has raised privacy concerns, as sensitive data becomes vulnerable to attacks when it must be decrypted for processing. Fully Homomorphic Encryption (FHE) mitigates this issue by enabling meaningful computations to be performed directly on encrypted data. Nevertheless, FHE is orders of magnitude slower than unencrypted computation, which hinders its practicality and adoption; improving FHE performance is therefore essential for real-world deployment. In this paper, we present a year-long effort to design, implement, fabricate, and post-silicon validate a hardware accelerator for Fully Homomorphic Encryption, dubbed CoFHEE. With a design area of 12 mm^2, CoFHEE aims to improve the performance of ciphertext multiplication, the most demanding arithmetic FHE operation, by accelerating several primitive operations on polynomials, such as polynomial addition and subtraction, the Hadamard product, and the Number Theoretic Transform. CoFHEE supports polynomial degrees of up to n = 2^14 with a maximum coefficient size of 128 bits, and it can perform ciphertext multiplication entirely on chip for n ≤ 2^13. CoFHEE is fabricated in 55 nm CMOS technology and achieves 250 MHz with our custom-built low-power digital PLL design. In addition, our chip includes two communication interfaces to the host machine: UART and SPI. This manuscript presents all steps and design techniques in the ASIC development process, from RTL design to fabrication and validation. We evaluate our chip with performance and power experiments and compare it against state-of-the-art software implementations and other ASIC designs. The developed RTL files are available in an open-source repository.
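The Number Theoretic Transform that CoFHEE accelerates is a DFT over Z_q: pointwise products in the transform domain give a cyclic convolution of polynomial coefficients (FHE schemes typically use the negacyclic variant mod x^n + 1, which adds a pre/post twist). A tiny illustrative version with q = 17 and n = 4, far below the chip's n = 2^14, using an O(n²) transform instead of a hardware butterfly network:

```python
# Number Theoretic Transform over Z_q and NTT-based polynomial multiplication.
# Parameters are toy values chosen so that a primitive N-th root of unity
# exists mod Q (13^4 = 1 and 13^2 = -1 mod 17).

Q, N = 17, 4
W = 13          # primitive N-th root of unity mod Q

def ntt(a, w):
    """O(n^2) forward transform; real designs use an O(n log n) butterfly."""
    return [sum(a[i] * pow(w, i * k, Q) for i in range(N)) % Q for k in range(N)]

def intt(a):
    """Inverse transform: forward NTT with w^-1, scaled by N^-1 mod Q."""
    n_inv = pow(N, Q - 2, Q)                 # N^-1 via Fermat's little theorem
    return [(x * n_inv) % Q for x in ntt(a, pow(W, Q - 2, Q))]

def poly_mul(a, b):
    """Cyclic convolution a*b mod (x^N - 1, Q) via pointwise NTT products."""
    fa, fb = ntt(a, W), ntt(b, W)
    return intt([x * y % Q for x, y in zip(fa, fb)])

a, b = [1, 2, 3, 4], [2, 0, 1, 0]
print(poly_mul(a, b))
```

Replacing the quadratic schoolbook convolution with this transform-multiply-invert pipeline is what makes on-chip ciphertext multiplication tractable at n = 2^13.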