3,061 research outputs found

    Resolving the Memory Bottleneck for Single Supply Near-Threshold Computing

    Get PDF
    This paper reviews state-of-the-art memory designs and new design methods for near-threshold computing (NTC). In particular, it presents new ways to design reliable low-voltage NTC memories cost-effectively by reusing available cell libraries or by adding a digital wrapper around existing commercially available memories. The approach is based on system-level modeling supported by silicon measurements on a test chip in a 40 nm low-power process technology. Advanced monitoring, control, and run-time error mitigation schemes enable these memories to operate at the same optimal near-Vt voltage as the digital logic. Reliability degradation is thus overcome, which opens the way to solving the memory bottleneck in NTC systems. Starting from the available 40 nm silicon measurements, the analysis is extended to future 14 and 10 nm technology nodes.
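    The abstract mentions a digital wrapper with run-time error mitigation around commodity memories but gives no implementation details. As a minimal stand-alone sketch (not the authors' scheme), the Python model below wraps plain memory arrays with triple modular redundancy and a bitwise majority vote on read, masking the kind of single-bit upset a near-Vt bitcell failure can cause; all names and sizes are assumptions.

```python
# Toy "digital wrapper" for run-time error mitigation: store each word in three
# memory banks (standing in for commodity SRAM macros) and majority-vote on read.

class TmrSramWrapper:
    def __init__(self, depth: int, width: int):
        self.width = width
        self.banks = [[0] * depth for _ in range(3)]   # three redundant copies

    def write(self, addr: int, data: int) -> None:
        for bank in self.banks:
            bank[addr] = data & ((1 << self.width) - 1)

    def read(self, addr: int) -> int:
        a, b, c = (bank[addr] for bank in self.banks)
        # bitwise majority vote: a bit is 1 if at least two copies agree on 1
        return (a & b) | (a & c) | (b & c)

# usage: a bit flipped in one bank is silently corrected on read
mem = TmrSramWrapper(depth=16, width=8)
mem.write(3, 0b10110001)
mem.banks[1][3] ^= 0b00010000          # inject a single-bit error in one copy
assert mem.read(3) == 0b10110001
```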

    Tackling Choke Point Induced Performance Bottlenecks in a Near-Threshold GPGPU

    Get PDF
    Over the last decade, General Purpose Graphics Processing Units (GPGPUs) have garnered substantial attention in the research community due to their extensive thread-level parallelism. GPGPUs provide a remarkable performance improvement over Central Processing Units (CPUs) for highly parallel applications. However, GPGPUs typically achieve this extensive thread-level parallelism at the cost of large power consumption. Consequently, Near-Threshold Computing (NTC) provides a promising opportunity for designing energy-efficient GPGPUs (NTC-GPUs). However, NTC-GPUs suffer from a crucial Process Variation (PV)-inflicted performance bottleneck known as the choke point. A choke point is a gate, or a small group of gates, affected by PV in a way that alters the path delay of the circuit and causes different forms of timing violations. In this work, a cross-layer design technique is proposed to tackle the performance impediments caused by choke points in NTC-GPUs.
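    To make the choke-point idea concrete, the toy Python model below (not the paper's model; delays, variation, and slack are assumed values) samples per-gate delays with process variation and shows how a single degraded gate can push an otherwise timing-clean path past the clock period.

```python
# Toy model: one PV-degraded gate ("choke point") on a logic path at NTC.

import random

def path_delay(nominal_delays, sigma_frac, choke_idx=None, choke_slowdown=1.0):
    """Sum per-gate delays with random PV; one gate may get an extra slowdown."""
    total = 0.0
    for i, d in enumerate(nominal_delays):
        d_var = d * random.gauss(1.0, sigma_frac)     # process-variation sample
        if i == choke_idx:
            d_var *= choke_slowdown                   # choke-point degradation
        total += d_var
    return total

random.seed(0)
gates = [10.0] * 20            # 20-gate path, 10 ps nominal per gate (assumed)
clock_period = 210.0           # ps, ~5% slack over the 200 ps nominal path

clean = [path_delay(gates, sigma_frac=0.05) for _ in range(1000)]
choked = [path_delay(gates, sigma_frac=0.05, choke_idx=7, choke_slowdown=1.6)
          for _ in range(1000)]

print("violations w/o choke point:", sum(d > clock_period for d in clean))
print("violations with choke point:", sum(d > clock_period for d in choked))
```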

    Principles of Neuromorphic Photonics

    Full text link
    In an age overrun with information, the ability to process reams of data has become crucial. The demand for data will continue to grow as smart gadgets multiply and become increasingly integrated into our daily lives. Next-generation industries in artificial intelligence services and high-performance computing are so far supported by microelectronic platforms. These data-intensive enterprises rely on continual improvements in hardware. Their prospects are running up against a stark reality: conventional one-size-fits-all solutions offered by digital electronics can no longer satisfy this need, as Moore's law (exponential hardware scaling), interconnection density, and the von Neumann architecture reach their limits. With its superior speed and reconfigurability, analog photonics can provide some relief to these problems; however, complex applications of analog photonics have remained largely unexplored due to the absence of a robust photonic integration industry. Recently, the landscape for commercially manufacturable photonic chips has been changing rapidly and now promises to achieve economies of scale previously enjoyed solely by microelectronics. The scientific community has set out to build bridges between the domains of photonic device physics and neural networks, giving rise to the field of neuromorphic photonics. This article reviews the recent progress in integrated neuromorphic photonics. We provide an overview of neuromorphic computing, discuss the associated technology (microelectronic and photonic) platforms, and compare their performance metrics. We discuss photonic neural network approaches and challenges for integrated neuromorphic photonic processors while providing an in-depth description of photonic neurons and a candidate interconnection architecture. We conclude with a future outlook of neuro-inspired photonic processing. Comment: 28 pages, 19 figures.
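    The article's photonic neurons broadly follow a weight-sum-nonlinearity structure, with weights applied to WDM channels (for example by microring weight banks in a broadcast-and-weight style interconnect) and summation on a photodetector. The sketch below captures only that abstract structure; the parameters and the sigmoid-like transfer function are illustrative assumptions, not the article's device model.

```python
# Static toy model of a photonic neuron: weighted optical sum + saturating nonlinearity.

import math

def photonic_neuron(input_powers, weights, bias=0.0):
    # microring-like weights are bounded transmissions in [-1, 1]
    # (negative values modeling balanced photodetection)
    assert all(-1.0 <= w <= 1.0 for w in weights)
    photocurrent = sum(w * p for w, p in zip(weights, input_powers)) + bias
    # saturating electro-optic transfer function (assumed sigmoid-like)
    return 1.0 / (1.0 + math.exp(-photocurrent))

# a 2-neuron layer sharing the same WDM inputs, one weight vector per neuron
inputs = [0.8, 0.2, 0.5]                      # normalized channel powers
layer = [[0.9, -0.4, 0.1], [-0.2, 0.7, 0.3]]
print([photonic_neuron(inputs, w) for w in layer])
```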

    PyDac: A Distributed Runtime System and Programming Model for a Heterogeneous Many-Core Architecture

    Get PDF
    Heterogeneous many-core architectures that consist of big, fast cores and small, energy-efficient cores are very promising for future high-performance computing (HPC) systems. These architectures offer a good balance between single-threaded performance and multithreaded throughput. Such systems impose challenges on the design of the programming model and runtime system. Specifically, these challenges include (a) how to fully utilize the chip's performance, (b) how to manage heterogeneous, unreliable hardware resources, and (c) how to generate and manage a large number of parallel tasks. This dissertation proposes and evaluates a Python-based programming framework called PyDac. PyDac supports a two-level programming model. At the high level, a programmer creates a very large number of tasks using the divide-and-conquer strategy. At the low level, tasks are written in an imperative programming style. The runtime system seamlessly manages the parallel tasks, system resilience, and inter-task communication with architecture support. PyDac has been implemented on both a field-programmable gate array (FPGA) emulation of an unconventional heterogeneous architecture and a conventional multicore microprocessor. To evaluate the performance, resilience, and programmability of the proposed system, several micro-benchmarks were developed. We found that (a) PyDac abstracts away task communication and improves programmability, (b) the micro-benchmarks are scalable on the hardware prototype, although (predictably) serial operation limits some of them, and (c) the degree of protection versus speed could be varied in redundant threading that is transparent to programmers.
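    PyDac's actual API is not shown in the abstract; as a rough sketch of its two-level model (divide-and-conquer task creation at the high level, imperative code inside each task), the example below uses the standard library's process pool as a stand-in runtime. The threshold, function names, and workload are assumptions, not PyDac code.

```python
# Sketch of a two-level divide-and-conquer programming model, PyDac-style.

from concurrent.futures import ProcessPoolExecutor

THRESHOLD = 10_000   # below this size, a task is solved directly (low level)

def task_sum(data):
    """Low-level task: plain imperative code over a small chunk."""
    total = 0
    for x in data:
        total += x
    return total

def divide(data):
    """High level: recursively split the problem into many small tasks."""
    if len(data) <= THRESHOLD:
        return [data]
    mid = len(data) // 2
    return divide(data[:mid]) + divide(data[mid:])

if __name__ == "__main__":
    values = list(range(100_000))
    with ProcessPoolExecutor() as pool:
        partials = pool.map(task_sum, divide(values))   # runtime schedules tasks
        print(sum(partials))                            # combine step: 4999950000
```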

    Voltage and Retention Storage Allocation Problems for Static RAM and Power-Gated Circuits

    Get PDF
    Thesis (Ph.D.) -- Graduate School of Seoul National University: College of Engineering, Department of Electrical and Computer Engineering, August 2021. Taewhan Kim.
    Low-power operation of a chip is an important issue, and its importance grows as process technology advances. This dissertation addresses methodologies for low-power operation of both the SRAM and the logic that make up a chip. First, we propose a methodology to infer, from measurements of a monitoring circuit, the minimum operating voltage at which no SRAM failure occurs in any SRAM block of a chip operating in the near-threshold voltage (NTV) regime. Operating a chip in the NTV regime is one of the most effective ways to increase energy efficiency, but for SRAM it is difficult to lower the operating voltage because of SRAM failures. However, since process variation differs from chip to chip, the minimum operating voltage also differs for each chip; if it can be inferred through monitoring, energy efficiency can be increased by applying a different SRAM voltage to each chip. Specifically, (1) we propose to infer the minimum operating voltage of SRAM during design-infrastructure development and to assign the voltage from SRAM-monitor measurements during silicon production; (2) we define an SRAM monitor, and the features it monitors, that captures process variation on SRAM blocks including the SRAM bitcells and the peripheral circuits; (3) we propose a methodology for inferring the minimum operating voltage at which the SRAM blocks in a chip exhibit no read, write, or access failures under a target confidence level. Experiments with benchmark circuits confirm that applying to each chip's SRAM blocks the voltage inferred by the proposed methodology reduces the power consumption of the SRAM bitcell arrays compared to applying the same voltage to all chips, while meeting the same yield target.
    Second, we propose a methodology that resolves a shortcoming of conventional retention storage allocation methods and thereby further reduces the leakage power of power-gated circuits. Conventional methods cannot fully exploit multi-bit retention storage because they unavoidably allocate retention storage to every flip-flop with a mux-feedback loop. We propose a new methodology that breaks this bottleneck in minimizing state retention storage. Specifically, (1) we find a condition under which a mux-feedback loop can be disregarded during retention storage allocation; (2) using this condition, we minimize the retention storage of circuits that contain many flip-flops with mux-feedback loops; (3) we find a condition under which part of the retention storage already allocated to flip-flops can be removed, reducing the retention storage further. Experiments with benchmark circuits confirm that the proposed methodology allocates less retention storage than state-of-the-art methods, occupying less cell area and consuming less power.
    Table of contents:
    1 Introduction
      1.1 Low Voltage SRAM Monitoring Methodology
      1.2 Retention Storage Allocation on Power Gated Circuit
      1.3 Contributions of this Dissertation
    2 SRAM On-Chip Monitoring Methodology for High Yield and Energy Efficient Memory Operation at Near Threshold Voltage
      2.1 SRAM Failures
        2.1.1 Read Failure
        2.1.2 Write Failure
        2.1.3 Access Failure
        2.1.4 Hold Failure
      2.2 SRAM On-chip Monitoring Methodology: Bitcell Variation
        2.2.1 Overall Flow
        2.2.2 SRAM Monitor and Monitoring Target
        2.2.3 Vfail to Vddmin Inference
      2.3 SRAM On-chip Monitoring Methodology: Peripheral Circuit IR Drop and Variation
        2.3.1 Consideration of IR Drop
        2.3.2 Consideration of Peripheral Circuit Variation
        2.3.3 Vddmin Prediction including Access Failure Prohibition
      2.4 Experimental Results
        2.4.1 Vddmin Considering Read and Write Failures
        2.4.2 Vddmin Considering Read/Write and Access Failures
        2.4.3 Observation for Practical Use
    3 Allocation of Always-On State Retention Storage for Power Gated Circuits - Steady State Driven Approach
      3.1 Motivations and Analysis
        3.1.1 Impact of Self-loop on Power Gating
        3.1.2 Circuit Behavior Before Sleeping
        3.1.3 Wakeup Latency vs. Retention Storage
      3.2 Steady State Driven Retention Storage Allocation
        3.2.1 Extracting Steady State Self-loop FFs
        3.2.2 Allocating State Retention Storage
        3.2.3 Designing and Optimizing Steady State Monitoring Logic
        3.2.4 Analysis of the Impact of Steady State Monitoring Time on the Standby Power
      3.3 Retention Storage Refinement Utilizing Steadiness
        3.3.1 Extracting Flip-flops for Retention Storage Refinement
        3.3.2 Designing State Monitoring Logic and Control Signals
      3.4 Experimental Results
        3.4.1 Comparison of State Retention Storage
        3.4.2 Comparison of Power Consumption
        3.4.3 Impact on Circuit Performance
        3.4.4 Support for Immediate Power Gating
    4 Conclusions
      4.1 Chapter 2
      4.2 Chapter 3
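    The dissertation infers each chip's minimum SRAM operating voltage (Vddmin) from on-chip monitor measurements under a target confidence level. Its actual inference model is not reproduced here; the sketch below (Python 3.10+, fabricated calibration data, names assumed) illustrates the general idea with a simple linear fit from a monitor metric to the measured failing voltage plus a confidence-level guard band.

```python
# Toy per-chip Vddmin inference: monitor reading -> predicted Vfail + guard band.

from statistics import NormalDist, linear_regression, stdev

# hypothetical calibration set: (monitor reading, measured failing voltage in V)
calib = [(0.42, 0.58), (0.47, 0.61), (0.51, 0.63), (0.55, 0.66),
         (0.60, 0.70), (0.63, 0.71), (0.68, 0.75), (0.72, 0.78)]

xs = [m for m, _ in calib]
ys = [v for _, v in calib]
slope, intercept = linear_regression(xs, ys)    # least-squares fit

# residual spread of the fit, used to size the guard band
resid = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
sigma = stdev(resid)

def infer_vddmin(monitor_reading, confidence=0.999):
    """Predicted Vfail plus a guard band covering `confidence` of the residuals."""
    guard = NormalDist().inv_cdf(confidence) * sigma
    return slope * monitor_reading + intercept + guard

print(round(infer_vddmin(0.58), 3))   # per-chip Vddmin for one monitor reading
```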