Recommended from our members

Very-Large-Scale-Integration Circuit Techniques in Internet-of-Things Applications

Author: Li Jiangyi
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2018

Heading towards the era of Internet-of-things (IoT) means both opportunity and challenge for the circuit-design community. In a system where billions of devices are equipped with the ability to sense, compute, communicate with each other and perform tasks in a coordinated manner, security and power management are among the most critical challenges. The physically unclonable function (PUF) has emerged as an important security primitive in hardware-security applications; it provides an object-specific physical identifier hidden within the intrinsic device variations, which is hard for adversaries to expose and reproduce. Yet, designing a compact PUF robust to noise, temperature and voltage remains a challenge. This thesis presents a novel PUF design approach based on a pair of ultra-compact analog circuits whose output is proportional to absolute temperature. The proposed approach is demonstrated through two works: (1) an ultra-compact and robust PUF based on voltage-compensated proportional-to-absolute-temperature voltage generators that occupies 8.3× less area than previous work of similar robustness and offers twice the robustness of the previously most compact PUF design, and (2) a technique to transform a 6T-SRAM array into a robust analog PUF with minimal overhead. In the latter work, a similar circuit topology is used to transform a preexisting on-chip SRAM into a PUF, which further reduces the area in (1) with no robustness penalty. In this thesis, we also explore techniques for power management circuit design. Energy harvesting is an essential functionality in an IoT sensor node, where battery replacement is cost-prohibitive or impractical. Yet, existing energy-harvesting power management units (EH PMU) suffer from efficiency loss in the two-step voltage conversion: harvester-to-battery and battery-to-load. 
We propose an EH PMU architecture with hybrid energy storage, where a capacitor is introduced in addition to the battery to serve as an intermediate energy buffer to minimize the battery involvement in the system energy flow. Test-case measurements show as much as a 2.2× improvement in the end-to-end energy efficiency. In contrast, with the drastically reduced power consumption of IoT nodes that operate in the sub-threshold regime, adaptive dynamic voltage scaling (DVS) for supply-voltage margin removal, fully on-chip integration and high power conversion efficiency (PCE) are required in PMU designs. We present a PMU–load co-design based on a fully integrated switched-capacitor DC-DC converter (SC-DC) and hybrid error/replica-based regulation for a fully digital PMU control. The PMU is integrated with a neural spike processor (NSP) that achieves a record-low power consumption of 0.61 µW for 96 channels. A tunable replica circuit is added to assist the error regulation and prevent loss of regulation. With automatic energy-robustness co-optimization, the PMU can set the SC-DC’s optimal conversion ratio and switching frequency. The PMU achieves a PCE of 77.7% (72.2%) at VIN = 0.6 V (1 V) and at the NSP’s margin-free operating point.
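
The PUF idea in this abstract (a pair of circuits whose outputs are both proportional to absolute temperature) can be illustrated with a toy model. The sketch below is not the thesis's circuit: it assumes an ideal PTAT output V = slope × T, hypothetical slope values with random mismatch, and a bit defined as the sign of the difference within a pair. Under those assumptions the bit does not change with temperature, because both voltages scale by the same factor.

```python
import random

def ptat_voltage_mv(slope_mv_per_k, temp_k):
    """Ideal PTAT model (assumption): output scales linearly with absolute temperature."""
    return slope_mv_per_k * temp_k

def puf_bits(slopes_a, slopes_b, temp_k):
    """One bit per pair: 1 if generator A reads higher than generator B, else 0."""
    return [
        1 if ptat_voltage_mv(a, temp_k) > ptat_voltage_mv(b, temp_k) else 0
        for a, b in zip(slopes_a, slopes_b)
    ]

# Hypothetical device parameters: nominal slope with small random mismatch.
rng = random.Random(0)
NOMINAL_SLOPE = 0.2   # mV/K, illustrative value only
MISMATCH_SIGMA = 0.01 # mV/K, illustrative value only
slopes_a = [rng.gauss(NOMINAL_SLOPE, MISMATCH_SIGMA) for _ in range(16)]
slopes_b = [rng.gauss(NOMINAL_SLOPE, MISMATCH_SIGMA) for _ in range(16)]

bits_cold = puf_bits(slopes_a, slopes_b, 233.15)  # about -40 °C
bits_hot = puf_bits(slopes_a, slopes_b, 398.15)   # about 125 °C
print("bits stable across temperature:", bits_cold == bits_hot)
```

A real design also has to cope with noise and supply-voltage variation, which this model leaves out entirely.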

Columbia University Academic Commons

Charge Trap Transistors (CTT): Turning Logic Transistors into Embedded Non-Volatile Memory for Advanced High-k/Metal Gate CMOS Technologies

Author: Khan Faraz
Publication venue: eScholarship, University of California
Publication date: 01/01/2019

While the need for embedded non-volatile memory (eNVM) in modern computing systems continues to grow rapidly, the options have been limited due to integration and scaling challenges as well as operational voltage incompatibilities. Introduced in this work is a unique multi-time programmable memory (MTPM) solution for advanced high-k/metal-gate (HKMG) CMOS technologies which turns as-fabricated standard logic transistors into eNVM elements, without the need for any process adders or additional masks. These logic transistors, when employed as eNVM elements, are dubbed “Charge Trap Transistors” (CTTs). The fundamental device physics, principles of operation, and technological breakthroughs required for employing logic transistors as eNVM are presented. Implementation of CTT eNVM in 32 nm, 22 nm, 14 nm, and 7 nm production technologies has been realized and demonstrated in this work. The emerging memory technology landscape and the space that the CTT technology occupies therein are examined. The motivation behind this work is to develop an eNVM technology that is completely process/mask-free, multi-time programmable, operable at low/logic-compatible voltages, scalable, and secure. The CTT technology satisfies all of the aforementioned criteria. CTTs offer a data retention lifetime of > 10 years at 125 °C and an operating temperature range of -55 °C to 125 °C. Hardware results demonstrate an endurance of > 10^4 program/erase cycles, which is more than adequate for most embedded applications. Hardware security enhancement, on-chip reconfigurable encryption, firmware, BIOS, chip ID, redundancy, repair at wafer and module test and in the field, performance tailoring, and chip configuration are a few of the applications of CTT eNVM. 
Moreover, the CTT array in its native (unprogrammed) state measures very well as an entropy source for potential PUF (Physically Unclonable Function) applications such as identification, authentication, anti-counterfeiting, secure boot, and cryptographic IP. In addition to the numerous digital applications, CTTs can also be utilized as an analog memory for applications like neuromorphic computing for machine learning (ML) and artificial intelligence (AI).

eScholarship - University of California

In-situ and In-field temperature and transistor BTI sensing techniques with microprocessor level implementation

Author: Yang Teng
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2022

In modern deep-scaled CMOS technologies, various silicon-related pitfalls present challenges to the long-term performance of microprocessors. Such challenges include (1) local hot spots, which breach the thermal limitations of a microprocessor, and (2) transistor aging, especially NBTI, which degrades transistor threshold voltage, ultimately threatening the reliability of the entire memory block. In previous systems, a dummy circuit was placed next to the target circuit, the dummy was frequently analyzed, and its readout was used to infer the condition of the target. Due to rapidly changing ambient conditions (e.g., temperature and voltage) and the potential scale of the target dimensions, such metrics may not accurately represent the condition of the target. Moreover, such temperature sensors and canary circuits occupy a significant area. Therefore, it would be highly preferable to monitor the target circuit in-situ, i.e., to sense the precise transistor in operation. It is also important to achieve an accurate sensing metric. When the temperature is analyzed, the readout should account for voltage and process variations. While sensing the aging degradation, the readout should account for voltage and temperature fluctuations. This would allow testing during in-field operation, while the circuits achieve area-efficiency. This research had two stages. One result of the first stage was a silicon test chip that was a compact temperature sensor. It involved a family of PTAT+CTAT sensor front-ends that utilized only 6 to 8 conventional CMOS logic devices, yielding a smaller chip. The sensor demonstrates accuracy within the target and achieves a 14.3× smaller footprint than preceding published designs. The second product of the first stage was a PMOS aging sensor used in 6T SRAM circuits. The test chip has a real SRAM array, integrated with the proposed PMOS NBTI sensor. 
It can sense real PMOS NBTI effects in any bit cell (in-situ) and provide robust readings of temperature and voltage (in-field). Intensive aging tests validated the proposed sensing technique. The second stage was focused on implementing the in-situ and in-field sensing techniques in a real processor. The MIPS microprocessor had a modified instruction cache (I$) and instruction set architecture. With the addition of a new aging-sensing instruction and minor modifications to the circuits, the processor can execute aging sensing opportunistically to evaluate the aging level of its instruction cache. A software framework was developed and verified to estimate the retention voltage of the instruction cache over the lifetime of the chip. An area-efficient SoC was developed that could transform the instruction cache into an ambient temperature sensor. It had a physically unclonable function (PUF), and it was built with an area-saving technique similar to the earlier work. This thesis has four chapters. They are presented in chronological order and are aligned with the research described above.

Columbia University Academic Commons

RESISTIVE RAM BASED MAIN-MEMORY SYSTEMS: UNDERSTANDING THE OPPORTUNITIES, LIMITATIONS, AND TRADEOFFS

Author: Jagasivamani Meenatchi
Publication date: 01/01/2020

As DRAM faces scaling issues as a high-density memory, emerging technologies are being explored as alternatives. One promising candidate is Resistive Memory (ReRAM), which is scalable, vertically stackable, and, because of the possibility of integration with a standard logic process, can deliver higher density as a main-memory solution. The key differentiator with this approach involves a ReRAM memory array that integrates directly with a logic processor underneath. In this research work, I explore ReRAM as a main-memory alternative at three levels of detail – at the device level, the physical-design level, and finally at the architecture level. I begin with an overview of ReRAM and compare it with alternative technologies. I look at the physical design of the solution and present the results of area studies on integrating a VSCALE processor at the 45 nm technology node with a ReRAM bit-cell array. The area study was performed based on parameters specified by my collaborators at Crossbar Inc. The results showed that the optimum operating point is at 50% array efficiency with a VSCALE processor, and that this configuration incurs an area penalty of 18%. Two of the key challenges for ReRAM with respect to DRAM performance include the higher write latency requirement (typically on the order of 1 µs) and the lower write endurance (typically less than 10^8 cycles). This compares with DRAM write-latency times of less than 30 ns (depending on technology node and generation) and write endurance of more than 10^15 write cycles. In this research work, I explore the possibility of utilizing the ReRAM cell in an intermediate state between non-volatile state and threshold state, where I intentionally trade off data retention for a much lower write energy. This allows the chip to more easily replace existing DRAM-like main memory applications, without requiring higher write programming current or accommodating a longer write latency. 
I performed this evaluation both at the device level and at the architecture level. At the device level, I used UMD’s Nano-fab lab to construct metal-oxide-based ReRAM bitcells on which I characterized the relationship between data retention and the write current applied. My fabricated ReRAM was composed of titanium oxide and aluminum oxide. I also confirmed the behavior of a mixed-volatility state where a formed filament relaxes over time to move to a high-resistance level. Based on my experimental measurements, operating in the mixed-volatility state would reduce write energy by 10 to 100×, and thereby improve the write endurance. Finally, at the architecture level, I used the Structural Simulation Toolkit (SST) to characterize a ReRAM-based main-memory system and compare it with a DRAM-based one using our research group’s DRAMSIM3 tool. I also characterized the sensitivity of system performance to various architectural parameters (core-to-memory controller ratio, queue depth, NoC topology) on stream and gups-based graph benchmarks, which indicated that the torus topology will provide reasonable performance. Varying the number of parallel processors indicated that at low processor counts, DRAM outperforms ReRAM due to its faster memory latency. However, at high processor counts, ReRAM with its higher number of parallel connections is able to deliver higher system performance than DRAM.
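
To put the latency and endurance figures quoted in this abstract side by side, here is a back-of-envelope sketch. It treats the abstract's bounds (about 1 µs and under 10^8 cycles for ReRAM, under 30 ns and over 10^15 cycles for DRAM) as point values, and it assumes a worst case that is purely illustrative: one cell written back to back with no wear leveling.

```python
# Figures taken from the abstract, used here as point values (they are bounds there).
RERAM_WRITE_LATENCY_S = 1e-6    # "on the order of 1 us"
DRAM_WRITE_LATENCY_S = 30e-9    # "less than 30 ns"
RERAM_ENDURANCE_CYCLES = 1e8    # "less than 10^8 cycles"
DRAM_ENDURANCE_CYCLES = 1e15    # "more than 10^15 write cycles"

def seconds_to_wear_out(endurance_cycles, write_latency_s):
    """Time to exhaust one cell if it is written continuously (illustrative worst case)."""
    return endurance_cycles * write_latency_s

latency_ratio = RERAM_WRITE_LATENCY_S / DRAM_WRITE_LATENCY_S
endurance_gap = DRAM_ENDURANCE_CYCLES / RERAM_ENDURANCE_CYCLES
reram_wearout_s = seconds_to_wear_out(RERAM_ENDURANCE_CYCLES, RERAM_WRITE_LATENCY_S)
dram_wearout_s = seconds_to_wear_out(DRAM_ENDURANCE_CYCLES, DRAM_WRITE_LATENCY_S)

print(f"ReRAM write latency is about {latency_ratio:.0f}x the DRAM figure")
print(f"DRAM endurance exceeds ReRAM by about {endurance_gap:.0e}x")
print(f"Continuous writes exhaust a ReRAM cell in about {reram_wearout_s:.0f} s")
print(f"Continuous writes exhaust a DRAM cell in about {dram_wearout_s / 86400:.0f} days")
```

Under these assumptions the endurance gap matters far more than the latency gap, which is consistent with the abstract's interest in lowering write energy to improve endurance.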

Digital Repository at the University of Maryland

ENERGY-EFFICIENT FEATURE EXTRACTION ENGINE AND SECURE CHIP IDENTIFICATION FOR UBIQUITOUS SURVEILLANCE

Author: ANASTACIA ALVAREZ
Publication date: 29/12/2016

Ph.D (Doctor of Philosophy)

ScholarBank@NUS

Deep in-memory computing

Author: Kang Mingu
Publication date: 01/08/2017

There is much interest in embedding data analytics into sensor-rich platforms such as wearables, biomedical devices, autonomous vehicles, robots, and Internet-of-Things to provide these with decision-making capabilities. Such platforms often need to implement machine learning (ML) algorithms under stringent energy constraints with battery-powered electronics. In particular, energy consumption in memory subsystems dominates such a system's energy efficiency. In addition, the memory access latency is a major bottleneck for overall system throughput. To address these issues in memory-intensive inference applications, this dissertation proposes the deep in-memory accelerator (DIMA), which deeply embeds computation into the memory array, employing two key principles: (1) accessing and processing multiple rows of the memory array at a time, and (2) embedding pitch-matched low-swing analog processing at the periphery of the bitcell array. The signal-to-noise ratio (SNR) is budgeted by employing low-swing operations in both memory read and processing to exploit the application level's error immunity for aggressive energy efficiency. This dissertation first describes the system rationale underlying the DIMA's processing stages by identifying the common functional flow across a diverse set of inference algorithms. Based on the analysis, this dissertation presents a multi-functional DIMA to support four algorithms: support vector machine (SVM), template matching (TM), k-nearest neighbor (k-NN), and matched filter. The circuit and architectural level design techniques and guidelines are provided to address the challenges in achieving multi-functionality. A prototype integrated circuit (IC) of a multi-functional DIMA was fabricated with a 16 KB SRAM array in a 65 nm CMOS process. 
Measurement results show up to 5.6X and 5.8X energy and delay reductions leading to 31X energy delay product (EDP) reduction with negligible (<1%) accuracy degradation as compared to the conventional 8-b fixed-point digital implementation optimally designed for each algorithm. Then, DIMA also has been applied to more complex algorithms: (1) convolutional neural network (CNN), (2) sparse distributed memory (SDM), and (3) random forest (RF). System-level simulations of CNN using circuit behavioral models in a 45 nm SOI CMOS demonstrate that high probability (>0.99) of handwritten digit recognition can be achieved using the MNIST database, along with a 24.5X reduced EDP, a 5.0X reduced energy, and a 4.9X higher throughput as compared to the conventional system. The DIMA-based SDM architecture also achieves up to 25X and 12X delay and energy reductions, respectively, over conventional SDM with negligible accuracy degradation (within 0.4%) for 16X16 binary-pixel image classification. A DIMA-based RF was realized as a prototype IC with a 16 KB SRAM array in a 65 nm process. To the best of our knowledge, this is the first IC realization of an RF algorithm. The measurement results show that the prototype achieves a 6.8X lower EDP compared to a conventional design at the same accuracy (94%) for an eight-class traffic sign recognition problem. The multi-functional DIMA and extension to other algorithms naturally motivated us to consider a programmable DIMA instruction set architecture (ISA), namely MATI. This dissertation explores a synergistic combination of the instruction set, architecture and circuit design to achieve the programmability without losing DIMA's energy and throughput benefits. Employing silicon-validated energy, delay and behavioral models of deep in-memory components, we demonstrate that MATI is able to realize nine ML benchmarks while incurring negligible overhead in energy (< 0.1%), and area (4.5%), and in throughput, over a fixed four-function DIMA. 
In this process, MATI is able to simultaneously achieve enhancements in both energy (2.5X to 5.5X) and throughput (1.4X to 3.4X) for an overall EDP improvement of up to 12.6X over fixed-function digital architectures.
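
The energy-delay product (EDP) figures in this abstract can be cross-checked with simple arithmetic: EDP is energy times delay, so an EDP reduction is the product of the energy reduction and the delay reduction. The sketch below assumes that a throughput gain can be read as a delay reduction of the same factor, and that figures quoted as "up to" may come from different operating points, in which case their product is only an upper bound on the reported EDP gain.

```python
def edp_reduction(energy_reduction, delay_reduction):
    """EDP = energy x delay, so the reduction factors multiply."""
    return energy_reduction * delay_reduction

# CNN simulation: 5.0X energy, 4.9X throughput (read here as a 4.9X delay reduction).
cnn = edp_reduction(5.0, 4.9)

# Multi-functional prototype: "up to" 5.6X energy and 5.8X delay.
prototype_bound = edp_reduction(5.6, 5.8)

# MATI: "up to" 5.5X energy and 3.4X throughput.
mati_bound = edp_reduction(5.5, 3.4)

print(f"CNN: {cnn:.1f}X (abstract reports 24.5X)")
print(f"Prototype upper bound: {prototype_bound:.1f}X (abstract reports 31X)")
print(f"MATI upper bound: {mati_bound:.1f}X (abstract reports up to 12.6X)")
```

The CNN figures multiply out to the reported value, while the two "up to" cases land above the reported EDP gains, as expected if the best energy and best delay numbers do not occur together.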

Illinois Digital Environment for Access to Learning and Scholarship Repository

Compilers for portable programming of heterogeneous parallel & approximate computing systems

Author: Srivastava Prakalp
Publication date: 01/05/2019

Programming heterogeneous systems such as the System-on-chip (SoC) processors in modern mobile devices can be extremely complex because a single system may include multiple different parallelism models, instruction sets, and memory hierarchies, and different systems use different combinations of these features. This is further complicated by software and hardware approximate computing optimizations. Different compute units on an SoC use different approximate computing methods, and an application would usually be composed of multiple compute kernels, each one specialized to run on different hardware. Determining how best to map such an application to a modern heterogeneous system is an open research problem. First, we propose a parallel abstraction of heterogeneous hardware that is a carefully chosen combination of well-known parallel models and is able to capture the parallelism in a wide range of popular parallel hardware. This abstraction uses a hierarchical dataflow graph with side effects and vector SIMD instructions. We use this abstraction to define a parallel program representation called HPVM that aims to address both functional portability and performance portability across heterogeneous systems. Second, we further extend the HPVM representation to enable accuracy-aware performance and energy tuning on heterogeneous systems with multiple compute units and approximation methods. We call it ApproxHPVM, and it automatically translates end-to-end application-level accuracy constraints into accuracy requirements for individual operations. ApproxHPVM uses a hardware-agnostic accuracy-tuning phase to do this translation, which greatly speeds up the analysis, enables greater portability, and enables future capabilities like accuracy-aware dynamic scheduling and design space exploration. 
We have implemented a prototype HPVM system, defining the HPVM IR as an extension of the LLVM compiler IR, compiler optimizations that operate directly on HPVM graphs, and code generators that translate the virtual ISA to NVIDIA GPUs, Intel’s AVX vector units, and multicore x86-64 processors. Experimental results show that HPVM optimizations achieve significant performance improvements, HPVM translators achieve performance competitive with manually developed OpenCL code for both GPUs and vector hardware, and that runtime scheduling policies can make use of both program and runtime information to exploit the flexible compilation capabilities. Furthermore, our evaluation of ApproxHPVM shows that our framework can offload chunks of approximable computations to special purpose accelerators that provide significant gains in performance and energy, while staying within a user-specified application-level accuracy constraint with high probability.

Illinois Digital Environment for Access to Learning and Scholarship Repository

Energy efficient core designs for upcoming process technologies

Author: Gopi Reddy Bhargava Reddy
Publication date: 01/05/2019

Energy efficiency has been a first-order constraint in the design of microprocessors for the last decade. As Moore's law sunsets, new technologies are being actively explored to extend the march in increasing the computational power and efficiency. It is essential for computer architects to understand the opportunities and challenges in utilizing the upcoming process technology trends in order to design the most efficient processors. In this work, we consider three process technology trends and propose core designs that are best suited for each of the technologies. The process technologies are expected to be viable over a span of timelines. We first consider the most popular method currently available to improve the energy efficiency, i.e. by lowering the operating voltage. We make key observations regarding the limiting factors in scaling down the operating voltage for general purpose high performance processors. Later, we propose our novel core design, ScalCore, one that can work in high performance mode at nominal Vdd, and in a very energy-efficient mode at low Vdd. The resulting core design can operate at much lower voltages providing higher parallel performance while consuming lower energy. While lowering Vdd improves the energy efficiency, CMOS devices are fundamentally limited in their low voltage operation. Therefore, we next consider an upcoming device technology -- Tunneling Field-Effect Transistors (TFETs), that is expected to supplement CMOS device technology in the near future. TFETs can attain much higher energy efficiency than CMOS at low voltages. However, their performance saturates at high voltages and, therefore, cannot entirely replace CMOS when high performance is needed. Ideally, we desire a core that is as energy-efficient as TFET and provides as much performance as CMOS. To reach this goal, we characterize the TFET device behavior for core design and judiciously integrate TFET units and CMOS units in a single core. 
The resulting core, called HetCore, can provide very high energy efficiency while limiting the slowdown when compared to a CMOS core. Finally, we analyze Monolithic 3D (M3D) integration technology that is widely considered to be the only way to integrate more transistors on a chip. We present the first analysis of the architectural implications of using M3D for core design and show how to partition the core across different layers. We also address one of the key challenges in realizing the technology, namely, the top layer performance degradation. We propose a critical path based partitioning for logic stages and asymmetric bit/port partitioning for storage stages. The result is a core that performs nearly as well as a core without any top layer slowdown. When compared to a 2D baseline design, an M3D core not only provides much higher performance, it also reduces the energy consumption at the same time. In summary, this thesis addresses one of the fundamental challenges in computer architecture -- overcoming the fact that CMOS is not scaling anymore. As we increase the computing power on a single chip, our ability to power the entire chip keeps decreasing. This thesis proposes three solutions aimed at solving this problem over different timelines. Across all our solutions, we improve energy efficiency without compromising the performance of the core. As a result, we are able to operate twice as many cores within the same power budget as regular cores, significantly alleviating the problem of dark silicon.

Illinois Digital Environment for Access to Learning and Scholarship Repository