11 research outputs found

    Heterogeneous Chip Multiprocessor: Data Representation, Mixed-Signal Processing Tiles, and System Design

    Get PDF
    With the emergence of big data, the need for more computationally intensive processors that can handle the increased processing demand has risen. Conventional computing paradigms based on the Von Neumann model that separates computational and memory structures have become outdated and less efficient for this increased demand. As the speed and memory density of processors have increased significantly over the years, these models of computing, which rely on a constant stream of data between the processor and memory, see less gains due to finite bandwidth and latency. Moreover, in the presence of extreme scaling, these conventional systems, implemented in submicron integrated circuits, have become even more susceptible to process variability, static leakage current, and more. In this work, alternative paradigms, predicated on distributive processing with robust data representation and mixed-signal processing tiles, are explored for constructing more efficient and scalable computing systems in application specific integrated circuits (ASICs). The focus of this dissertation work has been on heterogeneous chip multi-processor (CMP) design and optimization across different levels of abstraction. On the level of data representation, a different modality of representation based on random pulse density modulation (RPDM) coding is explored for more efficient processing using stochastic computation. On the level of circuit description, mixed-signal integrated circuits that exploit charge-based computing for energy efficient fixed point arithmetic are designed. Consequently, 8 different chips that test and showcase these circuits were fabricated in submicron CMOS processes. Finally, on the architectural level of description, a compact instruction-set processor and controller that facilitates distributive computing on System-On-Chips (SoCs) is designed. In addition to this, a robust bufferless network architecture is designed with a network simulator, and I/O cells are designed for SoCs. The culmination of this thesis work has led to the design and fabrication of a heterogeneous chip multi- processor prototype comprised of over 12,000 VVM cores, warp/dewarp processors, cache, and additional processors, which can be applied towards energy efficient large-scale data processing

    Reclaiming Fault Resilience and Energy Efficiency With Enhanced Performance in Low Power Architectures

    Get PDF
    Rapid developments of the AI domain has revolutionized the computing industry by the introduction of state-of-art AI architectures. This growth is also accompanied by a massive increase in the power consumption. Near-Theshold Computing (NTC) has emerged as a viable solution by offering significant savings in power consumption paving the way for an energy efficient design paradigm. However, these benefits are accompanied by a deterioration in performance due to the severe process variation and slower transistor switching at Near-Threshold operation. These problems severely restrict the usage of Near-Threshold operation in commercial applications. In this work, a novel AI architecture, Tensor Processing Unit, operating at NTC is thoroughly investigated to tackle the issues hindering system performance. Research problems are demonstrated in a scientific manner and unique opportunities are explored to propose novel design methodologies

    Predicting power scalability in a reconfigurable platform

    Get PDF
    This thesis focuses on the evolution of digital hardware systems. A reconfigurable platform is proposed and analysed based on thin-body, fully-depleted silicon-on-insulator Schottky-barrier transistors with metal gates and silicide source/drain (TBFDSBSOI). These offer the potential for simplified processing that will allow them to reach ultimate nanoscale gate dimensions. Technology CAD was used to show that the threshold voltage in TBFDSBSOI devices will be controllable by gate potentials that scale down with the channel dimensions while remaining within appropriate gate reliability limits. SPICE simulations determined that the magnitude of the threshold shift predicted by TCAD software would be sufficient to control the logic configuration of a simple, regular array of these TBFDSBSOI transistors as well as to constrain its overall subthreshold power growth. Using these devices, a reconfigurable platform is proposed based on a regular 6-input, 6-output NOR LUT block in which the logic and configuration functions of the array are mapped onto separate gates of the double-gate device. A new analytic model of the relationship between power (P), area (A) and performance (T) has been developed based on a simple VLSI complexity metric of the form ATσ = constant. As σ defines the performance “return” gained as a result of an increase in area, it also represents a bound on the architectural options available in power-scalable digital systems. This analytic model was used to determine that simple computing functions mapped to the reconfigurable platform will exhibit continuous power-area-performance scaling behavior. A number of simple arithmetic circuits were mapped to the array and their delay and subthreshold leakage analysed over a representative range of supply and threshold voltages, thus determining a worse-case range for the device/circuit-level parameters of the model. Finally, an architectural simulation was built in VHDL-AMS. The frequency scaling described by σ, combined with the device/circuit-level parameters predicts the overall power and performance scaling of parallel architectures mapped to the array

    Energy-Efficient Receiver Design for High-Speed Interconnects

    Get PDF
    High-speed interconnects are of vital importance to the operation of high-performance computing and communication systems, determining the ultimate bandwidth or data rates at which the information can be exchanged. Optical interconnects and the employment of high-order modulation formats are considered as the solutions to fulfilling the envisioned speed and power efficiency of future interconnects. One common key factor in bringing the success is the availability of energy-efficient receivers with superior sensitivity. To enhance the receiver sensitivity, improvement in the signal-to-noise ratio (SNR) of the front-end circuits, or equalization that mitigates the detrimental inter-symbol interference (ISI) is required. In this dissertation, architectural and circuit-level energy-efficient techniques serving these goals are presented. First, an avalanche photodetector (APD)-based optical receiver is described, which utilizes non-return-to-zero (NRZ) modulation and is applicable to burst-mode operation. For the purposes of improving the overall optical link energy efficiency as well as the link bandwidth, this optical receiver is designed to achieve high sensitivity and high reconfiguration speed. The high sensitivity is enabled by optimizing the SNR at the front-end through adjusting the APD responsivity via its reverse bias voltage, along with the incorporation of 2-tap feedforward equalization (FFE) and 2-tap decision feedback equalization (DFE) implemented in current-integrating fashion. The high reconfiguration speed is empowered by the proposed integrating dc and amplitude comparators, which eliminate the RC settling time constraints. The receiver circuits, excluding the APD die, are fabricated in 28-nm CMOS technology. The optical receiver achieves bit-error-rate (BER) better than 1E−12 at −16-dBm optical modulation amplitude (OMA), 2.24-ns reconfiguration time with 5-dB dynamic range, and 1.37-pJ/b energy efficiency at 25 Gb/s. Second, a 4-level pulse amplitude modulation (PAM4) wireline receiver is described, which incorporates continuous time linear equalizers (CTLEs) and a 2-tap direct DFE dedicated to the compensation for the first and second post-cursor ISI. The direct DFE in a PAM4 receiver (PAM4-DFE) is made possible by the proposed CMOS track-and-regenerate slicer. This proposed slicer offers rail-to-rail digital feedback signals with significantly improved clock-to-Q delay performance. The reduced slicer delay relaxes the settling time constraint of the summer circuits and allows the stringent DFE timing constraint to be satisfied. With the availability of a direct DFE employing the proposed slicer, inductor-based bandwidth enhancement and loop-unrolling techniques, which can be power/area intensive, are not required. Fabricated in 28-nm CMOS technology, the PAM4 receiver achieves BER better than 1E−12 and 1.1-pJ/b energy efficiency at 60 Gb/s, measured over a channel with 8.2-dB loss at Nyquist frequency. Third, digital neural-network-enhanced FFEs (NN-FFEs) for PAM4 analog-to-digital converter (ADC)-based optical interconnects are described. The proposed NN-FFEs employ a custom learnable piecewise linear (PWL) activation function to tackle the nonlinearities with short memory lengths. In contrast to the conventional Volterra equalizers where multipliers are utilized to generate the nonlinear terms, the proposed NN-FFEs leverage the custom PWL activation function for nonlinear operations and reduce the required number of multipliers, thereby improving the area and power efficiencies. Applications in the optical interconnects based on micro-ring modulators (MRMs) are demonstrated with simulation results of 50-Gb/s and 100-Gb/s links adopting PAM4 signaling. The proposed NN-FFEs and the conventional Volterra equalizers are synthesized with the standard-cell libraries in a commercial 28-nm CMOS technology, and their power consumptions and performance are compared. Better than 37% lower power overhead can be achieved by employing the proposed NN-FFEs, in comparison with the Volterra equalizer that leads to similar improvement in the symbol-error-rate (SER) performance.</p

    Exploring Spin-transfer-torque devices and memristors for logic and memory applications

    Get PDF
    As scaling CMOS devices is approaching its physical limits, researchers have begun exploring newer devices and architectures to replace CMOS. Due to their non-volatility and high density, Spin Transfer Torque (STT) devices are among the most prominent candidates for logic and memory applications. In this research, we first considered a new logic style called All Spin Logic (ASL). Despite its advantages, ASL consumes a large amount of static power; thus, several optimizations can be performed to address this issue. We developed a systematic methodology to perform the optimizations to ensure stable operation of ASL. Second, we investigated reliable design of STT-MRAM bit-cells and addressed the conflicting read and write requirements, which results in overdesign of the bit-cells. Further, a Device/Circuit/Architecture co-design framework was developed to optimize the STT-MRAM devices by exploring the design space through jointly considering yield enhancement techniques at different levels of abstraction. Recent advancements in the development of memristive devices have opened new opportunities for hardware implementation of non-Boolean computing. To this end, the suitability of memristive devices for swarm intelligence algorithms has enabled researchers to solve a maze in hardware. In this research, we utilized swarm intelligence of memristive networks to perform image edge detection. First, we proposed a hardware-friendly algorithm for image edge detection based on ant colony. Next, we designed the image edge detection algorithm using memristive networks

    Degradation Models and Optimizations for CMOS Circuits

    Get PDF
    Die Gewährleistung der Zuverlässigkeit von CMOS-Schaltungen ist derzeit eines der größten Herausforderungen beim Chip- und Schaltungsentwurf. Mit dem Ende der Dennard-Skalierung erhöht jede neue Generation der Halbleitertechnologie die elektrischen Felder innerhalb der Transistoren. Dieses stärkere elektrische Feld stimuliert die Degradationsphänomene (Alterung der Transistoren, Selbsterhitzung, Rauschen, usw.), was zu einer immer stärkeren Degradation (Verschlechterung) der Transistoren führt. Daher erleiden die Transistoren in jeder neuen Technologiegeneration immer stärkere Verschlechterungen ihrer elektrischen Parameter. Um die Funktionalität und Zuverlässigkeit der Schaltung zu wahren, wird es daher unerlässlich, die Auswirkungen der geschwächten Transistoren auf die Schaltung präzise zu bestimmen. Die beiden wichtigsten Auswirkungen der Verschlechterungen sind ein verlangsamtes Schalten, sowie eine erhöhte Leistungsaufnahme der Schaltung. Bleiben diese Auswirkungen unberücksichtigt, kann die verlangsamte Schaltgeschwindigkeit zu Timing-Verletzungen führen (d.h. die Schaltung kann die Berechnung nicht rechtzeitig vor Beginn der nächsten Operation abschließen) und die Funktionalität der Schaltung beeinträchtigen (fehlerhafte Ausgabe, verfälschte Daten, usw.). Um diesen Verschlechterungen der Transistorparameter im Laufe der Zeit Rechnung zu tragen, werden Sicherheitstoleranzen eingeführt. So wird beispielsweise die Taktperiode der Schaltung künstlich verlängert, um ein langsameres Schaltverhalten zu tolerieren und somit Fehler zu vermeiden. Dies geht jedoch auf Kosten der Performanz, da eine längere Taktperiode eine niedrigere Taktfrequenz bedeutet. Die Ermittlung der richtigen Sicherheitstoleranz ist entscheidend. Wird die Sicherheitstoleranz zu klein bestimmt, führt dies in der Schaltung zu Fehlern, eine zu große Toleranz führt zu unnötigen Performanzseinbußen. Derzeit verlässt sich die Industrie bei der Zuverlässigkeitsbestimmung auf den schlimmstmöglichen Fall (maximal gealterter Schaltkreis, maximale Betriebstemperatur bei minimaler Spannung, ungünstigste Fertigung, etc.). Diese Annahme des schlimmsten Falls garantiert, dass der Chip (oder integrierte Schaltung) unter allen auftretenden Betriebsbedingungen funktionsfähig bleibt. Darüber hinaus ermöglicht die Betrachtung des schlimmsten Falles viele Vereinfachungen. Zum Beispiel muss die eigentliche Betriebstemperatur nicht bestimmt werden, sondern es kann einfach die schlimmstmögliche (sehr hohe) Betriebstemperatur angenommen werden. Leider lässt sich diese etablierte Praxis der Berücksichtigung des schlimmsten Falls (experimentell oder simulationsbasiert) nicht mehr aufrechterhalten. Diese Berücksichtigung bedingt solch harsche Betriebsbedingungen (maximale Temperatur, etc.) und Anforderungen (z.B. 25 Jahre Betrieb), dass die Transistoren unter den immer stärkeren elektrischen Felder enorme Verschlechterungen erleiden. Denn durch die Kombination an hoher Temperatur, Spannung und den steigenden elektrischen Feldern bei jeder Generation, nehmen die Degradationphänomene stetig zu. Das bedeutet, dass die unter dem schlimmsten Fall bestimmte Sicherheitstoleranz enorm pessimistisch ist und somit deutlich zu hoch ausfällt. Dieses Maß an Pessimismus führt zu erheblichen Performanzseinbußen, die unnötig und demnach vermeidbar sind. Während beispielsweise militärische Schaltungen 25 Jahre lang unter harschen Bedingungen arbeiten müssen, wird Unterhaltungselektronik bei niedrigeren Temperaturen betrieben und muss ihre Funktionalität nur für die Dauer der zweijährigen Garantie aufrechterhalten. Für letzteres können die Sicherheitstoleranzen also deutlich kleiner ausfallen, um die Performanz deutlich zu erhöhen, die zuvor im Namen der Zuverlässigkeit aufgegeben wurde. Diese Arbeit zielt darauf ab, maßgeschneiderte Sicherheitstoleranzen für die einzelnen Anwendungsszenarien einer Schaltung bereitzustellen. Für fordernde Umgebungen wie Weltraumanwendungen (wo eine Reparatur unmöglich ist) ist weiterhin der schlimmstmögliche Fall relevant. In den meisten Anwendungen, herrschen weniger harsche Betriebssbedingungen (z.B. sorgen Kühlsysteme für niedrigere Temperaturen). Hier können Sicherheitstoleranzen maßgeschneidert und anwendungsspezifisch bestimmt werden, sodass Verschlechterungen exakt toleriert werden können und somit die Zuverlässigkeit zu minimalen Kosten (Performanz, etc.) gewahrt wird. Leider sind die derzeitigen Standardentwurfswerkzeuge für diese anwendungsspezifische Bestimmung der Sicherheitstoleranz nicht gut gerüstet. Diese Arbeit zielt darauf ab, Standardentwurfswerkzeuge in die Lage zu versetzen, diesen Bedarf an Zuverlässigkeitsbestimmungen für beliebige Schaltungen unter beliebigen Betriebsbedingungen zu erfüllen. Zu diesem Zweck stellen wir unsere Forschungsbeiträge als vier Schritte auf dem Weg zu anwendungsspezifischen Sicherheitstoleranzen vor: Schritt 1 verbessert die Modellierung der Degradationsphänomene (Transistor-Alterung, -Selbsterhitzung, -Rauschen, etc.). Das Ziel von Schritt 1 ist es, ein umfassendes, einheitliches Modell für die Degradationsphänomene zu erstellen. Durch die Verwendung von materialwissenschaftlichen Defektmodellierungen werden die zugrundeliegenden physikalischen Prozesse der Degradationsphänomena modelliert, um ihre Wechselwirkungen zu berücksichtigen (z.B. Phänomen A kann Phänomen B beschleunigen) und ein einheitliches Modell für die simultane Modellierung verschiedener Phänomene zu erzeugen. Weiterhin werden die jüngst entdeckten Phänomene ebenfalls modelliert und berücksichtigt. In Summe, erlaubt dies eine genaue Degradationsmodellierung von Transistoren unter gleichzeitiger Berücksichtigung aller essenziellen Phänomene. Schritt 2 beschleunigt diese Degradationsmodelle von mehreren Minuten pro Transistor (Modelle der Physiker zielen auf Genauigkeit statt Performanz) auf wenige Millisekunden pro Transistor. Die Forschungsbeiträge dieser Dissertation beschleunigen die Modelle um ein Vielfaches, indem sie zuerst die Berechnungen so weit wie möglich vereinfachen (z.B. sind nur die Spitzenwerte der Degradation erforderlich und nicht alle Werte über einem zeitlichen Verlauf) und anschließend die Parallelität heutiger Computerhardware nutzen. Beide Ansätze erhöhen die Auswertungsgeschwindigkeit, ohne die Genauigkeit der Berechnung zu beeinflussen. In Schritt 3 werden diese beschleunigte Degradationsmodelle in die Standardwerkzeuge integriert. Die Standardwerkzeuge berücksichtigen derzeit nur die bestmöglichen, typischen und schlechtestmöglichen Standardzellen (digital) oder Transistoren (analog). Diese drei Typen von Zellen/Transistoren werden von der Foundry (Halbleiterhersteller) aufwendig experimentell bestimmt. Da nur diese drei Typen bestimmt werden, nehmen die Werkzeuge keine Zuverlässigkeitsbestimmung für eine spezifische Anwendung (Temperatur, Spannung, Aktivität) vor. Simulationen mit Degradationsmodellen ermöglichen eine Bestimmung für spezifische Anwendungen, jedoch muss diese Fähigkeit erst integriert werden. Diese Integration ist eines der Beiträge dieser Dissertation. Schritt 4 beschleunigt die Standardwerkzeuge. Digitale Schaltungsentwürfe, die nicht auf Standardzellen basieren, sowie komplexe analoge Schaltungen können derzeit nicht mit analogen Schaltungssimulatoren ausgewertet werden. Ihre Performanz reicht für solch umfangreiche Simulationen nicht aus. Diese Dissertation stellt Techniken vor, um diese Werkzeuge zu beschleunigen und somit diese umfangreichen Schaltungen simulieren zu können. Diese Forschungsbeiträge, die sich jeweils über mehrere Veröffentlichungen erstrecken, ermöglichen es Standardwerkzeugen, die Sicherheitstoleranz für kundenspezifische Anwendungsszenarien zu bestimmen. Für eine gegebene Schaltungslebensdauer, Temperatur, Spannung und Aktivität (Schaltverhalten durch Software-Applikationen) können die Auswirkungen der Transistordegradation ausgewertet werden und somit die erforderliche (weder unter- noch überschätzte) Sicherheitstoleranz bestimmt werden. Diese anwendungsspezifische Sicherheitstoleranz, garantiert die Zuverlässigkeit und Funktionalität der Schaltung für genau diese Anwendung bei minimalen Performanzeinbußen

    CIRCUITS AND ARCHITECTURE FOR BIO-INSPIRED AI ACCELERATORS

    Get PDF
    Technological advances in microelectronics envisioned through Moore’s law have led to powerful processors that can handle complex and computationally intensive tasks. Nonetheless, these advancements through technology scaling have come at an unfavorable cost of significantly larger power consumption, which has posed challenges for data processing centers and computers at scale. Moreover, with the emergence of mobile computing platforms constrained by power and bandwidth for distributed computing, the necessity for more energy-efficient scalable local processing has become more significant. Unconventional Compute-in-Memory architectures such as the analog winner-takes-all associative-memory and the Charge-Injection Device processor have been proposed as alternatives. Unconventional charge-based computation has been employed for neural network accelerators in the past, where impressive energy efficiency per operation has been attained in 1-bit vector-vector multiplications, and in recent work, multi-bit vector-vector multiplications. In the latter, computation was carried out by counting quanta of charge at the thermal noise limit, using packets of about 1000 electrons. These systems are neither analog nor digital in the traditional sense but employ mixed-signal circuits to count the packets of charge and hence we call them Quasi-Digital. By amortizing the energy costs of the mixed-signal encoding/decoding over compute-vectors with many elements, high energy efficiencies can be achieved. In this dissertation, I present a design framework for AI accelerators using scalable compute-in-memory architectures. On the device level, two primitive elements are designed and characterized as target computational technologies: (i) a multilevel non-volatile cell and (ii) a pseudo Dynamic Random-Access Memory (pseudo-DRAM) bit-cell. At the level of circuit description, compute-in-memory crossbars and mixed-signal circuits were designed, allowing seamless connectivity to digital controllers. At the level of data representation, both binary and stochastic-unary coding are used to compute Vector-Vector Multiplications (VMMs) at the array level. Finally, on the architectural level, two AI accelerator for data-center processing and edge computing are discussed. Both designs are scalable multi-core Systems-on-Chip (SoCs), where vector-processor arrays are tiled on a 2-layer Network-on-Chip (NoC), enabling neighbor communication and flexible compute vs. memory trade-off. General purpose Arm/RISCV co-processors provide adequate bootstrapping and system-housekeeping and a high-speed interface fabric facilitates Input/Output to main memory

    Proceedings of the 21st Conference on Formal Methods in Computer-Aided Design – FMCAD 2021

    Get PDF
    The Conference on Formal Methods in Computer-Aided Design (FMCAD) is an annual conference on the theory and applications of formal methods in hardware and system verification. FMCAD provides a leading forum to researchers in academia and industry for presenting and discussing groundbreaking methods, technologies, theoretical results, and tools for reasoning formally about computing systems. FMCAD covers formal aspects of computer-aided system design including verification, specification, synthesis, and testing

    Modeling and optimization of Tunnel-FET architectures exploiting carrier gas dimensionality

    Get PDF
    The semiconductor industry, governed by the Moore's law, has achieved the almost unbelievable feat of exponentially increasing performance while lowering the costs for years. The main enabler for this achievement has been the scaling of the CMOS transistor that allowed the manufacturers to pack more and more functionality into the same chip area. However, it is now widely agreed that the happy days of scaling are well over and we are about to reach the physical limits of the CMOS concept. One major, insurmountable limit of CMOS is the so-called thermionic emission limit which dictates that the switching slope of the transistor cannot go below 60mV/dec at room temperature. This makes it impossible to scale down the supply voltage for CMOS transistor without dramatically increasing the static power consumption. To address this issue, a novel transistor concept called Tunnel FET (TFET) which utilizes the quantum mechanical band-to-band tunneling (BTBT) has been proposed. TFETs possess the potential to overcome the thermionic emission limit and therefore allow for low supply voltage operation. This thesis aims at investigating the performance of TFETs with alternative architectures exploiting quantized carrier gases through quantum mechanical simulations. To this end, 1D and 2D self-consistent Schrödinger-Poisson solvers with closed boundaries are developed along with the phonon-assisted and direct BTBT models implemented as a post-processing step. Moreover, we propose an efficient method to incorporate the quantization along the transverse direction which enables us to simulate different dimensionality combinations. The implemented models are calibrated against experimental and more fundamental quantum mechanical simulation methods such as k.p and tight-binding NEGF using tunneling diode structures. Using these tools, we simulate an advanced TFET architecture called electron-hole bilayer TFET (EHBTFET) which exploits BTBT between 2D electron and hole gases electrostatically induced by two separate oppositely biased gates. The subband-to-subband tunneling is first analyzed with the 1D simulator where the device working principle is demonstrated. Then, non-idealities of the EHBTFET operation such as the lateral tunneling and corner effects are investigated using the 2D simulator. The origin of the lateral leakage and techniques to reduce it are analyzed in detail. A parameter space analysis of the EHBTFET is performed by simulating a wide range of channel materials, channel thickness and oxide thicknesses. Our results indicate the possibility of having 2D-2D and 3D-3D tunneling for the EHBTFET, depending on the parameters chosen. A novel digital logic scheme utilizing the independent biasing property of the EHBTFET n- and p-gates is proposed and verified through quantum-corrected TCAD simulations. The performance benchmarking against a 28nm FD-SOI CMOS technology is performed as well. The results indicate that the EHBTFET logic can outperform the CMOS counterpart in the low supply voltage (subthreshold) regime, where it can offer significantly higher drive current due to its steep switching slope. We also compare the different dimensionality cases and highlight important differences between the face and edge tunneling devices in terms of their dependence on the device parameters (channel material, channel thickness and EOT)
    corecore