
    A Survey on Low-Power Techniques with Emerging Technologies: From Devices to Systems

    Nowadays, power consumption is one of the main limitations of electronic systems. In this context, novel and emerging devices provide new opportunities to sustain the trend toward low-power design. In this survey paper, we present a transversal survey of energy-efficient techniques ranging from devices to architectures. Current trends in device research, with fully-depleted planar devices, tri-gate geometries and gate-all-around structures, allow us to reach increasingly higher levels of performance while reducing the associated power. In addition, beyond simple enhancements of device properties, emerging devices also lead to innovations at the circuit and architectural levels. In particular, devices whose properties can be tuned through additional terminals enable fine and dynamic control of the device threshold. They also enable designers to realize logic gates and to implement power-related techniques in a compact way that standard technologies cannot match. These innovations reduce power consumption at the gate level and unlock new means of actuation in architectural solutions such as adaptive voltage and frequency scaling.

    FDSOI Design using Automated Standard-Cell-Grained Body Biasing

    With the introduction of FDSOI processes at competitive technology nodes, body biasing on an unprecedented scale was made possible. Body biasing influences one of the central transistor characteristics, the threshold voltage. By raising or lowering the threshold voltage by more than 100 mV, the very physics of transistor switching can be manipulated at run time. Furthermore, as body biasing does not lead to different signal levels, it can be applied at a much finer granularity than, e.g., DVFS. Because the state of the art has mainly focused on combinations of body biasing with DVFS, it has ignored granularities that are unfeasible for DVFS. This thesis fills this gap by proposing body bias domain partitioning techniques and, for the body bias domain partitionings thereby generated, algorithms that search for body bias assignments. Several granularities, ranging from entire cores to small groups of standard cells, were examined using two principal approaches: a designer-aided, pre-partitioning-based determination of body bias domains, and a netlist-based approach, fully automated for the first time, called domain candidate exploration. Both approaches operate along the lines of activation and timing of standard cell groups. These approaches were evaluated using the example of a Dynamically Reconfigurable Processor (DRP), a highly efficient category of reconfigurable architectures which consists of an array of processing elements and thus offers many opportunities for generalization towards many-core architectures. Finally, the proposed methods were validated by manufacturing a test chip. Extensive simulation runs as well as the test-chip evaluation showed the validity of the proposed methods and indicated substantial improvements in energy efficiency compared to the state of the art. These improvements were accomplished by the fine-grained partitioning of the DRP design.
This method allowed dynamic power to be reduced through supply voltage levels that still yield higher clock frequencies under forward body biasing, while simultaneously reducing static power consumption in unused parts.
    The introduction of FDSOI processes at current process nodes has made body biasing usable on a scale never seen before. Body biasing influences, among other things, a central property of transistors, the threshold voltage. With body biasing, the threshold voltage can be raised or lowered by more than 100 mV, which makes it possible to manipulate the very physics of the switching process. Since the signal levels of the digital signals remain unaffected, this technique can also be applied at finer granularities than, for example, dynamic voltage and frequency scaling (DVFS). However, because the state of the art applies body biasing mainly in combination with DVFS, finer granularities that are no longer economically feasible for DVFS are not considered. This thesis closes this gap by proposing partitioning algorithms that divide a design into body bias domains and by computing corresponding body bias voltages for the resulting domains. Several granularities were considered, from entire processor cores down to small groups of standard cells. These designs were then partitioned using two different approaches: a designer-aided, pre-partitioning-based determination of body bias domains, and a netlist-based approach, fully automated for the first time, which this thesis calls domain candidate exploration. Both approaches work on the principle of activation, that is, which part of the design is active at which point in time, and of the signal propagation delay through the corresponding parts of the design. These approaches were evaluated using the example of Dynamically Reconfigurable Processors (DRPs). DRPs are a class of highly efficient reconfigurable architectures that consist mainly of an array of processing elements and therefore offer many opportunities for generalization towards many-core architectures. Finally, the proposed methods were validated on a test chip. Compared to the state of the art, all results show drastic improvements in energy efficiency, achieved through the fine-grained partitioning into body bias domains. Applying body biasing made higher clock frequencies possible at the same supply voltage, while at the same time minimizing static power consumption in timing-uncritical or unused parts of the design.
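As a rough illustration of why a threshold-voltage shift of about 100 mV matters for static power, the sketch below uses the textbook subthreshold relation in which leakage changes by one decade per subthreshold swing. The swing of 80 mV/decade and the function name `leakage_ratio` are assumptions chosen for illustration; they are not values or methods taken from the thesis.

```python
def leakage_ratio(delta_vth_mv: float, swing_mv_per_decade: float = 80.0) -> float:
    """Factor by which subthreshold leakage changes when Vth shifts by delta_vth_mv.

    A positive shift (threshold raised, e.g. by reverse body bias) reduces leakage;
    a negative shift (threshold lowered, e.g. by forward body bias) increases it.
    The default swing is an assumed, illustrative value.
    """
    return 10.0 ** (-delta_vth_mv / swing_mv_per_decade)


if __name__ == "__main__":
    for shift_mv in (-100.0, 0.0, 100.0):
        print(f"Vth shift {shift_mv:+.0f} mV -> leakage x{leakage_ratio(shift_mv):.3f}")
```

Under this assumed swing, raising the threshold in idle domains cuts their leakage by more than an order of magnitude, which is the lever the fine-grained domains exploit.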

    Compensation of Threshold Voltage for Process and Temperature Variations in 28nm UTBB FDSOI

    As technology scales down in order to meet demands for more computing power per area, a variety of challenges emerge. Devices with channel lengths of a few nanometers require atomic precision when they are manufactured. Small irregularities in the equipment and manufacturing environment can cause large die-to-die process variations as well as within-die variations. Along with the increasing density of transistors per die, which has led to severe performance issues due to temperature variations, these effects may seriously impact operation and cause large deviations in frequency and power across the chip. This thesis presents the analysis and design of a circuit that compensates the threshold voltage by means of body biasing in order to mitigate process and temperature variations. The compensation circuit is designed to provide adaptive body biasing for a large number of equally matched devices within the chip, which may be useful in digital systems with many repetitive instances. Its functionality and effect are tested by designing it for use with a 13-stage inverter-based ring oscillator operating at 65.5 MHz, and observing the improvement in frequency variation across process corners and a temperature range from -40 degrees Celsius to 80 degrees Celsius. All circuits were designed using a commercially available 28 nm FDSOI transistor technology because of its excellent susceptibility to body biasing and its promise as a competitive technology to continue Moore's law. Results obtained by post-layout simulations of the ring oscillator show that frequency variation across process corners and temperature was reduced from 18.69% to 0.632% by using the adaptive body biasing provided by the compensation circuit. Ring oscillator frequency temperature sensitivity over the range from -40 degrees Celsius to 80 degrees Celsius for the typical corner is shown to be as little as 29.4 ppm per degree Celsius.
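To put the reported temperature sensitivity in absolute terms, the short sketch below converts a ppm-per-degree figure into a frequency drift over a temperature span. It assumes the sensitivity is constant (linear) across the whole range, which the abstract does not state; the function name is hypothetical.

```python
def frequency_drift_hz(f_nominal_hz: float, tc_ppm_per_degc: float,
                       t_min_degc: float, t_max_degc: float) -> float:
    """Frequency change over a temperature span, assuming a constant sensitivity."""
    span_degc = t_max_degc - t_min_degc
    return f_nominal_hz * tc_ppm_per_degc * 1e-6 * span_degc


if __name__ == "__main__":
    # Figures from the abstract: 65.5 MHz oscillator, 29.4 ppm/degC, -40 to 80 degC.
    drift = frequency_drift_hz(65.5e6, 29.4, -40.0, 80.0)
    print(f"drift: {drift / 1e3:.1f} kHz ({100.0 * drift / 65.5e6:.3f} % of nominal)")
```

Under that linearity assumption the typical-corner drift is roughly 0.35% of the nominal frequency, which sits inside the 0.632% total variation reported across corners and temperature.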

    Voltage stacking for near/sub-threshold operation


    Low energy digital circuits in advanced nanometer technologies

    The demand for portable devices and the continuing trend towards the Internet of Things (IoT) have made energy consumption one of the main concerns of industry and researchers. The most efficient way of reducing the energy consumption of digital circuits is decreasing the supply voltage (Vdd), since the dynamic energy depends quadratically on Vdd. Several works have shown that an optimum supply voltage exists that minimizes the energy consumption of digital circuits. This optimum supply voltage is usually between 200 mV and 400 mV, depending on the circuit and technology used. To obtain these low supply voltages, on-chip dc-dc converters with high efficiency are needed. This thesis focuses on the study of subthreshold digital systems in advanced nanometer technologies. These systems can usually be divided into a Power Management Unit (PMU) and a digital circuit operating in the subthreshold regime. Within the PMU, one of the key circuits is the dc-dc converter. This block converts the voltage from the power source (battery, supercapacitor or wireless power transfer link) to a voltage between 200 mV and 400 mV in order to power the digital circuit. In this thesis, we developed two charge recycling techniques to improve the efficiency of switched-capacitor dc-dc converters. The first one is based on a technique used in adiabatic circuits called stepwise charging. This technique has been used in circuits and applications where the switching consumption of a large capacitance is very important. We analyzed the possibility of using this technique in switched-capacitor dc-dc converters with integrated capacitors. We showed through measurements that a 29% reduction in the gate drive losses can be obtained with this technique. The second one is a simplification of stepwise charging which can be applied in some architectures of switched-capacitor dc-dc converters. We also fabricated and tested a dc-dc converter with this technique and obtained a 25% energy reduction in the driving of the switches that implement the converter. Furthermore, we studied the digital circuit working in the subthreshold regime, in particular operating at the minimum energy point. We studied different models for circuits working in these conditions and improved them by considering the differences between the NMOS and PMOS transistors. We obtained an optimum NMOS/PMOS leakage current imbalance that minimizes the total leakage energy per operation. This optimum depends on the architecture of the digital circuit and the input data. However, we also showed that significant energy reductions can be obtained by operating at a mean optimum imbalance. We proposed two techniques to achieve the optimum imbalance. We used a 28 nm Fully Depleted Silicon on Insulator (FD-SOI) technology for most of the simulations, but we also show that these techniques can be applied in traditional bulk CMOS technologies. The first one consists of using the back plane voltage of the transistors (or bulk voltage in traditional CMOS) to independently adjust the leakage current of the NMOS and PMOS transistors so that the circuit works under the optimum NMOS/PMOS leakage current imbalance. We called this approach Optimum Back Plane Biasing (OBB). A second technique consists of using the length of the transistors to adjust this leakage current imbalance. In the subthreshold regime and in advanced nanometer technologies, a moderate increase in the length has little impact on the output capacitance of the gates and thus on the dynamic energy. We called this approach Asymmetric Length Biasing (ALB). Finally, we use these techniques in some basic circuits such as adders. We show that around 50% energy reduction can be obtained over a wide range of frequencies while working near the minimum energy point and using these techniques.
The main contributions of this thesis are:
• Analysis of the stepwise charging technique in small capacitances.
• Implementation of the stepwise charging technique as a charge recycling technique for efficiency improvement in switched-capacitor dc-dc converters.
• Development of a charge sharing technique for efficiency improvement in switched-capacitor dc-dc converters.
• Analysis of the minimum operating voltage of digital circuits due to intrinsic noise and the impact of technology scaling on this minimum.
• Improvement in the modeling of the minimum energy point by considering the differences between NMOS and PMOS transistors.
• Demonstration of the existence of an optimum leakage current imbalance between the NMOS and PMOS transistors that minimizes energy consumption in the subthreshold region.
• Development of a back plane (bulk) voltage strategy for working at this optimum.
• Development of a sizing strategy for working at the aforementioned optimum.
• Analysis of the impact of architecture and input data on the optimum imbalance.
The thesis is based on the publications [1–8]. During the Ph.D. program, other publications were generated [9–16] that are partially related to the thesis but were not included in it.
    The constant demand for portable devices and the advances towards the Internet of Things have made energy consumption one of the greatest challenges and concerns in industry and academia. The most efficient way to reduce the energy consumption of digital circuits is to reduce their supply voltage, since dynamic energy depends quadratically on that voltage. Several works have shown that there is an optimum supply voltage, called the minimum energy point, that minimizes the energy consumed to perform a given operation in a digital circuit. This optimum voltage usually lies between 200 mV and 400 mV, depending on the circuit and the technology used. To obtain these supply voltages from the energy source, highly efficient integrated dc-dc converters are needed. This thesis concentrates on the study of digital systems working in the subthreshold region and designed in advanced nanometer technologies (28 nm). These systems can usually be divided into two blocks: a power management block and the digital circuit operating in the subthreshold region. In the power management block, the most critical circuit is generally the dc-dc converter. This circuit converts the voltage of a battery (or supercapacitor, wireless power transfer link, or energy harvesting unit) into a voltage between 200 mV and 400 mV to power the digital circuit at its optimum voltage. In this thesis we developed two techniques that improve the efficiency of switched-capacitor dc-dc converters through charge recycling. The first is based on a technique used in adiabatic circuits called stepwise charging. This technique has been used in circuits and applications where the consumption due to charging and discharging a large capacitance is dominant. We analyzed the possibility of using this technique in switched-capacitor dc-dc converters with integrated capacitors. Measurements showed that the consumption due to turning the switches that implement the dc-dc converter on and off can be reduced by 29%. The second technique is a simplification of the first and can be applied in certain switched-capacitor dc-dc converter architectures. A converter using this technique was also fabricated and measured, and a 25% reduction was obtained in the energy consumed in driving the converter switches. In addition, we studied digital circuits operating in the subthreshold region, in particular near the minimum energy point. We studied different models for circuits operating under these conditions and improved them by considering the differences between NMOS and PMOS transistors. Using this model we showed that there is an optimum ratio between the leakage currents of the two transistors that minimizes the leakage energy consumed per operation. This optimum depends on the architecture of the digital circuit and also on its input data. However, we showed that consumption can be reduced considerably by operating at an average optimum. We proposed two techniques to reach the optimum ratio. We used a 28 nm FD-SOI technology for most of the simulations, but we also showed that these techniques can be used in conventional bulk technologies. The first technique uses the back-gate voltage (or substrate voltage in conventional CMOS) to adjust the NMOS and PMOS currents independently so that the circuit works at the optimum current ratio. We called this technique optimum back-gate voltage biasing. The second technique uses the transistor lengths to adjust the leakage current of each transistor and obtain the optimum ratio. In the subthreshold region and in advanced technologies, moderately increasing the transistor length has little impact on dynamic energy, which is why it can be used. Finally, we applied these techniques to basic circuits such as adders and showed that an energy reduction of approximately 50% can be obtained over a wide range of frequencies while these circuits work near the minimum energy point. The main contributions of the thesis are:
• Analysis of the stepwise charging technique in small capacitances.
• Implementation of the stepwise charging technique to improve the efficiency of switched-capacitor dc-dc converters.
• Simplification of the stepwise charging technique to improve efficiency in some switched-capacitor dc-dc converter architectures.
• Analysis of the minimum operating voltage of digital circuits due to intrinsic device noise, and of the impact of technology scaling on it.
• Improvements in the modeling of the minimum energy point of a digital circuit, taking into account the differences between the PMOS and NMOS transistors.
• Demonstration of the existence of an optimum ratio between the NMOS and PMOS leakage currents that minimizes the leakage energy consumed in the subthreshold region.
• Development of a back-gate voltage biasing strategy that lets the digital circuit work at the aforementioned optimum.
• Development of a sizing strategy for the transistors that make up the digital gates, which lets the digital circuit operate at the aforementioned optimum.
• Analysis of the impact of the circuit architecture and its input data on the aforementioned optimum.
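The existence of a minimum energy point can be illustrated with a simple first-order model. The sketch below is not the model developed in the thesis: it assumes the common textbook form in which dynamic energy scales with Vdd squared, while leakage energy per operation is leakage power multiplied by a gate delay that grows exponentially as Vdd drops in subthreshold. The activity factor, the lumped leakage factor, the slope factor n and the sweep limits are all illustrative assumptions.

```python
import math

THERMAL_VOLTAGE = 0.02585  # kT/q in volts near 300 K


def energy_per_op(vdd: float, activity: float = 0.1,
                  leak_factor: float = 80.0, n: float = 1.5) -> float:
    """Energy per operation, normalized to (switched capacitance x 1 V^2).

    dynamic: activity * Vdd^2
    leakage: leak_factor * Vdd^2 * exp(-Vdd / (n * vT)), i.e. leakage power times a
             subthreshold delay that grows exponentially as Vdd is lowered.
    All parameter values are assumed for illustration only.
    """
    dynamic = activity * vdd ** 2
    leakage = leak_factor * vdd ** 2 * math.exp(-vdd / (n * THERMAL_VOLTAGE))
    return dynamic + leakage


def minimum_energy_point(v_lo: float = 0.10, v_hi: float = 0.60,
                         step: float = 0.005) -> float:
    """Grid search for the supply voltage that minimizes energy per operation."""
    steps = round((v_hi - v_lo) / step)
    grid = [v_lo + i * step for i in range(steps + 1)]
    return min(grid, key=energy_per_op)


if __name__ == "__main__":
    v_opt = minimum_energy_point()
    print(f"minimum energy point near {v_opt * 1e3:.0f} mV")
```

With these assumed parameters the optimum falls inside the 200 mV to 400 mV window quoted in the abstract. Lowering the leakage term, as the back plane and length biasing techniques aim to do, moves the optimum and reduces the energy at that point.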

    Bascules à impulsion robustes en technologie 28nm FDSOI pour circuits numériques basse consommation à très large gamme de tension d'alimentation

    The explosive growth of the mobile application market and the paradigm of the Internet of Things have led to a huge demand for energy-efficient systems. To overcome the limits that bulk technology imposes on Moore's law, a new transistor technology has recently appeared in industrial processes: fully-depleted silicon on insulator, or FDSOI. In modern ASIC designs, a large portion of the total power consumption is due to the leaves of the clock tree: the flip-flops. Therefore, choosing the appropriate flip-flop architecture is a major decision for meeting the speed and energy constraints of mobile and ultra-low-power applications. After a thorough overview of the literature, the explicit pulse-triggered flip-flop topology is identified as a very interesting flip-flop architecture for high-speed and low-power systems. However, it is today used only in high-performance circuits, mainly because of its poor robustness at ultra-low voltage. In this work, the design of explicit pulse-triggered flip-flop architectures is developed and studied in order to improve their robustness and their energy efficiency. A broad comparison of resettable and scannable latch architectures is performed in the energy-delay domain by modifying the sizing of the transistors, both at nominal and at ultra-low voltage. Then, it is shown that the back biasing technique made possible by FDSOI technology provides better energy and delay performance than the sizing methodology. As the pulse generator is the main cause of functional failure, we propose a new architecture which provides both good robustness at ultra-low voltage and energy efficiency. A selected explicit pulse-triggered flip-flop topology was implemented in a 16x32b register file, which exhibits better speed, energy consumption and area than a version with master-slave flip-flops, mainly thanks to the sharing of the pulse generator over several latches.
    With the explosion of the market for portable applications and the paradigm of the Internet of Things, the demand for circuits with very high energy efficiency keeps growing. To push back the limits of Moore's law, a new technology has very recently appeared in industrial processes to replace bulk technology; it is called fully-depleted silicon on insulator, or FDSOI. In modern synchronous digital circuits, a large portion of the total power consumption comes from the clock tree, and in particular from its leaves: the flip-flops. The appropriate flip-flop architecture is therefore a crucial choice for meeting the speed and energy constraints of low-power applications. After a broad survey of the state of the art, explicit pulse-triggered flip-flops are identified as the most promising for systems that demand high performance and low power. However, this architecture is currently heavily used in high-performance circuits and practically absent from circuits with a low supply voltage, mainly because of its poor robustness to variations. In this work, the design of explicit pulse-triggered flip-flop architectures is studied with the aim of improving robustness and energy efficiency. A wide range of flip-flop architectures with reset and scan functions was compared in the energy-delay domain, at high and low supply voltage, using a transistor sizing methodology. It was shown that the back-bias technique, one of the main advantages of FDSOI technology, gives better energy and delay performance than the sizing methodology. Then, since the pulse generator is the main cause of malfunction, we proposed a new architecture that offers a very good trade-off between robustness at low voltage and energy consumption. An explicit pulse-triggered flip-flop topology was selected for implementation in a register file; compared with master-slave flip-flops, it offers higher speed, lower energy consumption and a smaller area.

    Etude de la variabilité en technologie FDSOI : du transistor aux cellules mémoires SRAM

    The scaling of bulk MOSFET transistors faces various difficulties in the nanometer era. The variability of the electrical characteristics becomes a major challenge, which increases as the device dimensions are scaled down. Fully-Depleted Silicon On Insulator (FDSOI) technology, developed as an alternative to bulk transistors, exhibits better electrostatic immunity, which enables higher performance. Moreover, because its channel is undoped, FDSOI technology reduces Random Dopant Fluctuation and therefore offers excellent variability immunity. This leads to a yield enhancement and a reduction of the minimum supply voltage of SRAM circuits. Variability in this technology was analyzed in depth during this thesis, both for the threshold voltage (VT) and for the ON-state current (ISAT). The correlation between the electrical characteristics of MOSFET devices (i.e., the threshold voltage and its standard deviation σVT) and of SRAM cells (i.e., the SNM and σSNM) was investigated through an extensive experimental study and modeling. The purpose of this thesis is also to analyze the variability source specific to FDSOI: silicon thickness fluctuations. An analytical model was developed in order to quantify the impact of local TSi variations on the VT variability for the 28 and 20 nm technology nodes, as well as on a 200 Mb SRAM array. This model also makes it possible to evaluate the silicon thickness mean (µTsi) and standard deviation (σTsi) specifications for the next technology nodes.
    The miniaturization of bulk-silicon MOSFET transistors raises many challenges because of the appearance of parasitic phenomena. In particular, reducing the device area degrades the variability of their electrical characteristics. Fully depleted planar technology, commonly called FDSOI (Fully Depleted Silicon on Insulator), improves the electrostatic control of the gate over the conduction channel and consequently optimizes performance. Moreover, thanks to the undoped channel, the variability of the transistor threshold voltage can be reduced effectively. This results in a better yield and a lower minimum supply voltage for SRAM (Static Random Access Memory) circuits. A detailed study of the variability intrinsic to this technology was carried out during this research, both on the threshold voltage (VT) and on the ON-state drain current (ISAT). In addition, the link between the fluctuations of the electrical characteristics of transistors and of SRAM circuits was analyzed experimentally in detail. Finally, a large part of this thesis is devoted to investigating the variability source specific to FDSOI technology: fluctuations in the thickness of the silicon film. An analytical model was developed during this thesis to study the influence of local TSi fluctuations on the threshold-voltage variability of transistors for the 28 and 20 nm technology nodes, as well as on a 200 Mb SRAM circuit. This model is also intended to provide specifications, in terms of uniformity σTsi and mean thickness µTsi of the silicon film, for the next technology nodes.

    A fully integrated 2:1 self-oscillating switched-capacitor DC-DC converter in 28 nm UTBB FD-SOI

    The importance of energy-constrained processors continues to grow, especially for ultra-portable sensor-based platforms for the Internet-of-Things (IoT). Processors for these IoT applications primarily operate at near-threshold (NT) voltages and have multiple power modes. Achieving high conversion efficiency within the DC–DC converter that supplies these processors is critical, since the energy consumption of the DC–DC/processor system is inversely proportional to the DC–DC converter efficiency. The DC–DC converter must maintain high efficiency over the large load range generated by the multiple power modes of the processor. This paper presents a fully integrated step-down self-oscillating switched-capacitor DC–DC converter that is capable of meeting these challenges. The converter occupies an area of 0.0104 mm² and is designed in 28 nm ultra-thin body and buried oxide fully-depleted SOI (UTBB FD-SOI). Back-gate biasing within FD-SOI is used to increase the load power range of the converter. With an input of 1 V and an output of 460 mV, measurements of the converter show a minimum efficiency of 75% for 79 nW to 200 µW loads. Measurements with an off-chip NT processor load show efficiency up to 86%. The converter's large load power range and high efficiency make it an excellent fit for energy-constrained processors.
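The reported efficiencies can be placed in context with two first-order relations, sketched below under stated assumptions. The first is the standard ideal bound for a switched-capacitor converter, whose efficiency cannot exceed the ratio of the actual output voltage to the no-load output (conversion ratio times input voltage) when switching and parasitic losses are ignored. The second is that the energy drawn from the source equals the load energy divided by the converter efficiency. The function names are hypothetical and the paper's own loss analysis is not reproduced here.

```python
def sc_ideal_efficiency(v_in: float, v_out: float, ratio: float = 0.5) -> float:
    """Upper bound on switched-capacitor converter efficiency (ideal, lossless switches).

    ratio is the ideal conversion ratio, 0.5 for a 2:1 step-down topology.
    """
    v_no_load = ratio * v_in
    if not 0.0 < v_out <= v_no_load:
        raise ValueError("v_out must be positive and no higher than ratio * v_in")
    return v_out / v_no_load


def source_energy(load_energy_j: float, efficiency: float) -> float:
    """Energy drawn from the source to deliver load_energy_j at a given efficiency."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in (0, 1]")
    return load_energy_j / efficiency


if __name__ == "__main__":
    bound = sc_ideal_efficiency(v_in=1.0, v_out=0.46)
    print(f"ideal bound at 1 V in, 460 mV out: {bound:.0%}")
    for eta in (0.75, 0.86):
        print(f"efficiency {eta:.0%}: source supplies {source_energy(1.0, eta):.3f} J per 1 J at the load")
```

For the 1 V input and 460 mV output quoted above, the ideal bound is 92%, so the measured 75% to 86% figures sit below that ceiling, as expected once switching losses are included.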

    Neurostream: Scalable and Energy Efficient Deep Learning with Smart Memory Cubes

    High-performance computing systems are moving towards 2.5D and 3D memory hierarchies, based on High Bandwidth Memory (HBM) and Hybrid Memory Cube (HMC) to mitigate the main memory bottlenecks. This trend is also creating new opportunities to revisit near-memory computation. In this paper, we propose a flexible processor-in-memory (PIM) solution for scalable and energy-efficient execution of deep convolutional networks (ConvNets), one of the fastest-growing workloads for servers and high-end embedded systems. Our co-design approach consists of a network of Smart Memory Cubes (modular extensions to the standard HMC) each augmented with a many-core PIM platform called NeuroCluster. NeuroClusters have a modular design based on NeuroStream coprocessors (for Convolution-intensive computations) and general-purpose RISC-V cores. In addition, a DRAM-friendly tiling mechanism and a scalable computation paradigm are presented to efficiently harness this computational capability with a very low programming effort. NeuroCluster occupies only 8 percent of the total logic-base (LoB) die area in a standard HMC and achieves an average performance of 240 GFLOPS for complete execution of full-featured state-of-the-art (SoA) ConvNets within a power budget of 2.5 W. Overall 11 W is consumed in a single SMC device, with 22.5 GFLOPS/W energy-efficiency which is 3.5X better than the best GPU implementations in similar technologies. The minor increase in system-level power and the negligible area increase make our PIM system a cost-effective and energy efficient solution, easily scalable to 955 GFLOPS with a small network of just four SMCs.
    Azarkhish, Erfan; Rossi, Davide; Loi, Igor; Benini, Luca