63 research outputs found

    An Experimental Study of Reduced-Voltage Operation in Modern FPGAs for Neural Network Acceleration

    Get PDF
    We empirically evaluate an undervolting technique, i.e., underscaling the circuit supply voltage below the nominal level, to improve the power-efficiency of Convolutional Neural Network (CNN) accelerators mapped to Field Programmable Gate Arrays (FPGAs). Undervolting below a safe voltage level can lead to timing faults due to excessive circuit latency increase. We evaluate the reliability-power trade-off for such accelerators. Specifically, we experimentally study the reduced-voltage operation of multiple components of real FPGAs, characterize the corresponding reliability behavior of CNN accelerators, propose techniques to minimize the drawbacks of reduced-voltage operation, and combine undervolting with architectural CNN optimization techniques, i.e., quantization and pruning. We investigate the effect of environmental temperature on the reliability-power trade-off of such accelerators. We perform experiments on three identical samples of modern Xilinx ZCU102 FPGA platforms with five state-of-the-art image classification CNN benchmarks. This approach allows us to study the effects of our undervolting technique for both software and hardware variability. We achieve more than 3X power-efficiency (GOPs/W) gain via undervolting. 2.6X of this gain is the result of eliminating the voltage guardband region, i.e., the safe voltage region below the nominal level that is set by FPGA vendor to ensure correct functionality in worst-case environmental and circuit conditions. 43% of the power-efficiency gain is due to further undervolting below the guardband, which comes at the cost of accuracy loss in the CNN accelerator. We evaluate an effective frequency underscaling technique that prevents this accuracy loss, and find that it reduces the power-efficiency gain from 43% to 25%.Comment: To appear at the DSN 2020 conferenc

    Pathfinding Future PIM Architectures by Demystifying a Commercial PIM Technology

    Full text link
    Processing-in-memory (PIM) has been explored for decades by computer architects, yet it has never seen the light of day in real-world products due to their high design overheads and lack of a killer application. With the advent of critical memory-intensive workloads, several commercial PIM technologies have been introduced to the market ranging from domain-specific PIM architectures to more general-purpose PIM architectures. In this work, we deepdive into UPMEM's commercial PIM technology, a general-purpose PIM-enabled parallel architecture that is highly programmable. Our first key contribution is the development of a flexible simulation framework for PIM. The simulator we developed (aka PIMulator) enables the compilation of UPMEM-PIM source codes into its compiled machine-level instructions, which are subsequently consumed by our cycle-level performance simulator. Using PIMulator, we demystify UPMEM's PIM design through a detailed characterization study. Building on top of our characterization, we conduct a series of case studies to pathfind important architectural features that we deem will be critical for future PIM architectures to suppor

    High-Performance Energy-Efficient and Reliable Design of Spin-Transfer Torque Magnetic Memory

    Get PDF
    In this dissertation new computing paradigms, architectures and design philosophy are proposed and evaluated for adopting the STT-MRAM technology as highly reliable, energy efficient and fast memory. For this purpose, a novel cross-layer framework from the cell-level all the way up to the system- and application-level has been developed. In these framework, the reliability issues are modeled accurately with appropriate fault models at different abstraction levels in order to analyze the overall failure rates of the entire memory and its Mean Time To Failure (MTTF) along with considering the temperature and process variation effects. Design-time, compile-time and run-time solutions have been provided to address the challenges associated with STT-MRAM. The effectiveness of the proposed solutions is demonstrated in extensive experiments that show significant improvements in comparison to state-of-the-art solutions, i.e. lower-power, higher-performance and more reliable STT-MRAM design

    A Compiler Target Model for Line Associative Registers

    Get PDF
    LARs (Line Associative Registers) are very wide tagged registers, used for both register-wide SWAR (SIMD Within a Register )operations and scalar operations on arbitrary fields. LARs include a large data field, type tags, source addresses, and a dirty bit, which allow them to not only replace both caches and registers in the conventional memory hierarchy, but improve on both their functions. This thesis details a LAR-based architecture, and describes the design of a compiler which can generate code for a LAR-based design. In particular, type conversion, alignment, and register allocation are discussed in detail

    Vers la Compression à Tous les Niveaux de la Hiérarchie de la Mémoire

    Get PDF
    Hardware compression techniques are typically simplifications of software compression methods. They must, however, comply with area, power and latency constraints. This study unveils the challenges of adopting compression in memory design. The goal of this analysis is not to summarize proposals, but to put in evidence the solutions they employ to handle those challenges. An in-depth description of the main characteristics of multiple methods is provided, as well as criteria that can be used as a basis for the assessment of such schemes.Typically, these schemes are not very efficient, and those that do compress well decompress slowly. This work explores their granularity to redefine their perspectives and improve their efficiency, through a concept called Region-Chunk compression. Its goal is to achieve low (good) compression ratio and fast decompression latency. The key observation is that by further sub-dividing the chunks of data being compressed one can reduce data duplication. This concept can be applied to several previously proposed compressors, resulting in a reduction of their average compressed size. In particular, a single-cycle-decompression compressor is boosted to reach a compressibility level competitive to state-of-the-art proposals.Finally, to increase the probability of successfully co-allocating compressed lines, Pairwise Space Sharing (PSS) is proposed. PSS can be applied orthogonally to compaction methods at no extra latency penalty, and with a cost-effective metadata overhead. The proposed system (Region-Chunk+PSS) further enhances the normalized average cache capacity by 2.7% (geometric mean), while featuring short decompression latency.Les techniques de compression matérielle sont généralement des simplifications des méthodes de compression logicielle. Elles doivent, toutefois, se conformer aux contraintes de surface, de puissance et de latence. Cette étude dévoile les défis de l’adoption de la compression dans la conception de la mémoire. Le but de l’analyse n’est pas de résumer les propositions, mais de mettre en évidence les solutions qu’ils emploient pour relever ces défis. Une description détaillée des principales caractéristiques de plusieurs méthodes est fournie, ainsi que des critères qui peuvent être utilisés comme base pour l’évaluation de ces systèmes.Généralement, ces schémas ne sont pas très efficaces, et les schémas qui compressent bien décompressent lentement. Ce travail explore leur granularité pour redéfinir leurs perspectives et améliorer leur efficacité, à travers un concept appelé compression Region-Chunk. Son objectif est d’obtenir un haut (bon) taux de compression et une latence de décompression rapide. L’observation clé est qu’en subdivisant davantage les blocs de données compressés, on peut réduire la duplication des données. Ce concept peut être appliqué à plusieurs compresseurs précédemment proposés, entraînant une réduction de leur taille moyenne compressée. En particulier, un compresseur à décompression à cycle unique est boosté pour atteindre un niveau de compressibilité compétitif par rapport aux propositions de pointe.Enfin, pour augmenter la probabilité de co-allouer avec succès des lignes compressées, Pairwise Space Sharing (PSS) est proposé. PSS peutêtre appliqué orthogonalement aux méthodes de compactage sans pénalité de latence supplémentaire, et avec une surcharge de métadonnées rentable. Le système proposé (Region-Chunk + PSS) améliore encore la capacité normalisé moyenne du cache de 2,7% (moyenne géométrique), tout en offrant une courte latence de décompression

    An experimental study of reduced-voltage operation in modern FPGAs for neural network acceleration

    Get PDF
    We empirically evaluate an undervolting technique, i.e., underscaling the circuit supply voltage below the nominal level, to improve the power-efficiency of Convolutional Neural Network (CNN) accelerators mapped to Field Programmable Gate Arrays (FPGAs). Undervolting below a safe voltage level can lead to timing faults due to excessive circuit latency increase. We evaluate the reliability-power trade-off for such accelerators. Specifically, we experimentally study the reduced-voltage operation of multiple components of real FPGAs, characterize the corresponding reliability behavior of CNN accelerators, propose techniques to minimize the drawbacks of reduced-voltage operation, and combine undervolting with architectural CNN optimization techniques, i.e., quantization and pruning. We investigate the effect ofenvironmental temperature on the reliability-power trade-off of such accelerators. We perform experiments on three identical samples of modern Xilinx ZCU102 FPGA platforms with five state-of-the-art image classification CNN benchmarks. This approach allows us to study the effects of our undervolting technique for both software and hardware variability. We achieve more than 3X power-efficiency (GOPs/W ) gain via undervolting. 2.6X of this gain is the result of eliminating the voltage guardband region, i.e., the safe voltage region below the nominal level that is set by FPGA vendor to ensure correct functionality in worst-case environmental and circuit conditions. 43% of the power-efficiency gain is due to further undervolting below the guardband, which comes at the cost of accuracy loss in the CNN accelerator. We evaluate an effective frequency underscaling technique that prevents this accuracy loss, and find that it reduces the power-efficiency gain from 43% to 25%.The work done for this paper was partially supported by a HiPEAC Collaboration Grant funded by the H2020 HiPEAC Project under grant agreement No. 779656. The research leading to these results has received funding from the European Union’s Horizon 2020 Programme under the LEGaTO Project (www.legato-project.eu), grant agreement No. 780681.Peer ReviewedPostprint (author's final draft
    • …
    corecore