14 research outputs found

    Vector processing-aware advanced clock-gating techniques for low-power fused multiply-add

    Get PDF
    The need for power efficiency is driving a rethink of design decisions in processor architectures. While vector processors succeeded in the high-performance market in the past, they need a retailoring for the mobile market that they are entering now. Floating-point (FP) fused multiply-add (FMA), being a functional unit with high power consumption, deserves special attention. Although clock gating is a well-known method to reduce switching power in synchronous designs, there are unexplored opportunities for its application to vector processors, especially when considering active operating mode. In this research, we comprehensively identify, propose, and evaluate the most suitable clock-gating techniques for vector FMA units (VFUs). These techniques ensure power savings without jeopardizing the timing. We evaluate the proposed techniques using both synthetic and “real-world” application-based benchmarking. Using vector masking and vector multilane-aware clock gating, we report power reductions of up to 52%, assuming active VFU operating at the peak performance. Among other findings, we observe that vector instruction-based clock-gating techniques achieve power savings for all vector FP instructions. Finally, when evaluating all techniques together, using “real-world” benchmarking, the power reductions are up to 80%. Additionally, in accordance with processor design trends, we perform this research in a fully parameterizable and automated fashion.The research leading to these results has received funding from the RoMoL ERC Advanced Grant GA 321253 and is supported in part by the European Union (FEDER funds) under contract TTIN2015-65316-P. The work of I. Ratkovic was supported by a FPU research grant from the Spanish MECD.Peer ReviewedPostprint (author's final draft

    Adaptable register file organization for vector processors

    Get PDF
    Today there are two main vector processors design trends. On the one hand, we have vector processors designed for long vectors lengths such as the SX-Aurora TSUBASA which implements vector lengths of 256 elements (16384-bits). On the other hand, we have vector processors designed for short vectors such as the Fujitsu A64FX that implements vector lengths of 8 elements (512-bit) ARM SVE. However, short vector designs are the most widely adopted in modern chips. This is because, to achieve high-performance with a very high-efficiency, applications executed on long vector designs must feature abundant DLP, then limiting the range of applications. On the contrary, short vector designs are compatible with a larger range of applications. In fact, in the beginnings, long vector length implementations were focused on the HPC market, while short vector length implementations were conceived to improve performance in multimedia tasks. However, those short vector length extensions have evolved to better fit the needs of modern applications. In that sense, we believe that this compatibility with a large range of applications featuring high, medium and low DLP is one of the main reasons behind the trend of building parallel machines with short vectors. Short vector designs are area efficient and are "compatible" with applications having long vectors; however, these short vector architectures are not as efficient as longer vector designs when executing high DLP code. In this thesis, we propose a novel vector architecture that combines the area and resource efficiency characterizing short vector processors with the ability to handle large DLP applications, as allowed in long vector architectures. In this context, we present AVA, an Adaptable Vector Architecture designed for short vectors (MVL = 16 elements), capable of reconfiguring the MVL when executing applications with abundant DLP, achieving performance comparable to designs for long vectors. The design is based on three complementary concepts. First, a two-stage renaming unit based on a new type of registers termed as Virtual Vector Registers (VVRs), which are an intermediate mapping between the conventional logical and the physical and memory registers. In the first stage, logical registers are renamed to VVRs, while in the second stage, VVRs are renamed to physical registers. Second, a two-level VRF, that supports 64 VVRs whose MVL can be configured from 16 to 128 elements. The first level corresponds to the VVRs mapped in the physical registers held in the 8KB Physical Vector Register File (P-VRF), while the second level represents the VVRs mapped in memory registers held in the Memory Vector Register File (M-VRF). While the baseline configuration (MVL=16 elements) holds all the VVRs in the P-VRF, larger MVL configurations hold a subset of the total VVRs in the P-VRF, and map the remaining part in the M-VRF. Third, we propose a novel two-stage vector issue unit. In the first stage, the second level of mapping between the VVRs and physical registers is performed, while issuing to execute is managed in the second stage. This thesis also presents a set of tools for designing and evaluating vector architectures. First, a parameterizable vector architecture model implemented on the gem5 simulator to evaluate novel ideas on vector architectures. Second, a Vector Architecture model implemented on the McPAT framework to evaluate power and area metrics. Finally, the RiVEC benchmark suite, a collection of ten vectorized applications from different domains focusing on benchmarking vector microarchitectures.Hoy en día existen dos tendencias principales en el diseño de procesadores vectoriales. Por un lado, tenemos procesadores vectoriales basados en vectores largos como el SX-Aurora TSUBASA que implementa vectores con 256 elementos (16384-bits) de longitud. Por otro lado, tenemos procesadores vectoriales basados en vectores cortos como el Fujitsu A64FX que implementa vectores de 8 elementos (512-bits) de longitud ARM SVE. Sin embargo, los diseños de vectores cortos son los más adoptados en los chips modernos. Esto es porque, para lograr alto rendimiento con muy alta eficiencia, las aplicaciones ejecutadas en diseños de vectores largos deben presentar abundante paralelismo a nivel de datos (DLP), lo que limita el rango de aplicaciones. Por el contrario, los diseños de vectores cortos son compatibles con un rango más amplio de aplicaciones. En sus orígenes, implementaciones basadas en vectores largos estaban enfocadas al HPC, mientras que las implementaciones basadas en vectores cortos estaban enfocadas en tareas de multimedia. Sin embargo, esas extensiones basadas en vectores cortos han evolucionado para adaptarse mejor a las necesidades de las aplicaciones modernas. Creemos que esta compatibilidad con un mayor rango de aplicaciones es una de las principales razones de construir máquinas paralelas basadas en vectores cortos. Los diseños de vectores cortos son eficientes en área y son compatibles con aplicaciones que soportan vectores largos; sin embargo, estos diseños de vectores cortos no son tan eficientes como los diseños de vectores largos cuando se ejecuta un código con alto DLP. En esta tesis, proponemos una novedosa arquitectura vectorial que combina la eficiencia de área y recursos que caracteriza a los procesadores vectoriales basados en vectores cortos, con la capacidad de mejorar en rendimiento cuando se presentan aplicaciones con alto DLP, como lo permiten las arquitecturas vectoriales basadas en vectores largos. En este contexto, presentamos AVA, una Arquitectura Vectorial Adaptable basada en vectores cortos (MVL = 16 elementos), capaz de reconfigurar el MVL al ejecutar aplicaciones con abundante DLP, logrando un rendimiento comparable a diseños basados en vectores largos. El diseño se basa en tres conceptos. Primero, una unidad de renombrado de dos etapas basada en un nuevo tipo de registros denominados registros vectoriales virtuales (VVR), que son un mapeo intermedio entre los registros lógicos y físicos y de memoria. En la primera etapa, los registros lógicos se renombran a VVR, mientras que, en la segunda etapa, los VVR se renombran a registros físicos. En segundo lugar, un VRF de dos niveles, que admite 64 VVR cuyo MVL se puede configurar de 16 a 128 elementos. El primer nivel corresponde a los VVR mapeados en los registros físicos contenidos en el banco de registros vectoriales físico (P-VRF) de 8 KB, mientras que el segundo nivel representa los VVR mapeados en los registros de memoria contenidos en el banco de registros vectoriales de memoria (M-VRF). Mientras que la configuración de referencia (MVL=16 elementos) contiene todos los VVR en el P-VRF, las configuraciones de MVL más largos contienen un subconjunto del total de VVR en el P-VRF y mapean la parte restante en el M-VRF. En tercer lugar, proponemos una novedosa unidad de colas de emisión de dos etapas. En la primera etapa se realiza el segundo nivel de mapeo entre los VVR y los registros físicos, mientras que en la segunda etapa se gestiona la emisión de instrucciones a ejecutar. Esta tesis también presenta un conjunto de herramientas para diseñar y evaluar arquitecturas vectoriales. Primero, un modelo de arquitectura vectorial parametrizable implementado en el simulador gem5 para evaluar novedosas ideas. Segundo, un modelo de arquitectura vectorial implementado en McPAT para evaluar las métricas de potencia y área. Finalmente, presentamos RiVEC, una colección de diez aplicaciones vectorizadas enfocadas en evaluar arquitecturas vectorialesPostprint (published version

    Vector processor virtualization: distributed memory hierarchy and simultaneous multithreading

    Get PDF
    Taking advantage of DLP (Data-Level Parallelism) is indispensable in most data streaming and multimedia applications. Several architectures have been proposed to improve both the performance and energy consumption for such applications. Superscalar and VLIW (Very Long Instruction Word) processors, along with SIMD (Single-Instruction Multiple-Data) and vector processor (VP) accelerators, are among the available options for designers to accomplish their desired requirements. On the other hand, these choices turn out to be large resource and energy consumers, while also not being always used efficiently due to data dependencies among instructions and limited portion of vectorizable code in single applications that deploy them. This dissertation proposes an innovative architecture for a multithreaded VP which separates the path for performing data shuffle and memory-indexed accesses from the data path for executing other vector instructions that access the memory. This separation speeds up the most common memory access operations by avoiding extra delays and unnecessary stalls. In this multilane-based VP design, each vector lane uses its own private memory to avoid any stalls during memory access instructions. More importantly, the proposed VP has an innovative multithreaded architecture which makes it highly suitable for concurrent sharing in multicore environments. To this end, the VP which is developed in VHDL and prototyped on an FPGA (Field-Programmable Gate Array), serves as a coprocessor for one or more scalar cores in various system architectures presented in the dissertation. In the first system architecture, the VP is allocated exclusively to a single scalar core. Benchmarking shows that the VP can achieve very high performance. The inclusion of distributed data shuffle engines across vector lanes has a spectacular impact on the execution time, primarily for applications like FFT (Fast-Fourier Transform) that require large amounts of data shuffling. In the second system architecture, a VP virtualization technique is presented which, when applied, enables the multithreaded VP to simultaneously execute many threads of various vector lengths. The threads compete simultaneously for the VP resources having as a goal an improved aggregate VP utilization. This approach yields high VP utilization even under low utilization for the individual threads. A vector register file (VRF) virtualization technique dynamically allocates physical vector registers to running threads. The technique is implemented for a multi-core processor embedded in an FPGA. Under the dynamic creation of threads, benchmarking demonstrates large VP speedups and drastic energy savings when compared to the first system architecture. In the last system architecture, further improvements focus on VP virtualization relying exclusively on hardware. Moreover, a pipelined data shuffle network replaces the non-pipelined shuffle engines. The VP can then take advantage of identical instruction flows that may be present in different vector applications by running in a fused instruction mode that increases its utilization. A power dissipation model is introduced as well as two optimization policies towards minimizing the consumed energy, or the product of the energy and runtime for a given application. Benchmarking shows the positive impact of these optimizations

    Energy-aware synthesis for networks on chip architectures

    Full text link
    The Network on Chip (NoC) paradigm was introduced as a scalable communication infrastructure for future System-on-Chip applications. Designing application specific customized communication architectures is critical for obtaining low power, high performance solutions. Two significant design automation problems are the creation of an optimized configuration, given application requirement the implementation of this on-chip network. Automating the design of on-chip networks requires models for estimating area and energy, algorithms to effectively explore the design space and network component libraries and tools to generate the hardware description. Chip architects are faced with managing a wide range of customization options for individual components, routers and topology. As energy is of paramount importance, the effectiveness of any custom NoC generation approach lies in the availability of good energy models to effectively explore the design space. This thesis describes a complete NoC synthesis flow, called NoCGEN, for creating energy-efficient custom NoC architectures. Three major automation problems are addressed: custom topology generation, energy modeling and generation. An iterative algorithm is proposed to generate application specific point-to-point and packet-switched networks. The algorithm explores the design space for efficient topologies using characterized models and a system-level floorplanner for evaluating placement and wire-energy. Prior to our contribution, building an energy model required careful analysis of transistor or gate implementations. To alleviate the burden, an automated linear regression-based methodology is proposed to rapidly extract energy models for many router designs. The resulting models are cycle accurate with low-complexity and found to be within 10% of gate-level energy simulations, and execute several orders of magnitude faster than gate-level simulations. A hardware description of the custom topology is generated using a parameterizable library and custom HDL generator. Fully reusable and scalable network components (switches, crossbars, arbiters, routing algorithms) are described using a template approach and are used to compose arbitrary topologies. A methodology for building and composing routers and topologies using a template engine is described. The entire flow is implemented as several demonstrable extensible tools with powerful visualization functionality. Several experiments are performed to demonstrate the design space exploration capabilities and compare it against a competing min-cut topology generation algorithm

    The Customizable Virtual FPGA: Generation, System Integration and Configuration of Application-Specific Heterogeneous FPGA Architectures

    Get PDF
    In den vergangenen drei Jahrzehnten wurde die Entwicklung von Field Programmable Gate Arrays (FPGAs) stark von Moore’s Gesetz, Prozesstechnologie (Skalierung) und kommerziellen Märkten beeinflusst. State-of-the-Art FPGAs bewegen sich einerseits dem Allzweck näher, aber andererseits, da FPGAs immer mehr traditionelle Domänen der Anwendungsspezifischen integrierten Schaltungen (ASICs) ersetzt haben, steigen die Effizienzerwartungen. Mit dem Ende der Dennard-Skalierung können Effizienzsteigerungen nicht mehr auf Technologie-Skalierung allein zurückgreifen. Diese Facetten und Trends in Richtung rekonfigurierbarer System-on-Chips (SoCs) und neuen Low-Power-Anwendungen wie Cyber Physical Systems und Internet of Things erfordern eine bessere Anpassung der Ziel-FPGAs. Neben den Trends für den Mainstream-Einsatz von FPGAs in Produkten des täglichen Bedarfs und Services wird es vor allem bei den jüngsten Entwicklungen, FPGAs in Rechenzentren und Cloud-Services einzusetzen, notwendig sein, eine sofortige Portabilität von Applikationen über aktuelle und zukünftige FPGA-Geräte hinweg zu gewährleisten. In diesem Zusammenhang kann die Hardware-Virtualisierung ein nahtloses Mittel für Plattformunabhängigkeit und Portabilität sein. Ehrlich gesagt stehen die Zwecke der Anpassung und der Virtualisierung eigentlich in einem Konfliktfeld, da die Anpassung für die Effizienzsteigerung vorgesehen ist, während jedoch die Virtualisierung zusätzlichen Flächenaufwand hinzufügt. Die Virtualisierung profitiert aber nicht nur von der Anpassung, sondern fügt auch mehr Flexibilität hinzu, da die Architektur jederzeit verändert werden kann. Diese Besonderheit kann für adaptive Systeme ausgenutzt werden. Sowohl die Anpassung als auch die Virtualisierung von FPGA-Architekturen wurden in der Industrie bisher kaum adressiert. Trotz einiger existierenden akademischen Werke können diese Techniken noch als unerforscht betrachtet werden und sind aufstrebende Forschungsgebiete. Das Hauptziel dieser Arbeit ist die Generierung von FPGA-Architekturen, die auf eine effiziente Anpassung an die Applikation zugeschnitten sind. Im Gegensatz zum üblichen Ansatz mit kommerziellen FPGAs, bei denen die FPGA-Architektur als gegeben betrachtet wird und die Applikation auf die vorhandenen Ressourcen abgebildet wird, folgt diese Arbeit einem neuen Paradigma, in dem die Applikation oder Applikationsklasse fest steht und die Zielarchitektur auf die effiziente Anpassung an die Applikation zugeschnitten ist. Dies resultiert in angepassten anwendungsspezifischen FPGAs. Die drei Säulen dieser Arbeit sind die Aspekte der Virtualisierung, der Anpassung und des Frameworks. Das zentrale Element ist eine weitgehend parametrierbare virtuelle FPGA-Architektur, die V-FPGA genannt wird, wobei sie als primäres Ziel auf jeden kommerziellen FPGA abgebildet werden kann, während Anwendungen auf der virtuellen Schicht ausgeführt werden. Dies sorgt für Portabilität und Migration auch auf Bitstream-Ebene, da die Spezifikation der virtuellen Schicht bestehen bleibt, während die physische Plattform ausgetauscht werden kann. Darüber hinaus wird diese Technik genutzt, um eine dynamische und partielle Rekonfiguration auf Plattformen zu ermöglichen, die sie nicht nativ unterstützen. Neben der Virtualisierung soll die V-FPGA-Architektur auch als eingebettetes FPGA in ein ASIC integriert werden, das effiziente und dennoch flexible System-on-Chip-Lösungen bietet. Daher werden Zieltechnologie-Abbildungs-Methoden sowohl für Virtualisierung als auch für die physikalische Umsetzung adressiert und ein Beispiel für die physikalische Umsetzung in einem 45 nm Standardzellen Ansatz aufgezeigt. Die hochflexible V-FPGA-Architektur kann mit mehr als 20 Parametern angepasst werden, darunter LUT-Grösse, Clustering, 3D-Stacking, Routing-Struktur und vieles mehr. Die Auswirkungen der Parameter auf Fläche und Leistung der Architektur werden untersucht und eine umfangreiche Analyse von über 1400 Benchmarkläufen zeigt eine hohe Parameterempfindlichkeit bei Abweichungen bis zu ±95, 9% in der Fläche und ±78, 1% in der Leistung, was die hohe Bedeutung von Anpassung für Effizienz aufzeigt. Um die Parameter systematisch an die Bedürfnisse der Applikation anzupassen, wird eine parametrische Entwurfsraum-Explorationsmethode auf der Basis geeigneter Flächen- und Zeitmodellen vorgeschlagen. Eine Herausforderung von angepassten Architekturen ist der Entwurfsaufwand und die Notwendigkeit für angepasste Werkzeuge. Daher umfasst diese Arbeit ein Framework für die Architekturgenerierung, die Entwurfsraumexploration, die Anwendungsabbildung und die Evaluation. Vor allem ist der V-FPGA in einem vollständig synthetisierbaren generischen Very High Speed Integrated Circuit Hardware Description Language (VHDL) Code konzipiert, der sehr flexibel ist und die Notwendigkeit für externe Codegeneratoren eliminiert. Systementwickler können von verschiedenen Arten von generischen SoC-Architekturvorlagen profitieren, um die Entwicklungszeit zu reduzieren. Alle notwendigen Konstruktionsschritte für die Applikationsentwicklung und -abbildung auf den V-FPGA werden durch einen Tool-Flow für Entwurfsautomatisierung unterstützt, der eine Sammlung von vorhandenen kommerziellen und akademischen Werkzeugen ausnutzt, die durch geeignete Modelle angepasst und durch ein neues Werkzeug namens V-FPGA-Explorer ergänzt werden. Dieses neue Tool fungiert nicht nur als Back-End-Tool für die Anwendungsabbildung auf dem V-FPGA sondern ist auch ein grafischer Konfigurations- und Layout-Editor, ein Bitstream-Generator, ein Architekturdatei-Generator für die Place & Route Tools, ein Script-Generator und ein Testbenchgenerator. Eine Besonderheit ist die Unterstützung der Just-in-Time-Kompilierung mit schnellen Algorithmen für die In-System Anwendungsabbildung. Die Arbeit schliesst mit einigen Anwendungsfällen aus den Bereichen industrielle Prozessautomatisierung, medizinische Bildgebung, adaptive Systeme und Lehre ab, in denen der V-FPGA eingesetzt wird

    Sparsity in deep learning: Pruning and growth for efficient inference and training in neural networks

    Get PDF
    The growing energy and performance costs of deep learning have driven the community to reduce the size of neural networks by selectively pruning components. Similarly to their biological counterparts, sparse networks generalize just as well, sometimes even better than, the original dense networks. Sparsity promises to reduce the memory footprint of regular networks to fit mobile devices, as well as shorten training time for ever growing networks. In this paper, we survey prior work on sparsity in deep learning and provide an extensive tutorial of sparsification for both inference and training. We describe approaches to remove and add elements of neural networks, different training strategies to achieve model sparsity, and mechanisms to exploit sparsity in practice. Our work distills ideas from more than 300 research papers and provides guidance to practitioners who wish to utilize sparsity today, as well as to researchers whose goal is to push the frontier forward. We include the necessary background on mathematical methods in sparsification, describe phenomena such as early structure adaptation, the intricate relations between sparsity and the training process, and show techniques for achieving acceleration on real hardware. We also define a metric of pruned parameter efficiency that could serve as a baseline for comparison of different sparse networks. We close by speculating on how sparsity can improve future workloads and outline major open problems in the field

    A fully parameterizable low power design of vector fused multiply-add using active clock-gating techniques

    No full text
    The need for power-efficiency is driving a rethink of design decisions in processor architectures. While vector processors succeeded in the high-performance market in the past, they need a re-tailoring for the mobile market that they are entering now. Floating point fused multiply-add, being a power consuming functional unit, deserves special attention. Although clock-gating is a well-known method to reduce switching power in synchronous designs, there are unexplored opportunities for its application to vector processors, especially when considering active operating mode. In this research, we comprehensively identify, propose, and evaluate the most suitable clock-gating techniques for vector fused multiply-add units (VFU). These techniques ensure power savings without jeopardizing the timing. Using vector masking and vector multi-lane-aware clock-gating, we report power reductions of up to 52%, assuming active VFU operating at the peak performance. Among other findings, we observe that vector instruction-based clock-gating techniques achieve power savings for all vector floating-point instructions. We perform this research in a fully parameterizable and automated fashion using various tools at both architectural and circuit levels.The authors would like to thank to Borivoje Nikolic, Brian Richards, and Yunsup Lee for their useful advises and fruitful discussions. The research leading to these results has received funding from the RoMoL ERC Advanced Grant GA no 321253 and is supported in part by the European Union (FEDER funds) under contract TIN2015-65316-P. Ivan Ratkovic is supported by a FPU research grant from the Spanish MECD.Peer ReviewedPostprint (published version

    A fully parameterizable low power design of vector fused multiply-add using active clock-gating techniques

    No full text
    The need for power-efficiency is driving a rethink of design decisions in processor architectures. While vector processors succeeded in the high-performance market in the past, they need a re-tailoring for the mobile market that they are entering now. Floating point fused multiply-add, being a power consuming functional unit, deserves special attention. Although clock-gating is a well-known method to reduce switching power in synchronous designs, there are unexplored opportunities for its application to vector processors, especially when considering active operating mode. In this research, we comprehensively identify, propose, and evaluate the most suitable clock-gating techniques for vector fused multiply-add units (VFU). These techniques ensure power savings without jeopardizing the timing. Using vector masking and vector multi-lane-aware clock-gating, we report power reductions of up to 52%, assuming active VFU operating at the peak performance. Among other findings, we observe that vector instruction-based clock-gating techniques achieve power savings for all vector floating-point instructions. We perform this research in a fully parameterizable and automated fashion using various tools at both architectural and circuit levels.The authors would like to thank to Borivoje Nikolic, Brian Richards, and Yunsup Lee for their useful advises and fruitful discussions. The research leading to these results has received funding from the RoMoL ERC Advanced Grant GA no 321253 and is supported in part by the European Union (FEDER funds) under contract TIN2015-65316-P. Ivan Ratkovic is supported by a FPU research grant from the Spanish MECD.Peer Reviewe

    GPU PERFORMANCE MODELLING AND OPTIMIZATION

    Get PDF
    Ph.DNUS-TU/E JOINT PH.D

    FPGA-based high-performance neural network acceleration

    Full text link
    In the last ten years, Artificial Intelligence through Deep Neural Networks (DNNs) has penetrated virtually every aspect of science, technology, and business. Advances are rapid with thousands of papers being published annually. Many types of DNNs have been and continue to be developed -- in this thesis, we address Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Graph Neural Networks (GNNs) -- each with a different set of target applications and implementation challenges. The overall problem for all of these Neural Networks (NNs) is that their target applications generally pose stringent constraints on latency and throughput, but also have strict accuracy requirements. Much research has therefore gone into all aspects of improving NN quality and performance: algorithms, code optimization, acceleration with GPUs, and acceleration with hardware, both dedicated ASICs and off-the-shelf FPGAs. In this thesis, we concentrate on the last of these approaches. There have been many previous efforts in creating hardware to accelerate NNs. The problem designers face is that optimal NN models typically have significant irregularities, making them hardware unfriendly. One commonly used approach is to train NN models to follow regular computation and data patterns. This approach, however, can hurt the models' accuracy or lead to models with non-negligible redundancies. This dissertation takes a different approach. Instead of regularizing the model, we create architectures friendly to irregular models. Our thesis is that high-accuracy and high-performance NN inference and training can be achieved by creating a series of novel irregularity-aware architectures for Field-Programmable Gate Arrays (FPGAs). In four different studies on four different NN types, we find that this approach results in speedups of 2.1x to 3255x compared with carefully selected prior art; for inference, there is no change in accuracy. The bulk of this dissertation revolves around these studies, the various workload balancing techniques, and the resulting NN acceleration architectures. In particular, we propose four different architectures to handle, respectively, data structure level, operation level, bit level, and model level irregularities. At the data structure level, we propose AWB-GCN, which uses runtime workload rebalancing to handle Sparse Matrices Multiplications (SpMM) on extremely sparse and unbalanced input. With GNN inference as a case study, AWB-GCN achieves over 90% system efficiency, guarantees efficient off-chip memory access, and provides considerable speedups over CPUs (3255x), GPUs (80x), and a prior ASIC accelerator (5.1x). At the operation level, we propose O3BNN-R, which can detect redundant operations and prune them at run time. This works even for those that are highly data-dependent and unpredictable. With Binarized NNs (BNNs) as a case study, O3BNN-R can prune over 30% of the operations, without any accuracy loss, yielding speedups over state-of-the-art implementations on CPUs (1122x), GPUs (2.3x), and FPGAs (2.1x). At the bit level, we propose CQNN. CQNN embeds a Coarse-Grained Reconfigurable Architecture (CGRA) which can be programmed at runtime to support NN functions with various data-width requirements. Results show that CQNN can deliver us-level Quantized NN (QNN) inference. At the model level, we propose FPDeep, especially for training. In order to address model-level irregularity, FPDeep uses a novel model partitioning schemes to balance workload and storage among nodes. By using a hybrid of model and layer parallelism to train DNNs, FPDeep avoids the large gap that commonly occurs between training and testing accuracy due to the improper convergence to sharp minimizers (caused by large training batches). Results show that FPDeep provides scalable, fast, and accurate training and leads to 6.6x higher energy efficiency than GPUs
    corecore