175 research outputs found

    Bit Fusion: Bit-Level Dynamically Composable Architecture for Accelerating Deep Neural Networks

    Full text link
    Fully realizing the potential of acceleration for Deep Neural Networks (DNNs) requires understanding and leveraging algorithmic properties. This paper builds upon the algorithmic insight that bitwidth of operations in DNNs can be reduced without compromising their classification accuracy. However, to prevent accuracy loss, the bitwidth varies significantly across DNNs and it may even be adjusted for each layer. Thus, a fixed-bitwidth accelerator would either offer limited benefits to accommodate the worst-case bitwidth requirements, or lead to a degradation in final accuracy. To alleviate these deficiencies, this work introduces dynamic bit-level fusion/decomposition as a new dimension in the design of DNN accelerators. We explore this dimension by designing Bit Fusion, a bit-flexible accelerator, that constitutes an array of bit-level processing elements that dynamically fuse to match the bitwidth of individual DNN layers. This flexibility in the architecture enables minimizing the computation and the communication at the finest granularity possible with no loss in accuracy. We evaluate the benefits of BitFusion using eight real-world feed-forward and recurrent DNNs. The proposed microarchitecture is implemented in Verilog and synthesized in 45 nm technology. Using the synthesis results and cycle accurate simulation, we compare the benefits of Bit Fusion to two state-of-the-art DNN accelerators, Eyeriss and Stripes. In the same area, frequency, and process technology, BitFusion offers 3.9x speedup and 5.1x energy savings over Eyeriss. Compared to Stripes, BitFusion provides 2.6x speedup and 3.9x energy reduction at 45 nm node when BitFusion area and frequency are set to those of Stripes. Scaling to GPU technology node of 16 nm, BitFusion almost matches the performance of a 250-Watt Titan Xp, which uses 8-bit vector instructions, while BitFusion merely consumes 895 milliwatts of power

    Parallelization of dynamic programming recurrences in computational biology

    Get PDF
    The rapid growth of biosequence databases over the last decade has led to a performance bottleneck in the applications analyzing them. In particular, over the last five years DNA sequencing capacity of next-generation sequencers has been doubling every six months as costs have plummeted. The data produced by these sequencers is overwhelming traditional compute systems. We believe that in the future compute performance, not sequencing, will become the bottleneck in advancing genome science. In this work, we investigate novel computing platforms to accelerate dynamic programming algorithms, which are popular in bioinformatics workloads. We study algorithm-specific hardware architectures that exploit fine-grained parallelism in dynamic programming kernels using field-programmable gate arrays: FPGAs). We advocate a high-level synthesis approach, using the recurrence equation abstraction to represent dynamic programming and polyhedral analysis to exploit parallelism. We suggest a novel technique within the polyhedral model to optimize for throughput by pipelining independent computations on an array. This design technique improves on the state of the art, which builds latency-optimal arrays. We also suggest a method to dynamically switch between a family of designs using FPGA reconfiguration to achieve a significant performance boost. We have used polyhedral methods to parallelize the Nussinov RNA folding algorithm to build a family of accelerators that can trade resources for parallelism and are between 15-130x faster than a modern dual core CPU implementation. A Zuker RNA folding accelerator we built on a single workstation with four Xilinx Virtex 4 FPGAs outperforms 198 3 GHz Intel Core 2 Duo processors. Furthermore, our design running on a single FPGA is an order of magnitude faster than competing implementations on similar-generation FPGAs and graphics processors. Our work is a step toward the goal of automated synthesis of hardware accelerators for dynamic programming algorithms

    EFFICIENTLY ACCELERATING SPARSE PROBLEMS BY ENABLING STREAM ACCESSES TO MEMORY USING HARDWARE/SOFTWARE TECHNIQUES

    Get PDF
    The objective of this research is to improve the performance of sparse problems that have a wide range of applications but still, suffer from serious challenges when running on modern computers. In summary, the challenges include the underutilization of available memory bandwidth because of lack of spatial locality, dependencies in computation, or slow mechanisms for decompressing the sparse data, and the underutilization of concurrent compute engines because of the distribution of non-zero values in sparse data. Our key insight to address the aforementioned challenges is that based on the type of the problem, we either use an intelligent reduction tree near memory to process data while gathering them from random locations of memory, transform the computations mathematically to extract more parallelism, modify the distribution of non-zero elements, or change the representation of sparse data. By applying such techniques, the execution adapts more effectively to given hardware resources. To this end, this research introduces hardware/software techniques to enable stream accesses to memory for accelerating four main categories of sparse problems including the inference of recommendation systems, iterative solvers of partial differential equations (PDEs), deep neural networks (DNNs), and graph algorithms.Ph.D

    Mapping applications onto FPGA-centric clusters

    Full text link
    High Performance Computing (HPC) is becoming increasingly important throughout science and engineering as ever more complex problems must be solved through computational simulations. In these large computational applications, the latency of communication between processing nodes is often the key factor that limits performance. An emerging alternative computer architecture that addresses the latency problem is the FPGA-centric cluster (FCC); in these systems, the devices (FPGAs) are directly interconnected and thus many layers of hardware and software are avoided. The result can be scalability not currently achievable with other technologies. In FCCs, FPGAs serve multiple functions: accelerator, network interface card (NIC), and router. Moreover, because FPGAs are configurable, there is substantial opportunity to tailor the router hardware to the application; previous work has demonstrated that such application-aware configuration can effect a substantial improvement in hardware efficiency. One constraint of FCCs is that it is convenient for their interconnect to be static, direct, and have a two or three dimensional mesh topology. Thus, applications that are naturally of a different dimensionality (have a different logical topology) from that of the FCC must be remapped to obtain optimal performance. In this thesis we study various aspects of the mapping problem for FCCs. There are two major research thrusts. The first is finding the optimal mapping of logical to physical topology. This problem has received substantial attention by both the theory community, where topology mapping is referred to as graph embedding, and by the High Performance Computing (HPC) community, where it is a question of process placement. We explore the implications of the different mapping strategies on communication behavior in FCCs, especially on resulting load imbalance. The second major research thrust is built around the hypothesis that applications that need to be remapped (due to differing logical and physical topologies) will have different optimal router configurations from those applications that do not. For example, due to remapping, some virtual or physical communication links may have little occupancy; therefore fewer resources should be allocated to them. Critical here is the creation of a new set of parameterized hardware features that can be configured to best handle load imbalances caused by remapping. These two thrusts form a codesign loop: certain mapping algorithms may be differentially optimal due to application-aware router reconfiguration that accounts for this mapping. This thesis has four parts. The first part introduces the background and previous work related to communication in general and, in particular, how it is implemented in FCCs. We build on previous work on application-aware router configuration. The second part introduces topology mapping mechanisms including those derived from graph embeddings and a greedy algorithm commonly used in HPC. In the third part, topology mappings are evaluated for performance and imbalance; we note that different mapping strategies lead to different imbalances both in the overall network and in each node. The final part introduces reconfigure router design that allocates resources based on different imbalance situations caused by different mapping behaviors

    Etude et mise en place d’une plateforme d’adaptation multiservice embarquée pour la gestion de flux multimédia à différents niveaux logiciels et matériels

    Get PDF
    On the one hand, technology advances have led to the expansion of the handheld devices market. Thanks to this expansion, people are more and more connected and more and more data are exchanged over the Internet. On the other hand, this huge amound of data imposes drastic constrains in order to achieve sufficient quality. The Internet is now showing its limits to assure such quality. To answer nowadays limitations, a next generation Internet is envisioned. This new network takes into account the content nature (video, audio, ...) and the context (network state, terminal capabilities ...) to better manage its own resources. To this extend, video manipulation is one of the key concept that is highlighted in this arising context. Video content is more and more consumed and at the same time requires more and more resources. Adapting videos to the network state (reducing its bitrate to match available bandwidth) or to the terminal capabilities (screen size, supported codecs, …) appears mandatory and is foreseen to take place in real time in networking devices such as home gateways. However, video adaptation is a resource intensive task and must be implemented using hardware accelerators to meet the desired low cost and real time constraints.In this thesis, content- and context-awareness is first analyzed to be considered at the network side. Secondly, a generic low cost video adaptation system is proposed and compared to existing solutions as a trade-off between system complexity and quality. Then, hardware conception is tackled as this system is implemented in an FPGA based architecture. Finally, this system is used to evaluate the indirect effects of video adaptation; energy consumption reduction is achieved at the terminal side by reducing video characteristics thus permitting an increased user experience for End-Users.Les avancées technologiques ont permis la commercialisation à grande échelle de terminaux mobiles. De ce fait, l’homme est de plus en plus connecté et partout. Ce nombre grandissant d’usagers du réseau ainsi que la forte croissance du contenu disponible, aussi bien d’un point de vue quantitatif que qualitatif saturent les réseaux et l’augmentation des moyens matériels (passage à la fibre optique) ne suffisent pas. Pour surmonter cela, les réseaux doivent prendre en compte le type de contenu (texte, vidéo, ...) ainsi que le contexte d’utilisation (état du réseau, capacité du terminal, ...) pour assurer une qualité d’expérience optimum. A ce sujet, la vidéo fait partie des contenus les plus critiques. Ce type de contenu est non seulement de plus en plus consommé par les utilisateurs mais est aussi l’un des plus contraignant en terme de ressources nécéssaires à sa distribution (taille serveur, bande passante, …). Adapter un contenu vidéo en fonction de l’état du réseau (ajuster son débit binaire à la bande passante) ou des capacités du terminal (s’assurer que le codec soit nativement supporté) est indispensable. Néanmoins, l’adaptation vidéo est un processus qui nécéssite beaucoup de ressources. Cela est antinomique à son utilisation à grande echelle dans les appareils à bas coûts qui constituent aujourd’hui une grande part dans l’ossature du réseau Internet. Cette thèse se concentre sur la conception d’un système d’adaptation vidéo à bas coût et temps réel qui prendrait place dans ces réseaux du futur. Après une analyse du contexte, un système d’adaptation générique est proposé et évalué en comparaison de l’état de l’art. Ce système est implémenté sur un FPGA afin d’assurer les performances (temps-réels) et la nécessité d’une solution à bas coût. Enfin, une étude sur les effets indirects de l’adaptation vidéo est menée
    corecore