509 research outputs found

    DORY: Automatic End-to-End Deployment of Real-World DNNs on Low-Cost IoT MCUs

    The deployment of Deep Neural Networks (DNNs) on end-nodes at the extreme edge of the Internet-of-Things is a critical enabler to support pervasive Deep Learning-enhanced applications. Low-cost MCU-based end-nodes have limited on-chip memory and often replace caches with scratchpads to reduce area overheads and increase energy efficiency, which requires explicit DMA-based memory transfers between different levels of the memory hierarchy. Mapping modern DNNs on these systems requires aggressive topology-dependent tiling and double-buffering. In this work, we propose DORY (Deployment Oriented to memoRY) - an automatic tool to deploy DNNs on low-cost MCUs with typically less than 1MB of on-chip SRAM memory. DORY abstracts tiling as a Constraint Programming (CP) problem: it maximizes L1 memory utilization under the topological constraints imposed by each DNN layer. Then, it generates ANSI C code to orchestrate off- and on-chip transfers and computation phases. Furthermore, to maximize speed, DORY augments the CP formulation with heuristics promoting performance-effective tile sizes. As a case study for DORY, we target GreenWaves Technologies GAP8, one of the most advanced parallel ultra-low power MCU-class devices on the market. On this device, DORY achieves up to 2.5x better MAC/cycle than the GreenWaves proprietary software solution and 18.1x better than the state-of-the-art result on an STM32-F746 MCU on single layers. Using our tool, GAP8 can perform end-to-end inference of a 1.0-MobileNet-128 network consuming just 63 pJ/MAC on average @ 4.3 fps - 15.4x better than an STM32-F746. We release all our developments - the DORY framework, the optimized backend kernels, and the related heuristics - as open-source software.
    Comment: 14 pages, 12 figures, 4 tables, 2 listings. Accepted for publication in IEEE Transactions on Computers (https://ieeexplore.ieee.org/document/9381618)
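The tiling idea in this abstract can be illustrated with a small sketch. It is not DORY's formulation: it replaces the Constraint Programming solver with a brute-force search, assumes a stride-1 K x K convolution with 8-bit data and no padding, tiles only over output channels and output rows, and uses the hypothetical names `tile_bytes` and `best_tile`. It only shows what "maximize L1 utilization under a double-buffering constraint" could look like.

```python
from itertools import product


def tile_bytes(c_out_t, h_out_t, c_in, w_out, k):
    """Bytes of L1 needed by one tile (assumed: stride 1, no padding, 1 byte per value)."""
    h_in_t = h_out_t + k - 1          # input rows needed to produce h_out_t output rows
    w_in = w_out + k - 1              # full input width (width is not tiled in this sketch)
    input_bytes = c_in * h_in_t * w_in
    output_bytes = c_out_t * h_out_t * w_out
    weight_bytes = c_out_t * c_in * k * k
    return input_bytes + output_bytes + weight_bytes


def best_tile(c_in, c_out, h_out, w_out, k, l1_budget):
    """Return (bytes_used, c_out_t, h_out_t) maximizing L1 use, or None if nothing fits.

    Double-buffering is modeled by requiring two copies of every tile buffer,
    so that the DMA can fill one copy while the cores compute on the other.
    """
    best = None
    for c_t, h_t in product(range(1, c_out + 1), range(1, h_out + 1)):
        used = 2 * tile_bytes(c_t, h_t, c_in, w_out, k)
        if used > l1_budget:
            continue                  # violates the memory constraint
        if best is None or used > best[0]:
            best = (used, c_t, h_t)
    return best


if __name__ == "__main__":
    # Hypothetical layer and a 64 KiB L1 budget, chosen only for illustration.
    print(best_tile(c_in=16, c_out=32, h_out=32, w_out=32, k=3, l1_budget=64 * 1024))
```

A real solver would add further constraints (for example, tile sizes that suit the backend kernels) and the performance heuristics the abstract mentions; the exhaustive loop here is only practical for small layers.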

    GPU optimizations for a production molecular docking code

    Thesis (M.Sc.Eng.) -- Boston University
    Scientists have always felt the desire to perform computationally intensive tasks that surpass the capabilities of conventional single-core computers. As a result of this trend, Graphics Processing Units (GPUs) have come to be increasingly used for general computation in scientific research. This field of GPU acceleration is now a vast and mature discipline. Molecular docking, the modeling of the interactions between two molecules, is a particularly computationally intensive task that has been the subject of research for many years. It is a critical simulation tool used for the screening of protein compounds for drug design and in research of the nature of life itself. The PIPER molecular docking program was previously accelerated using GPUs, achieving a notable speedup over the conventional single-core implementation. Since its original release the development of the CPU-based PIPER has not ceased, and it is now a mature and fast parallel code. The GPU version, however, still contains many potential points for optimization. In the current work, we present a new version of GPU PIPER that attains a 3.3x speedup over a parallel MPI version of PIPER running on an 8-core machine and using the optimized Intel Math Kernel Library. We achieve this speedup by optimizing existing kernels for modern GPU architectures and migrating critical code segments to the GPU. In particular, we both improve the runtime of the filtering and scoring stages by more than an order of magnitude, and move all molecular data permanently to the GPU to improve data locality. This new speedup is obtained while retaining a computational accuracy virtually identical to the CPU-based version. We also demonstrate that, due to the algorithmic dependencies of the PIPER algorithm on the 3D Fast Fourier Transform, our GPU PIPER will likely remain proportionally faster than equivalent CPU-based implementations, and with little room for further optimizations.
    This new GPU-accelerated version of PIPER is integrated as part of the ClusPro molecular docking and analysis server at Boston University. ClusPro has over 4000 registered users and more than 50000 jobs run over the past 4 years.
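One way to read the claim that there is "little room for further optimizations" is through Amdahl's law: once the filtering and scoring stages are fast, the stage that cannot be accelerated further bounds the overall gain. The sketch below is only an illustration of that law. The function name `amdahl_speedup` and the fractions used are hypothetical and are not measurements from the thesis.

```python
def amdahl_speedup(fixed_fraction, accel_factor):
    """Overall speedup when only the non-fixed part of the runtime is accelerated.

    fixed_fraction: share of the original runtime that cannot be sped up (0..1)
    accel_factor:   speedup applied to the remaining share (> 0)
    """
    return 1.0 / (fixed_fraction + (1.0 - fixed_fraction) / accel_factor)


if __name__ == "__main__":
    # Hypothetical split: half of the runtime is in a stage that is already optimal.
    for factor in (2, 10, 100, 1000):
        print(factor, round(amdahl_speedup(0.5, factor), 3))
```

With half of the runtime fixed, the overall speedup approaches but never reaches 2x, however fast the other stages become.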

    On the Real-Time Performance, Robustness and Accuracy of Medical Image Non-Rigid Registration

    Three critical issues in medical image non-rigid registration are performance, robustness and accuracy. A registration method that responds in a timely manner with an accurate alignment, and that is robust against variation of the image intensity and missing data, is desirable for clinical use. This work addresses all three of these issues. The unacceptable execution time of non-rigid registration (NRR) often presents a major obstacle to its routine clinical use. We present a hybrid data partitioning method to parallelize an NRR method on a cooperative architecture, which enables us to get closer to the goal: accelerating using architecture rather than designing a parallel algorithm from scratch. To further accelerate the performance of the GPU part, a GPU optimization tool is provided to automatically optimize the GPU execution configuration.
    Missing data and variation of the intensity are two severe challenges for the robustness of the registration method. A novel point-based NRR method is presented to resolve the mapping function (deformation field) when point correspondences are missing. The novelty of this method lies in incorporating a finite element biomechanical model into an Expectation and Maximization (EM) framework to resolve the correspondence and mapping function simultaneously. This method is extended to deal with the deformation induced by tumor resection, which imposes another challenge, i.e. incomplete intra-operative MRI. The registration is formulated as a three-variable (Correspondence, Deformation Field, and Resection Region) functional minimization problem and resolved by a Nested Expectation and Maximization framework. The experimental results show the effectiveness of this method in correcting the deformation in the vicinity of the tumor. To deal with the variation of the intensity, two different methods are developed depending on the specific application.
    For the mono-modality registration on delayed enhanced cardiac MRI and cine MRI, a hybrid registration method is designed by unifying both intensity- and feature point-based metrics into one cost function. The experiment on the moving propagation of suspicious myocardial infarction shows the effectiveness of this hybrid method. For the multi-modality registration on MRI and CT, a Mutual Information (MI)-based NRR is developed by modeling the underlying deformation as a Free-Form Deformation (FFD). MI is sensitive to the variation of the intensity due to equidistant bins. We overcome this disadvantage by designing a Top-to-Down K-means clustering method to naturally group similar intensities into one bin. The experiment shows this method can increase the accuracy of the MI-based registration.
    In image registration, a finite element biomechanical model is usually employed to simulate the underlying movement of the soft tissue. We develop a multi-tissue mesh generation method to build a heterogeneous biomechanical model to realistically simulate the underlying movement of the brain. We focus on the following four critical mesh properties: tissue-dependent resolution, fidelity to tissue boundaries, smoothness of mesh surfaces, and element quality. Each mesh property can be controlled on a tissue level. The experiments comparing the homogeneous model with the heterogeneous model demonstrate the effectiveness of the heterogeneous model in improving the registration accuracy.
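The idea of replacing equidistant histogram bins with clustered bins before computing Mutual Information can be sketched as follows. This is not the thesis's Top-to-Down K-means method: it substitutes a plain one-dimensional Lloyd's k-means as a stand-in, and the names `kmeans_1d`, `assign_bins`, and `mutual_information` as well as the sample intensities are hypothetical.

```python
import math
from collections import Counter


def kmeans_1d(values, k, iters=20):
    """Plain 1D Lloyd's k-means (a stand-in, not the Top-to-Down variant)."""
    lo, hi = min(values), max(values)
    # Start from evenly spread centers across the intensity range.
    centers = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Keep the old center when a cluster ends up empty.
        centers = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
    return centers


def assign_bins(values, centers):
    """Map each intensity to the index of its nearest cluster center."""
    return [min(range(len(centers)), key=lambda i: abs(v - centers[i])) for v in values]


def mutual_information(a_bins, b_bins):
    """MI (in nats) of two equally long sequences of bin labels."""
    n = len(a_bins)
    joint = Counter(zip(a_bins, b_bins))
    count_a, count_b = Counter(a_bins), Counter(b_bins)
    mi = 0.0
    for (a, b), c in joint.items():
        p_ab = c / n
        # p_ab / (p_a * p_b) rewritten with raw counts.
        mi += p_ab * math.log(c * n / (count_a[a] * count_b[b]))
    return mi


if __name__ == "__main__":
    # Made-up intensities at corresponding voxels of two images.
    fixed = [10, 12, 11, 200, 205, 198]
    moving = [50, 52, 51, 90, 92, 91]
    fixed_bins = assign_bins(fixed, kmeans_1d(fixed, 2))
    moving_bins = assign_bins(moving, kmeans_1d(moving, 2))
    print(mutual_information(fixed_bins, moving_bins))
```

Because the bins follow the clusters in the data rather than fixed-width intervals, small intensity shifts within a tissue class are less likely to move voxels across a bin boundary.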

    Automatic synthesis and optimization of chip multiprocessors

    Microprocessor technology has experienced enormous growth during the last decades. Rapid downscaling of CMOS technology has led to higher operating frequencies and performance densities, while running into the fundamental issue of power dissipation. Chip Multiprocessors (CMPs) have become the latest paradigm to improve the power-performance efficiency of computing systems by exploiting the parallelism inherent in applications. Industrial and prototype implementations have already demonstrated the benefits achieved by CMPs with hundreds of cores.
    CMP architects are challenged to make many complex design decisions. Only a few of them are:
    - What should be the ratio between the core and cache areas on a chip?
    - Which core architectures to select?
    - How many cache levels should the memory subsystem have?
    - Which interconnect topologies provide efficient on-chip communication?
    These and many other aspects create a complex multidimensional space for architectural exploration. Design Automation tools become essential to make the architectural exploration feasible under hard time-to-market constraints. The exploration methods have to be efficient and scalable to handle future generation on-chip architectures with hundreds or thousands of cores.
    Furthermore, once a CMP has been fabricated, the need for efficient deployment of the many-core processor arises. Intelligent techniques for task mapping and scheduling onto CMPs are necessary to guarantee the full usage of the benefits brought by many-core technology. These techniques have to consider the peculiarities of modern architectures, such as the availability of enhanced power saving techniques and the presence of complex memory hierarchies.
    This thesis has several objectives. The first objective is to elaborate methods for efficient analytical modeling and architectural design space exploration of CMPs. The efficiency is achieved by using analytical models instead of simulation, and by replacing exhaustive exploration with an intelligent search strategy. Additionally, these methods incorporate high-level models for physical planning. The related contributions are described in Chapters 3, 4 and 5 of the document.
    The second objective of this work is to propose a scalable task mapping algorithm onto general-purpose CMPs with power management techniques, for efficient deployment of many-core systems. This contribution is explained in Chapter 6 of this document.
    Finally, the third objective of this thesis is to address the issues of on-chip interconnect design and exploration, by developing a model for simultaneous topology customization and deadlock-free routing in Networks-on-Chip. The developed methodology can be applied to various classes of on-chip systems, ranging from general-purpose chip multiprocessors to application-specific solutions. Chapter 7 describes the proposed model.
    The presented methods have been thoroughly tested experimentally and the results are described in this dissertation. At the end of the document several possible directions for future research are proposed.