
    Accelerating Large Kernel Convolutions with Nested Winograd Transformation

    Recent literature has shown that convolutional neural networks (CNNs) with large kernels outperform vision transformers (ViTs) and CNNs with stacked small kernels in many computer vision tasks, such as object detection and image restoration. The Winograd transformation helps reduce the number of repetitive multiplications in convolution and is widely supported by many commercial AI processors. Researchers have proposed accelerating large kernel convolutions by linearly decomposing them into many small kernel convolutions and then sequentially accelerating each small kernel convolution with the Winograd algorithm. This work proposes a nested Winograd algorithm that iteratively decomposes a large kernel convolution into small kernel convolutions and proves it to be more effective than the linear decomposition Winograd algorithm. Experiments show that, compared to the linear decomposition Winograd algorithm, the proposed algorithm reduces the total number of multiplications by 1.4 to 10.5 times for computing 4x4 to 31x31 convolutions. Comment: published, see https://ieeexplore.ieee.org/document/1032193
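    For context, the sketch below shows the standard 1-D Winograd F(2,3) minimal-filtering building block that such decompositions rely on: it computes two outputs of a 3-tap correlation with 4 multiplications instead of 6. The nested recursion of the paper (treating a large kernel as a small kernel of small kernels) is not reproduced here, and the matrices are the textbook F(2,3) transforms, not taken from the paper.

        import numpy as np

        # Textbook Winograd F(2,3) transforms: y = AT @ ((G @ g) * (BT @ d))
        BT = np.array([[1,  0, -1,  0],
                       [0,  1,  1,  0],
                       [0, -1,  1,  0],
                       [0,  1,  0, -1]], dtype=float)
        G = np.array([[1.0,  0.0, 0.0],
                      [0.5,  0.5, 0.5],
                      [0.5, -0.5, 0.5],
                      [0.0,  0.0, 1.0]])
        AT = np.array([[1, 1,  1,  0],
                       [0, 1, -1, -1]], dtype=float)

        def winograd_f23(d, g):
            """Two outputs of a 3-tap cross-correlation from a 4-sample tile,
            using 4 element-wise multiplications instead of 6."""
            return AT @ ((G @ g) * (BT @ d))

        d = np.random.rand(4)   # input tile
        g = np.random.rand(3)   # 3-tap kernel
        direct = np.array([d[0:3] @ g, d[1:4] @ g])
        assert np.allclose(winograd_f23(d, g), direct)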

    A Construction Kit for Efficient Low Power Neural Network Accelerator Designs

    Implementing embedded neural network processing at the edge requires efficient hardware acceleration that couples high computational performance with low power consumption. Driven by the rapid evolution of network architectures and their algorithmic features, accelerator designs are constantly updated and improved. To evaluate and compare hardware design choices, designers can refer to a myriad of accelerator implementations in the literature. Surveys provide an overview of these works but are often limited to system-level and benchmark-specific performance metrics, making it difficult to quantitatively compare the individual effect of each utilized optimization technique. This complicates the evaluation of optimizations for new accelerator designs, slowing down research progress. This work provides a survey of neural network accelerator optimization approaches that have been used in recent works and reports their individual effects on edge processing performance. It presents the list of optimizations and their quantitative effects as a construction kit, allowing designers to assess the design choices for each building block separately. Reported optimizations range from up to 10,000x memory savings to 33x energy reductions, providing chip designers with an overview of design choices for implementing efficient low power neural network accelerators.

    eCNN: A Block-Based and Highly-Parallel CNN Accelerator for Edge Inference

    Convolutional neural networks (CNNs) have recently demonstrated superior quality for computational imaging applications. Therefore, they have great potential to revolutionize the image pipelines on cameras and displays. However, it is difficult for conventional CNN accelerators to support ultra-high-resolution videos at the edge due to their considerable DRAM bandwidth and power consumption. Therefore, finding a further memory- and computation-efficient microarchitecture is crucial to speed up this coming revolution. In this paper, we approach this goal by considering the inference flow, network model, instruction set, and processor design jointly to optimize hardware performance and image quality. We apply a block-based inference flow which can eliminate all the DRAM bandwidth for feature maps and accordingly propose a hardware-oriented network model, ERNet, to optimize image quality based on hardware constraints. Then we devise a coarse-grained instruction set architecture, FBISA, to support power-hungry convolution by massive parallelism. Finally, we implement an embedded processor, eCNN, which accommodates ERNet and FBISA with a flexible processing architecture. Layout results show that it can support high-quality ERNets for super-resolution and denoising at up to 4K Ultra-HD 30 fps while using only DDR-400 and consuming 6.94W on average. By comparison, the state-of-the-art Diffy uses dual-channel DDR3-2133 and consumes 54.3W to support lower-quality VDSR at Full HD 30 fps. Lastly, we also present application examples of high-performance style transfer and object recognition to demonstrate the flexibility of eCNN. Comment: 14 pages; appearing in IEEE/ACM International Symposium on Microarchitecture (MICRO), 2019
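    As a rough illustration of the block-based idea (a generic tiled-inference sketch, not eCNN's actual dataflow or ERNet), the snippet below runs a small stack of 3x3 convolutions one tile at a time, padding each tile with a halo wide enough to cover the layers' receptive field, so intermediate feature maps only ever exist at tile granularity and the result still matches full-frame processing.

        import numpy as np
        from scipy.signal import correlate2d

        def conv3x3(x, k):
            return correlate2d(x, k, mode="same", boundary="fill")

        def run_layers(x, kernels):
            for k in kernels:
                x = np.maximum(conv3x3(x, k), 0.0)   # conv + ReLU
            return x

        def block_based(x, kernels, tile=32):
            halo = len(kernels)                      # 1 extra pixel per 3x3 layer
            out = np.zeros_like(x)
            H, W = x.shape
            for i in range(0, H, tile):
                for j in range(0, W, tile):
                    i0, j0 = max(i - halo, 0), max(j - halo, 0)
                    i1, j1 = min(i + tile + halo, H), min(j + tile + halo, W)
                    y = run_layers(x[i0:i1, j0:j1], kernels)   # tile stays "on chip"
                    out[i:i + tile, j:j + tile] = y[i - i0:i - i0 + tile,
                                                    j - j0:j - j0 + tile]
            return out

        x = np.random.rand(64, 64)
        ks = [np.random.rand(3, 3) for _ in range(2)]
        assert np.allclose(block_based(x, ks), run_layers(x, ks))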

    A Survey on Design Methodologies for Accelerating Deep Learning on Heterogeneous Architectures

    In recent years, the field of Deep Learning has seen many disruptive and impactful advancements. Given the increasing complexity of deep neural networks, the need for efficient hardware accelerators in heterogeneous HPC platforms has become increasingly pressing. The design of Deep Learning accelerators requires a multidisciplinary approach, combining expertise from several areas, spanning from computer architecture to approximate computing, computational models, and machine learning algorithms. Several methodologies and tools have been proposed to design accelerators for Deep Learning, including hardware-software co-design approaches, high-level synthesis methods, specific customized compilers, and methodologies for design space exploration, modeling, and simulation. These methodologies aim to maximize the exploitable parallelism and minimize data movement to achieve high performance and energy efficiency. This survey provides a holistic review of the most influential design methodologies and EDA tools proposed in recent years to implement Deep Learning accelerators, offering the reader a wide perspective on this rapidly evolving field. In particular, this work complements the previous survey by the same authors in [203], which focuses on Deep Learning hardware accelerators for heterogeneous HPC platforms.

    Methodology for complex dataflow application development

    This thesis addresses problems inherent to the development of complex applications for reconfigurable systems. Many projects fail to complete or take much longer than originally estimated by relying on traditional iterative software development processes typically used with conventional computers. Even though designer productivity can be increased by abstract programming and execution models, e.g., dataflow, development methodologies considering the specific properties of reconfigurable systems do not exist. The first contribution of this thesis is a design methodology to facilitate systematic development of complex applications using reconfigurable hardware in the context of High-Performance Computing (HPC). The proposed methodology is built upon a careful analysis of the original application, a software model of the intended hardware system, an analytical prediction of performance and on-chip area usage, and an iterative architectural refinement to resolve identified bottlenecks before writing a single line of code targeting the reconfigurable hardware. It is successfully validated using two real applications, and both achieve state-of-the-art performance. The second contribution extends this methodology to provide portability between devices in two steps. First, additional tool support for contemporary multi-die Field-Programmable Gate Arrays (FPGAs) is developed. An algorithm to automatically map logical memories to heterogeneous physical memories with special attention to die boundaries is proposed. As a result, only the proposed algorithm managed to successfully place and route all designs used in the evaluation, while the second-best algorithm failed on one third of all large applications. Second, best practices for performance portability between different FPGA devices are collected and evaluated on a financial use case, showing efficient resource usage on five different platforms. The third contribution applies the extended methodology to a real, highly demanding emerging application from the radiotherapy domain. A Monte-Carlo based simulation of dose accumulation in human tissue is accelerated using the proposed methodology to meet the real-time requirements of adaptive radiotherapy.
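    As a toy illustration of the kind of pre-implementation analytical prediction the methodology relies on (a generic roofline-style estimate with made-up board figures, not the model from the thesis), the sketch below checks whether a kernel would be bound by DRAM bandwidth or by the compute roof before any hardware code is written.

        # Roofline-style estimate: the peak and bandwidth figures below are
        # assumptions for illustration only (hypothetical 500 GFLOP/s, 12 GB/s).
        def attainable_gflops(flops_per_byte, peak_gflops, bw_gb_s):
            roof = min(peak_gflops, bw_gb_s * flops_per_byte)
            bound = "compute" if roof >= peak_gflops else "memory"
            return roof, bound

        # e.g. a stencil doing 8 FLOPs per 8 bytes moved -> intensity of 1 FLOP/byte
        roof, bound = attainable_gflops(flops_per_byte=1.0,
                                        peak_gflops=500.0, bw_gb_s=12.0)
        print(f"attainable ~{roof:.0f} GFLOP/s ({bound}-bound)")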

    Hardware-aware design, search, and optimization of deep neural networks

    Deep Learning has achieved remarkable progress in the last decade due to its powerful automatic representation capability for a variety of tasks, such as Image Recognition, Speech Recognition, and Machine Translation. This success is tied to network design, which is crucial to feature representation and has led to many innovative architectures such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Graph Neural Networks (GNNs), and Transformers. A wide range of hardware platforms is available to accelerate the performance of Deep Neural Networks (DNNs), ranging from general-purpose hardware such as CPUs to special-purpose devices such as the Tensor Processing Unit (TPU). High-performance computing systems such as GPUs effectively reduce the computation time of DNNs. Due to the slowing down of Moore's law, research into Domain-Specific Hardware, which excels at its assigned tasks, has gained significance. It is therefore not straightforward to choose a platform that works in all scenarios, as the choice depends on the application and environment. Neural Architecture Search (NAS), a subset of Automatic Machine Learning (AutoML), is a method to automate the design of a neural network architecture for a given task and dataset without significant human intervention, saving the researcher's manual effort and computation time. Hardware-aware Neural Architecture Search (HW-NAS) is a class of problems whose goal is to search for networks that are not only accurate on the given dataset but also hardware-efficient in terms of latency, energy, size, etc. The resulting searched models outperform manually designed networks in several aspects, such as model performance and inference latency on the actual hardware. NAS and HW-NAS have been very successful in finding efficient models that achieve state-of-the-art performance on many tasks, such as Image Classification, Object Detection, and Machine Translation. Pruning and Quantization are two important techniques for designing lightweight, memory-efficient, and hardware-friendly models for inference on a variety of devices such as CPUs, GPUs, ASICs, and FPGAs. These methods compress large networks into smaller models with negligible loss in accuracy or task performance. Neural Network Pruning removes redundant or unimportant parameters (weights, nodes, neurons, or filters) whose removal does not significantly hurt model performance, thereby reducing the size and computational complexity of a model. Network Quantization converts high-precision model weights/parameters (32-bit floating point) to low precision (8-bit or 4-bit integers). Quantization has attracted much attention in academia and industry because inference can be performed at low precision with a negligible drop in accuracy, as opposed to training, where a model is trained at high precision. Weight (element-wise) pruning shrinks a DNN model significantly and introduces considerable sparsity in the weight matrices. The uniform systolic arrays in the TPU and the Tensor Cores in the Volta and Turing GPU architectures are not explicitly designed to accelerate such sparse matrices, so the speedup due to weight pruning is negligible despite removing 90% of the parameters. Later, several node pruning methods were developed to resolve these sparsity bottlenecks.
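    As a minimal illustration of the quantization step described above (symmetric per-tensor INT8 with a single scale factor; no particular toolchain is implied), the sketch below quantizes one layer's FP32 weights and measures the round-trip error.

        import numpy as np

        def quantize_int8(w):
            scale = np.max(np.abs(w)) / 127.0                       # per-tensor scale
            q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
            return q, scale

        def dequantize(q, scale):
            return q.astype(np.float32) * scale

        w = np.random.randn(64, 64).astype(np.float32)              # one layer's weights
        q, s = quantize_int8(w)
        err = np.abs(dequantize(q, s) - w).max()
        print(f"int8: {q.nbytes} B vs fp32: {w.nbytes} B, max abs error {err:.4f}")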
    However, these methods do not consider the underlying hardware dimensions (size of the array, number of CPUs) or Tensor Core precision, leading to suboptimal performance. We develop the Hardware Dimension Aware Pruning (HDAP) method for array-based accelerators, multi-core CPUs, and Tensor Core-enabled GPUs by considering the underlying dimensions of the system. The node-pruned networks using the HDAP method achieved average speedups of 3.2x and 4.2x, whereas the baseline method attained average speedups of only 1.5x and 1.6x, on the Turing Tensor Core GPU and the Eyeriss architecture, respectively. Hardware systems are often prone to soft errors or permanent faults due to external conditions or internal scaling. Much work has been done in the past on systolic array implementations and their reliability concerns. However, their fault tolerance with respect to DNNs is not yet fully understood through a fault model. In our work, we first present a fault model, i.e., the different sequences in which faults can occur on the systolic array, and co-design a fault-based and array-size-based pruning (FPAP) algorithm with the intent of bypassing the faults and removing internal redundancy at the same time for efficient inference. Tensor Cores in the Nvidia Ampere 100 (A100) GPU support (1) 2:4 fine-grained sparse pruning, where 2 out of every 4 elements are pruned, and (2) traditional dense multiplication, to achieve a good accuracy and performance trade-off. The A100 Tensor Core also takes advantage of 1-bit, 4-bit, and 8-bit multiplication to speed up model inference. Hence, finding the right matrix type (dense or 2:4 sparse) along with the precision for each layer becomes a combinatorial problem. Neural Architecture Search (NAS) can alleviate such problems by automating the architecture design process instead of a brute-force search. In this work, we propose (i) Mixed Sparse and Precision Search (MSPS), a NAS framework to search for efficient sparse and mixed-precision quantized models within a predefined search space and fixed backbone neural network (e.g., ResNet50), and (ii) Architecture, Sparse and Precision Search (ASPS) to jointly search for the kernel size, the number of filters, and the sparse-precision combination of each layer. We illustrate the effectiveness of our methods targeting the A100 Tensor Core on Nvidia GPUs by searching for efficient sparse, mixed-precision networks based on ResNet50, achieving better accuracy-latency trade-offs than manually designed uniform sparse Int8 networks.
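    The snippet below illustrates only the 2:4 fine-grained sparsity pattern mentioned above (within every group of four consecutive weights, the two smallest-magnitude entries are zeroed); it is a generic magnitude-based projection onto that pattern, not the HDAP, FPAP, MSPS, or ASPS methods themselves.

        import numpy as np

        def prune_2_of_4(w):
            """Zero the 2 smallest-magnitude weights in every group of 4 along the last axis."""
            assert w.shape[-1] % 4 == 0
            groups = w.reshape(-1, 4)
            drop = np.argsort(np.abs(groups), axis=1)[:, :2]    # 2 smallest per group
            mask = np.ones_like(groups)
            np.put_along_axis(mask, drop, 0.0, axis=1)
            return (groups * mask).reshape(w.shape)

        w = np.random.randn(8, 16)
        sparse = prune_2_of_4(w)
        # every group of 4 now has at most 2 non-zeros, the layout Tensor Cores accelerate
        assert np.all((sparse.reshape(-1, 4) != 0).sum(axis=1) <= 2)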

    Visual Analysis Algorithms for Embedded Systems

    Visual search systems are very popular applications, but on-line versions in 3G wireless environments suffer from network constraints like unstable or limited bandwidth that introduce latency in query delivery, significantly degrading the user's experience. An alternative is to exploit the ability of the newest mobile devices to perform heterogeneous activities, not only creating but also processing images. Visual feature extraction and compression can be performed on on-board Graphics Processing Units (GPUs), making smartphones capable of detecting a generic object exactly (matching) or of performing a classification activity. The latest trends in visual search have resulted in dedicated efforts in MPEG standardization, namely the MPEG CDVS (Compact Descriptor for Visual Search) standard, an ISO/IEC standard used to extract a compressed descriptor. As regards classification, in recent years neural networks have acquired impressive importance and have been applied to several domains. This thesis focuses on the use of deep neural networks to classify images by means of deep learning. Implementing visual search algorithms and deep learning-based classification on embedded environments is not a mere code-porting activity. Recent embedded devices and development boards are equipped with powerful but limited resources, such as GPGPUs. GPU architectures fit particularly well because they allow many operations to be executed in parallel, following the SIMD (Single Instruction Multiple Data) paradigm. Nonetheless, it is necessary to make good design choices for the best use of the available hardware and memory. For visual search, following the MPEG CDVS standard, the contribution of this thesis is an efficient feature computation phase and a parallel CDVS detector, completely implemented on embedded devices supporting the OpenCL framework. Algorithmic choices and implementation details to target the intrinsic characteristics of the selected embedded platforms are presented and discussed. Experimental results on several GPUs show that the GPU-based solution is up to 7× faster than the CPU-based one. This speed-up opens new visual search scenarios with entirely real-time on-board computation and no data transfer. As regards deep convolutional neural networks for off-line image classification, their computational and memory requirements are huge, which is an issue on embedded devices. Most of the complexity derives from the convolutional layers and in particular from the matrix multiplications they entail. The contribution of this thesis is a self-contained implementation for image classification providing the common layers used in neural networks. The approach relies on a heterogeneous CPU-GPU scheme for performing convolutions in the transform domain. Experimental results show that the heterogeneous scheme described in this thesis boasts a 50× speedup over the CPU-only reference and outperforms a GPU-based reference by 2×, while slashing power consumption by nearly 30%.
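    The sketch below shows only the general idea of transform-domain convolution referred to above (a spatial convolution becomes an element-wise product after an FFT), using NumPy/SciPy on the CPU; it is not the thesis's heterogeneous CPU-GPU implementation.

        import numpy as np
        from scipy.signal import convolve2d, fftconvolve

        x = np.random.rand(128, 128)   # input feature map
        k = np.random.rand(7, 7)       # convolution kernel

        direct = convolve2d(x, k, mode="valid")      # spatial-domain convolution
        via_fft = fftconvolve(x, k, mode="valid")    # FFT -> pointwise product -> inverse FFT
        assert np.allclose(direct, via_fft)

        # the same thing spelled out: zero-pad to the full output size (128 + 7 - 1),
        # multiply the spectra element-wise, and transform back
        n = 128 + 7 - 1
        full = np.fft.irfft2(np.fft.rfft2(x, s=(n, n)) * np.fft.rfft2(k, s=(n, n)), s=(n, n))
        assert np.allclose(full[6:128, 6:128], direct)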

    Enabling on-device domain adaptation of convolutional neural networks

    Convolutional Neural Networks (CNNs) are used ubiquitously in computer vision applications ranging from image classification to video-stream object detection. However, due to the large memory and compute costs of executing CNNs, specialised hardware such as GPUs or ASICs is required to perform both CNN inference and training within reasonable time and memory budgets. Consequently, most applications today perform both CNN inference and training on servers, where user data is sent from an edge device back to a server for processing. This raises data privacy concerns and places a strict requirement on good edge-server communication links. Recently, with improvements in the specialised hardware (especially GPUs) available on edge devices, an increasing number of applications have moved the inference stage onto the edge, but few to none have considered performing training on an edge device. With a focus on CNNs used for image classification, the work in this PhD explores when it would be useful to perform retraining of networks on an edge device, what the gains of doing so would be, and how one can perform such training even in resource-constrained settings. This exploration begins with the assumption that the classes observed by the model upon deployment are a subset of the classes present in the dataset used to train the model initially. This scenario is simulated by constructing semantically meaningful subsets of classes from existing large image classification datasets (e.g., ImageNet) and exploring the gains, in terms of classification accuracy and the memory consumption and latency of the inference and training stages, that can be achieved by pruning (architecture modification) and retraining (weight adaptation) a deployed network to the observed class distribution. The exploration is split into three stages. First, an oracle is constructed that predicts the gains achievable by pruning and retraining a network under the assumption that we know the exact label of each image observed upon deployment and have no hardware resource constraints. This demonstrates the accuracy and performance gains that can theoretically be achieved per network and subset combination. The significant gains demonstrated here for certain subsets of data motivate the remainder of the work in this PhD. The works that follow explore ways to perform such adaptation on hardware that is resource-constrained and when there is uncertainty in the labels of the observed data points used to perform this adaptation. Pruning is utilised to enable training on resource-constrained hardware by reducing the memory and latency footprints of the training process. In doing so, it was observed that, depending on the manner in which a network is pruned, a set of networks that all consume the same amount of memory for storing weights can each have drastically different latencies and memory consumption while performing training. Hence, the size of a stored model is not a useful predictor of which networks can be feasibly trained within edge hardware resource budgets. To cater for this, a novel, accurate, data-driven model for predicting the training memory consumption and latency of a network on a specific target hardware and execution framework (PyTorch, TensorFlow, etc.) combination is proposed.
    Doing so enables the selection of a pruned network whose training memory consumption and latency fit within the budgets dictated by the target hardware and application, which then allows the network to be adapted to the observed data distribution. An additional benefit of the proposed data-driven model is that new models specific to each network, hardware, and execution framework combination can be created rapidly. Finally, the analysis is extended to account for uncertainty in the class labels of the observed data distribution. This uncertainty in the label distribution can negatively impact any attempt to retrain the network. To combat this, a novel Variational Auto-Encoder (VAE) based retraining methodology is proposed that uses uncertain predictions of an image's label to adapt the weights of the network to the observed data distribution on-device. In doing so, the work in this PhD answers the questions of why we should aim to train a network on the edge, how we can select networks that fit within the available hardware resource constraints, and how we can account for the uncertainty in labels that arises when ground-truth labels are unavailable during training. We also propose possible future research directions that could extend and adapt the ideas of this thesis to other applications.
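    As a rough sketch of the kind of data-driven cost predictor described above (the thesis's actual model is not reproduced; the features, measurements, and target device below are all hypothetical), one can fit a simple regression from per-network features to measured training-step latency and use it to screen pruned candidates without running them.

        import numpy as np

        # hypothetical per-candidate features: [MFLOPs, params (M), peak activations (MB)]
        X = np.array([[520.0, 11.2, 38.0],
                      [310.0,  6.8, 29.5],
                      [150.0,  3.1, 17.2],
                      [ 90.0,  1.9, 12.4]])
        y = np.array([41.0, 27.5, 15.8, 10.1])   # measured ms per training step (made up)

        # least-squares fit with an intercept term
        A = np.hstack([X, np.ones((len(X), 1))])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)

        candidate = np.array([200.0, 4.0, 20.0, 1.0])   # an unseen pruned variant
        print(f"predicted ~{candidate @ coef:.1f} ms per training step")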

    Implementazione ed ottimizzazione di algoritmi per l'analisi di Biomedical Big Data (Implementation and Optimization of Algorithms for the Analysis of Biomedical Big Data)

    Big Data analytics poses many challenges to the research community, which has to handle several computational problems related to the vast amount of data. Biomedical data attract increasing interest, with the aim of achieving so-called personalized medicine, where therapy plans are designed around the specific genotype and phenotype of an individual patient; algorithm optimization plays a key role for this purpose. In this work we discuss several topics related to Biomedical Big Data analytics, with special attention to the numerical issues and algorithmic solutions related to them. We introduce a novel feature selection algorithm tailored to omics datasets, proving its efficiency on synthetic and real high-throughput genomic datasets. We tested our algorithm against other state-of-the-art methods, obtaining better or comparable results. We also implemented and optimized different types of deep learning models, testing their efficiency on biomedical image processing tasks. Three novel frameworks for deep learning model development are discussed and used to describe the numerical improvements proposed on various topics. In the first implementation we optimize two super-resolution models, showing their results on NMR images and proving their efficiency in generalization tasks without retraining. The second optimization involves a state-of-the-art object detection neural network architecture, obtaining a significant speedup in computational performance. In the third application we address the femur head segmentation problem on CT images using deep learning algorithms. The last section of this work involves the implementation of a novel biomedical database obtained by harmonizing multiple data sources, which provides network-like relationships between biomedical entities. Data related to diseases and other biological correlates were mined using web-scraping methods, and a novel natural language processing pipeline was designed to maximize the overlap between the different data sources involved in this project.
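    As a generic illustration of the feature-selection setting (a simple correlation-based filter on a synthetic, omics-like matrix; this is not the algorithm proposed in the thesis), see the sketch below.

        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.normal(size=(100, 5000))          # 100 samples, 5000 features (synthetic)
        y = rng.integers(0, 2, size=100)          # binary phenotype label

        # rank features by absolute correlation with the label and keep the top 50
        Xc = (X - X.mean(axis=0)) / X.std(axis=0)
        yc = (y - y.mean()) / y.std()
        scores = np.abs(Xc.T @ yc) / len(y)
        top_k = np.argsort(scores)[::-1][:50]
        X_reduced = X[:, top_k]
        print(X_reduced.shape)                    # (100, 50)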

    Compiler-centric across-stack deep learning acceleration

    Optimizing the deployment of Deep Neural Networks (DNNs) is hard. Despite deep learning approaches increasingly providing state-of-the-art solutions to a variety of difficult problems, such as computer vision and natural language processing, DNNs can be prohibitively expensive, for example in terms of inference time or memory usage. Effective exploration of the design space requires a holistic approach, including a range of topics from machine learning, systems, and hardware. The rapid proliferation of deep learning applications has raised demand for efficient exploration and acceleration of deep learning based solutions. However, managing the range of optimization techniques, as well as how they interact with each other across the stack, is a non-trivial task. A family of emerging specialized compilers for deep learning, tensor compilers, appears to be a strong candidate to help manage the complexity of across-stack optimization choices and enable new approaches. This thesis presents new techniques and explorations of the Deep Learning Acceleration Stack (DLAS), with the perspective that the tensor compiler will increasingly be the center of this stack. First, we motivate the challenges in exploring DLAS by describing the experience of running a perturbation study varying parameters at every layer of the stack. The core of the study is implemented using a tensor compiler, which reduces the complexity of evaluating the wide range of variants, although it still requires significant engineering effort to realize. Next, we develop a new algorithm for grouped convolution, a model optimization technique for which existing solutions provided poor inference-time scaling. We implement and optimize our algorithm using a tensor compiler, outperforming existing approaches by 5.1× on average (arithmetic mean). Finally, we propose a technique, transfer-tuning, to reduce the search time required for automatic tensor compiler code optimization, cutting it by 6.5× on average. The techniques and contributions of this thesis across these interconnected domains demonstrate the exciting potential of tensor compilers to simplify and improve design space exploration for DNNs and their deployment. The outcomes of this thesis open new lines of research that help machine learning developers keep up with the rapidly evolving landscape of neural architectures and hardware.
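    For reference, the sketch below shows the plain definition of grouped convolution mentioned above (1x1 kernels for brevity, NumPy only): input channels are split into groups, and each slice of output channels only sees its own slice of the input. It illustrates the operation the thesis optimizes, not the new algorithm or its tensor-compiler implementation.

        import numpy as np

        def grouped_conv1x1(x, w, groups):
            """x: (C_in, H, W); w: (C_out, C_in // groups); 1x1 kernels."""
            c_in, c_out = x.shape[0], w.shape[0]
            cig, cog = c_in // groups, c_out // groups
            out = np.empty((c_out, *x.shape[1:]))
            for g in range(groups):
                xs = x[g * cig:(g + 1) * cig]            # this group's input channels
                ws = w[g * cog:(g + 1) * cog]            # this group's filters
                out[g * cog:(g + 1) * cog] = np.tensordot(ws, xs, axes=([1], [0]))
            return out

        x = np.random.rand(8, 16, 16)                    # 8 input channels
        w = np.random.rand(4, 2)                         # 4 output channels, 4 groups of 2
        # an equivalent dense 1x1 conv has a block-diagonal weight matrix
        W_dense = np.zeros((4, 8))
        for g in range(4):
            W_dense[g, g * 2:(g + 1) * 2] = w[g]
        dense = np.tensordot(W_dense, x, axes=([1], [0]))
        assert np.allclose(grouped_conv1x1(x, w, groups=4), dense)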