433 research outputs found

    High throughput spatial convolution filters on FPGAs

    Digital signal processing (DSP) on field-programmable gate arrays (FPGAs) has long been appealing because the inherent parallelism in these computations can be easily exploited to accelerate such algorithms. FPGAs have evolved significantly to further enhance the mapping of these algorithms, including additional hard blocks such as the DSP blocks found in modern FPGAs. Although these DSP blocks can offer more efficient mapping of DSP computations, they are primarily designed for 1-D filter structures. We present a study on spatial convolutional filter implementations on FPGAs, optimizing around the structure of the DSP blocks to offer high throughput while maintaining the coefficient flexibility that other published architectures usually sacrifice. We show that it is possible to implement large filters for large 4K resolution image frames at frame rates of 30–60 FPS, while maintaining functional flexibility.
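To give a feel for the throughput target in the abstract above, the following sketch works out the pixel rate and multiply-accumulate (MAC) rate for 4K frames. It is only illustrative arithmetic: the 3840×2160 frame size and the 7×7 kernel are assumptions, since the abstract says only "4K" and "large filters", and the function names are hypothetical.

```python
def pixel_rate(width, height, fps):
    """Pixels that must be produced per second for a given frame size and rate."""
    return width * height * fps


def macs_per_second(width, height, fps, kernel_size):
    """Multiply-accumulates per second for a square kernel_size x kernel_size filter,
    assuming one full kernel evaluation per output pixel."""
    return pixel_rate(width, height, fps) * kernel_size * kernel_size


# Assumed values: UHD 4K frame (3840x2160) and a hypothetical 7x7 kernel.
WIDTH, HEIGHT, KERNEL = 3840, 2160, 7

for fps in (30, 60):
    rate = pixel_rate(WIDTH, HEIGHT, fps)
    macs = macs_per_second(WIDTH, HEIGHT, fps, KERNEL)
    print(f"{fps} FPS: {rate / 1e6:.1f} Mpixel/s, {macs / 1e9:.2f} GMAC/s")
```

Under these assumptions, a design that emits one output pixel per clock cycle would need a clock of roughly 500 MHz at 60 FPS, which is why processing several pixels in parallel is a natural design choice.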

    SBNet: Sparse Blocks Network for Fast Inference

    Conventional deep convolutional neural networks (CNNs) apply convolution operators uniformly in space across all feature maps for hundreds of layers, which incurs a high computational cost for real-time applications. For many problems such as object detection and semantic segmentation, we are able to obtain a low-cost computation mask, either from a priori problem knowledge or from a low-resolution segmentation network. We show that such computation masks can be used to reduce computation in the high-resolution main network. Variants of sparse activation CNNs have previously been explored on small-scale tasks and showed no degradation in object classification accuracy, but they often measured gains in theoretical FLOPs without realizing a practical speed-up compared to highly optimized dense convolution implementations. In this work, we leverage the sparsity structure of computation masks and propose a novel tiling-based sparse convolution algorithm. We verified the effectiveness of our sparse CNN on LiDAR-based 3D object detection, and we report significant wall-clock speed-ups compared to dense convolution without noticeable loss of accuracy. Comment: 10 pages, CVPR 201
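The idea of a tiling-based sparse convolution driven by a computation mask can be sketched as follows. This is a minimal single-channel NumPy illustration, not the authors' implementation: the function names, the tile size, and the per-tile Python loop are assumptions made for clarity, and the kernel is assumed to have odd side lengths.

```python
import numpy as np


def conv2d_valid(x, k):
    """Plain 'valid' 2-D cross-correlation of a single-channel image."""
    windows = np.lib.stride_tricks.sliding_window_view(x, k.shape)
    return np.einsum("ijkl,kl->ij", windows, k)


def sparse_tiled_conv(x, k, mask, tile=4):
    """Convolve only the tiles that contain at least one active mask pixel.

    Returns the output (zero outside active tiles) and the list of
    top-left corners of the active tiles.
    """
    kh, kw = k.shape
    ph, pw = kh // 2, kw // 2              # halo size (odd kernel assumed)
    H, W = x.shape
    xp = np.pad(x, ((ph, ph), (pw, pw)))   # zero padding for 'same' output
    out = np.zeros((H, W), dtype=float)

    # Reduce the pixel mask to a list of active tiles.
    active = [(ty, tx)
              for ty in range(0, H, tile)
              for tx in range(0, W, tile)
              if mask[ty:ty + tile, tx:tx + tile].any()]

    for ty, tx in active:
        h, w = min(tile, H - ty), min(tile, W - tx)
        # Gather: the tile plus its halo, taken from the padded input.
        patch = xp[ty:ty + h + 2 * ph, tx:tx + w + 2 * pw]
        # Dense convolution on the small patch, then scatter into the output.
        out[ty:ty + h, tx:tx + w] = conv2d_valid(patch, k)
    return out, active


# Hypothetical example: an 8x8 image where only one tile is marked active.
rng = np.random.default_rng(0)
image = rng.normal(size=(8, 8))
kernel = rng.normal(size=(3, 3))
mask = np.zeros((8, 8), dtype=bool)
mask[1, 5] = True
result, tiles = sparse_tiled_conv(image, kernel, mask)
print("active tiles:", tiles)
```

A practical speed-up depends on gathering all active tiles into one batch and running a single optimized dense convolution over them; the loop here only shows the gather, convolve, and scatter structure.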

    Efficient Hardware Implementation of Deep Learning Networks Based on the Convolutional Neural Network

    Image classification, speech processing, autonomous driving, and medical diagnosis have made the adoption of Deep Neural Networks (DNNs) mainstream. Many deep networks such as AlexNet, GoogleNet, ResidualNet, MobileNet, YOLOv3 and Transformers have achieved immense success and popularity. However, implementing these deep and complex networks in hardware is a challenging feat. The growing demand for DNN applications in mobile devices and data centers has led researchers to explore application-specific hardware accelerators for DNNs. There have been numerous hardware- and software-based solutions to improve DNN throughput, latency, performance and accuracy. Any solution for hardware acceleration needs to optimize in a space confined by these metrics. Hardware acceleration of DNNs is a highly effective and viable solution for running them on mobile devices. The power of DNNs is now available at the edge in a compact and power-efficient form factor because of hardware acceleration. In this thesis, we introduce a novel architecture that uses a generalized method called Single Input Partial Product 2-Dimensional Convolution (SIPP2D Convolution), which calculates a 2-D convolution in a fast and expedient manner. We present the exploration designs that have culminated in SIPP2D and emphasize its benefits. The SIPP2D architecture prevents the re-fetching of input weights for the calculation of partial products. It can calculate the output for any input size and kernel size with low memory traffic while maintaining a low latency and high throughput compared to other popular techniques. In addition to being compatible with any input and kernel size, the SIPP2D architecture can be modified to support any allowable stride. We describe the data flow and algorithmic modifications to SIPP2D that extend its capabilities to accommodate multi-stride convolutions. 
Supporting multi-stride convolutions is an essential feature addition to the SIPP2D architecture, increasing its versatility and network-agnostic character for convolutional DNNs. Along with architectural explorations, we have also performed research in the area of model optimization. It is widely understood that any change at the algorithmic level of the network pays significant dividends at the hardware level. Compression and optimization techniques such as pruning and quantization help reduce the size of the model while maintaining the accuracy at an acceptable level. Thus, combining techniques such as channel pruning with SIPP2D can only boost its performance. In this thesis, we examine the performance of channel-pruned SIPP2D compared to other compressed models. Traditionally, quantization of weights and inputs is used to reduce memory transfer and power consumption. However, quantizing the outputs of layers can be a challenge since the output of each layer changes with the input. In our research, we use quantization on the output of each layer for AlexNet and VGGNet-16 to analyze the effect it has on accuracy. We use the Signal-to-Quantization-Noise Ratio (SQNR) to empirically determine the integer length (IL) as well as the fractional length (FL) for the fixed-point precision that yields the lowest SQNR and highest accuracy. Based on our observations, we can report that accuracy is sensitive to fractional length as well as integer length. For AlexNet, we observe deterioration in accuracy as the word length (WL) decreases. The Top-5 accuracy goes from 77% for floating-point precision to 56% for a WL of 12 and FL of 8. The results are similar in the case of VGGNet-16. The Top-5 accuracy for VGGNet-16 decreases from 82% for floating point to 30% for a WL of 12 and FL of 8. In addition to the small word length, we observe the accuracy to be highly dependent on the integer length as well as the fractional length. 
We have also analyzed the loss that remains after post-quantization retraining. We use polynomial fitting to relate the fractional length to the drop in accuracy that persists after retraining a quantized network. In summary, the combination of the enhanced SIPP2D architecture with compression techniques such as channel pruning and quantization is highly advantageous and conducive to widespread adoption. The SIPP2D architecture, with its flexible data flow and algorithmic modifications to support multi-stride convolutions, offers a powerful and versatile framework for deep neural networks.
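The fixed-point terms used in the abstract above (word length WL, fractional length FL, and SQNR) can be illustrated with a short sketch. It assumes a signed two's-complement format in which WL includes the sign bit (so the integer length is WL minus FL), round-to-nearest, and saturation on overflow. None of these details are stated in the abstract, and the function names are hypothetical.

```python
import numpy as np


def quantize_fixed_point(x, word_length, frac_length):
    """Quantize x to signed fixed point with the given word and fractional lengths.

    Assumed format: two's complement, round-to-nearest, saturating.
    """
    scale = 2.0 ** frac_length
    qmin = -(2 ** (word_length - 1))
    qmax = 2 ** (word_length - 1) - 1
    q = np.clip(np.round(x * scale), qmin, qmax)   # integer code, saturated
    return q / scale                               # back to real values


def sqnr_db(x, xq):
    """Signal-to-quantization-noise ratio in dB between x and its quantized copy."""
    noise = x - xq
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum(noise ** 2))


# Hypothetical layer outputs: 1000 samples from a standard normal distribution.
rng = np.random.default_rng(0)
activations = rng.normal(size=1000)

for wl, fl in [(16, 8), (12, 8), (12, 4)]:
    xq = quantize_fixed_point(activations, wl, fl)
    print(f"WL={wl}, FL={fl}: SQNR = {sqnr_db(activations, xq):.1f} dB")
```

Sweeping WL and FL in this way on recorded layer outputs is one possible way to explore the trade-off the abstract describes: a small FL coarsens the resolution, while a small integer length causes large values to saturate.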

    Cross-Modal Learning with 3D Deformable Attention for Action Recognition

    An important challenge in vision-based action recognition is the embedding of spatiotemporal features with two or more heterogeneous modalities into a single feature. In this study, we propose a new 3D deformable transformer for action recognition with adaptive spatiotemporal receptive fields and a cross-modal learning scheme. The 3D deformable transformer consists of three attention modules: 3D deformability, local joint stride, and temporal stride attention. The two cross-modal tokens are input into the 3D deformable attention module to create a cross-attention token with a reflected spatiotemporal correlation. Local joint stride attention is applied to spatially combine attention and pose tokens. Temporal stride attention temporally reduces the number of input tokens in the attention module and supports temporal expression learning without the simultaneous use of all tokens. The deformable transformer iterates L times and combines the last cross-modal token for classification. The proposed 3D deformable transformer was tested on the NTU60, NTU120, FineGYM, and Penn Action datasets, and showed results better than or similar to pre-trained state-of-the-art methods even without a pre-training process. In addition, by visualizing important joints and correlations during action recognition through spatial joint and temporal stride attention, we show the potential for explainable action recognition. Comment: 10 pages, 8 figures

    Performance Optimization of Memory Intensive Applications on FPGA Accelerator

    The abstract is in the attachment.

    Pixel-level semantic understanding of ophthalmic images and beyond

    Computer-assisted semantic image understanding constitutes the substrate of applications that range from biomarker detection to intraoperative guidance or street scene understanding for self-driving systems. This PhD thesis is on the development of deep learning-based, pixel-level, semantic segmentation methods for medical and natural images. For vessel segmentation in OCT-A, a method comprising iterative refinement of the extracted vessel maps and an auxiliary loss function that penalizes structural inaccuracies is proposed and tested on data captured under real clinical conditions comprising various pathological cases. Ultimately, the presented method enables the extraction of a detailed vessel map of the retina, with potential applications to diagnostics or intraoperative localization. Furthermore, for scene segmentation in cataract surgery, class imbalance is identified as the major challenge among several factors. Subsequently, a method addressing it is proposed, achieving state-of-the-art performance on a challenging public dataset. Accurate semantic segmentation in this domain can be used to monitor interactions between tools and anatomical parts for intraoperative guidance and safety. Finally, this thesis proposes a novel contrastive learning framework for supervised semantic segmentation that aims to improve the discriminative power of features in deep neural networks. The proposed approach leverages a contrastive loss function applied both at multiple model layers and across them. Importantly, the proposed framework is easy to combine with various model architectures and is experimentally shown to significantly improve performance on both natural and medical domains.