53 research outputs found

    Experts Weights Averaging: A New General Training Scheme for Vision Transformers

    Structural re-parameterization is a general training scheme for Convolutional Neural Networks (CNNs) that improves performance without increasing inference cost. As Vision Transformers (ViTs) gradually surpass CNNs in various visual tasks, one may ask whether a training scheme exists specifically for ViTs that also improves performance without increasing inference cost. Recently, Mixture-of-Experts (MoE) has attracted increasing attention, as it can efficiently scale up the capacity of Transformers at a fixed cost through sparsely activated experts. Considering that MoE can also be viewed as a multi-branch structure, can we utilize MoE to implement a ViT training scheme similar to structural re-parameterization? In this paper, we answer these questions affirmatively with a new general training strategy for ViTs. Specifically, we decouple the training and inference phases of ViTs. During training, we replace some Feed-Forward Networks (FFNs) of the ViT with specially designed, more efficient MoEs that assign tokens to experts by random uniform partition, and perform Experts Weights Averaging (EWA) on these MoEs at the end of each iteration. After training, we convert each MoE into an FFN by averaging the experts, transforming the model back into the original ViT for inference. We further provide a theoretical analysis to show why and how it works. Comprehensive experiments across various 2D and 3D visual tasks, ViT architectures, and datasets validate the effectiveness and generalizability of the proposed training scheme. In addition, our training scheme can be applied to improve performance when fine-tuning ViTs. Last but equally important, the proposed EWA technique can significantly improve the effectiveness of naive MoE on various small 2D visual datasets and 3D visual tasks. Comment: 12 pages, 2 figures
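
    The two mechanical steps named in this abstract, random uniform token partition and conversion of an MoE into a single FFN by averaging the experts, can be sketched roughly as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names, the list-of-lists weight layout, and the toy values are assumptions, and the per-iteration EWA update rule is not shown because the abstract does not specify it.

```python
import random


def random_uniform_partition(num_tokens, num_experts, rng):
    """Shuffle token indices and deal them out evenly across experts.

    Hypothetical helper: the abstract only says tokens are assigned
    "by random uniform partition"; this is one simple way to do that.
    """
    indices = list(range(num_tokens))
    rng.shuffle(indices)
    # Expert e receives every num_experts-th index starting at position e.
    return [indices[e::num_experts] for e in range(num_experts)]


def average_experts(expert_weights):
    """Average same-shaped weight matrices element-wise into one matrix."""
    n = len(expert_weights)
    rows = len(expert_weights[0])
    cols = len(expert_weights[0][0])
    return [
        [sum(w[r][c] for w in expert_weights) / n for c in range(cols)]
        for r in range(rows)
    ]


rng = random.Random(0)
groups = random_uniform_partition(num_tokens=8, num_experts=4, rng=rng)

# Two toy experts, each a 2x2 weight matrix (illustrative values only).
experts = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[3.0, 4.0], [5.0, 6.0]],
]
ffn_weight = average_experts(experts)  # [[2.0, 3.0], [4.0, 5.0]]
```

    Because the averaged matrix has the same shape as a single expert, the converted layer costs the same at inference as the FFN it replaced, which is the property the abstract emphasizes.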

    Streaming CTR Prediction: Rethinking Recommendation Task for Real-World Streaming Data

    The Click-Through Rate (CTR) prediction task is critical in industrial recommender systems, where models are usually deployed on dynamic streaming data in practical applications. Such streaming data in real-world recommender systems pose many challenges, such as distribution shift, temporal non-stationarity, and systematic biases, which make it difficult to train and use recommendation models. However, most existing studies approach CTR prediction as a classification task on static datasets, assuming that the train and test sets are independent and identically distributed (the i.i.d. assumption). To bridge this gap, we formulate the CTR prediction problem in streaming scenarios as a Streaming CTR Prediction task. Accordingly, we propose dedicated benchmark settings and metrics to evaluate and analyze the performance of models on streaming data. To better understand the differences from traditional CTR prediction tasks, we delve into the factors that may affect model performance, such as parameter scale, normalization, and regularization. The results reveal the existence of the "streaming learning dilemma", whereby the same factor may have different effects on model performance in the static and streaming scenarios. Based on these findings, we propose two simple but inspiring methods (i.e., tuning key parameters and exemplar replay) that significantly improve the effectiveness of CTR models in the new streaming scenario. We hope our work will inspire further research on streaming CTR prediction and help improve the robustness and adaptability of recommender systems.

    An End-to-End Deep Neural Architecture for Optical Character Verification and Recognition in Retail Food Packaging

    Retail food packages carry various types of information, including the food product name, ingredients list and use by date. The correct recognition and coding of use by dates is especially critical in ensuring proper distribution of the product to the market and eliminating potential health risks caused by mislabelling. Mislabelling can have a major negative effect on the health of consumers and consequently raise legal issues for suppliers. In this work, an end-to-end architecture composed of a dual deep neural network based system is proposed for automatic recognition of use by dates in food package photos. The system includes: a Global level convolutional neural network (CNN) for high-level food package image quality evaluation (blurry/clear/missing use by date statistics); a Local level fully convolutional network (FCN) for use by date ROI localisation. After ROI extraction, the date characters are segmented and recognised. The proposed framework is the first to employ deep neural networks for end-to-end automatic use by date recognition in retail packaging photos. It is capable of achieving very good levels of performance on all the aforementioned tasks, despite the varied textual/pictorial content complexity found in food packaging design.

    Bi-directional block self-attention for fast and memory-efficient sequence modeling

    © Learning Representations, ICLR 2018 - Conference Track Proceedings. All rights reserved. Recurrent neural networks (RNN), convolutional neural networks (CNN) and self-attention networks (SAN) are commonly used to produce context-aware representations. RNN can capture long-range dependency but is hard to parallelize and not time-efficient. CNN focuses on local dependency but does not perform well on some tasks. SAN can model both kinds of dependency via highly parallelizable computation, but its memory requirement grows rapidly with sequence length. In this paper, we propose a model, called "bi-directional block self-attention network (Bi-BloSAN)", for RNN/CNN-free sequence encoding. It requires as little memory as RNN but has all the merits of SAN. Bi-BloSAN splits the entire sequence into blocks, applies an intra-block SAN to each block to model local context, and then applies an inter-block SAN to the outputs of all blocks to capture long-range dependency. Thus, each SAN only needs to process a short sequence, and only a small amount of memory is required. Additionally, we use feature-level attention to handle the variation of contexts around the same word, and use forward/backward masks to encode temporal order information. On nine benchmark datasets for different NLP tasks, Bi-BloSAN achieves or improves upon state-of-the-art accuracy, and shows a better efficiency-memory trade-off than existing RNN/CNN/SAN models.
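
    A rough way to see why block splitting saves memory is to count pairwise attention scores. The sketch below is only back-of-the-envelope arithmetic under stated assumptions: equal-sized blocks, one score per token pair, and no accounting for the feature-level attention or masks mentioned in the abstract. The function name and the example sizes are illustrative, not taken from the paper.

```python
def attention_score_counts(seq_len, block_len):
    """Return (full, blocked) pairwise-score counts for one sequence.

    full:    one self-attention over the whole sequence.
    blocked: intra-block attention within each block, plus one
             inter-block attention over a single summary per block.
    """
    if seq_len % block_len != 0:
        raise ValueError("seq_len must be a multiple of block_len")
    num_blocks = seq_len // block_len
    full = seq_len * seq_len
    intra = num_blocks * block_len * block_len
    inter = num_blocks * num_blocks
    return full, intra + inter


full, blocked = attention_score_counts(seq_len=1024, block_len=32)
print(full, blocked)  # 1048576 33792
```

    With these sizes the blocked variant stores about 3% of the scores of full self-attention; the real saving depends on details the abstract does not give.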

    Optimising algorithm and hardware for deep neural networks on FPGAs

    This thesis proposes novel algorithm and hardware optimisation approaches to accelerate Deep Neural Networks (DNNs), including both Convolutional Neural Networks (CNNs) and Bayesian Neural Networks (BayesNNs). The first contribution of this thesis is an adaptable and reconfigurable hardware design to accelerate CNNs. By analysing the computational patterns of different CNNs, a unified hardware architecture is proposed for both 2-dimensional and 3-dimensional CNNs. The accelerator is also designed with runtime adaptability, adopting different parallelism strategies for different convolutional layers at runtime. The second contribution of this thesis is a novel neural network architecture and hardware design co-optimisation approach, which improves the performance of CNNs at both the algorithm and hardware levels. Our proposed three-phase co-design framework decouples network training from design space exploration, which significantly reduces the time cost of the co-optimisation process. The third contribution of this thesis is an algorithmic and hardware co-optimisation framework for accelerating BayesNNs. At the algorithmic level, three categories of structured sparsity are explored to reduce the computational complexity of BayesNNs. At the hardware level, we propose a novel hardware architecture that exploits the structured sparsity of BayesNNs. Both algorithmic and hardware optimisations are jointly applied to push the performance limit. Open Access

    AELA-DLSTMs: Attention-enabled and location-aware double LSTMs for aspect-level sentiment classification

    Aspect-level sentiment classification, a fine-grained task in sentiment classification that aims to extract sentiment polarity from opinions towards a specific aspect word, has seen tremendous improvements in recent years. There are three key factors for aspect-level sentiment classification: contextual semantic information towards aspect words, correlations between aspect words and their context words, and location information of context words with regard to aspect words. In this paper, two models named AE-DLSTMs (Attention-Enabled Double LSTMs) and AELA-DLSTMs (Attention-Enabled and Location-Aware Double LSTMs) are proposed for aspect-level sentiment classification. AE-DLSTMs takes full advantage of the DLSTMs (Double LSTMs), which can capture the contextual semantic information in both forward and backward directions towards aspect words. Meanwhile, a novel method for generating attention weights that combines aspect words with their contextual semantic information is designed so that those weights can make better use of the correlations between aspect words and their context words. In addition, we observe that context words at different distances or in different directions from aspect words contribute differently to sentiment polarity. Building on AE-DLSTMs, AELA-DLSTMs incorporates the location information of context words by assigning them different weights to improve accuracy. Experiments are conducted on two English datasets and one Chinese dataset. The experimental results confirm that our models outperform all the baseline models on all datasets, improving accuracy by 1.67 to 4.77 percent depending on the dataset.

    Onboard ship detection and pose estimation with deep learning


    Learning Models For Corrupted Multi-Dimensional Data: Fundamental Limits And Algorithms

    Developing machine learning models for unstructured multi-dimensional datasets, such as datasets with unreliable labels and noisy multi-dimensional signals with or without missing information, has become a central necessity. We are not always fortunate enough to get noise-free datasets for developing classification and representation models. Though a number of techniques are available to deal with noisy datasets, these methods do not exploit the multi-dimensional structures of the signals, which could be used to improve the overall classification and representation performance of the model. In this thesis, we develop a Kronecker-structure (K-S) subspace model that exploits the multi-dimensional structure of the signal. First, we study the classification performance of K-S subspace models in two asymptotic regimes: when the signal dimensions go to infinity and when the noise power tends to zero. We characterize the misclassification probability in terms of diversity order and we derive an exact expression for the diversity order. We further derive a tighter bound on misclassification probability in terms of the pairwise geometry of the subspaces. The proposed scheme is optimal in most signal dimension regimes, except in one regime where the signal dimension is less than twice the subspace dimension; however, hitting such a regime is very rare in practice. We empirically show that the classification performance of K-S subspace models agrees with the diversity order analysis. We also develop an algorithm, Kronecker-Structured Learning of Discriminative Dictionaries (K-SLD2), for fast and compact K-S subspace learning for better classification and representation of multi-dimensional signals. We show that the K-SLD2 algorithm balances compact signal representation and good classification performance on synthetic and real-world datasets.
Next, we develop a scheme to detect whether a given multi-dimensional signal with missing information lies on a given K-S subspace. We find that under some mild incoherence conditions we must observe ��(��1 log ��1) number of rows and ��(��2 log ��2) number of columns in order to detect the K-S subspace. To account for unreliable labels in datasets, we present Nonlinear, Noise-aware, Quasiclustering (NNAQC), a method for learning deep convolutional networks from datasets corrupted by unknown label noise. We append a nonlinear noise model to a standard convolutional network, which is learned in tandem with the parameters of the network. Further, we train the network using a loss function that encourages the clustering of training images. We argue that the nonlinear noise model, while not rigorous as a probabilistic model, results in a more effective denoising operator during backpropagation. We evaluate the performance of NNAQC on label noise artificially injected into the MNIST, CIFAR-10, CIFAR-100, and ImageNet datasets and on the large-scale Clothing1M dataset with inherent label noise. We show that on all these datasets, NNAQC provides significantly improved classification performance over the state of the art and is robust to the amount of label noise and the training samples.

    Enhancing Sharpness-Aware Optimization Through Variance Suppression

    Sharpness-aware minimization (SAM) has well documented merits in enhancing generalization of deep neural networks, even without sizable data augmentation. Embracing the geometry of the loss function, where neighborhoods of 'flat minima' heighten generalization ability, SAM seeks 'flat valleys' by minimizing the maximum loss caused by an adversary perturbing parameters within the neighborhood. Although critical to account for sharpness of the loss function, such an 'over-friendly adversary' can curtail the outmost level of generalization. The novel approach of this contribution fosters stabilization of adversaries through variance suppression (VaSSO) to avoid such friendliness. VaSSO's provable stability safeguards its numerical improvement over SAM in model-agnostic tasks, including image classification and machine translation. In addition, experiments confirm that VaSSO endows SAM with robustness against high levels of label noise. Comment: Accepted to NeurIPS 202
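
    The phrase "minimizing the maximum loss caused by an adversary perturbing parameters within the neighborhood" corresponds to a min-max objective that is commonly written as below. The symbols are assumptions introduced here for illustration, since the abstract does not fix any notation: $w$ for the model parameters, $L$ for the training loss, $\epsilon$ for the adversarial perturbation, and $\rho$ for the neighborhood radius.

```latex
\min_{w} \; \max_{\|\epsilon\|_2 \le \rho} \; L(w + \epsilon)
```

    The inner maximization plays the role of the adversary, and the outer minimization seeks parameters whose whole neighborhood has low loss, which is one way to read "flat valleys".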

    Modelling Uncertainty in Black-box Classification Systems

    [eng] Currently, thanks to the Big Data boom, the excellent results obtained by deep learning models and the strong digital transformation experienced over the last years, many companies have decided to incorporate machine learning models into their systems. Some companies have detected this opportunity and are making a portfolio of artificial intelligence services available to third parties in the form of application programming interfaces (APIs). Developers then include calls to these APIs to incorporate AI functionalities in their products. Although this option saves time and resources, in most cases these APIs are offered as black boxes, the details of which are unknown to their clients. The complexity of such products typically leads to a lack of control and knowledge of the internal components, which, in turn, can lead to uncontrolled risks. Therefore, it is necessary to develop methods capable of evaluating the performance of these black boxes when applied to a specific application. In this work, we present a robust uncertainty-based method for evaluating the performance of both probabilistic and categorical classification black-box models, in particular APIs, that enriches the predictions obtained with an uncertainty score. This uncertainty score enables the detection of inputs with very confident but erroneous predictions while protecting against out-of-distribution data points when deploying the model in a production setting. In the first part of the thesis, we develop a thorough review of the concept of uncertainty, focusing on the uncertainty of classification systems. We review the existing related literature, describing the different approaches for modelling this uncertainty, its application to different use cases and some of its desirable properties. Next, we introduce the proposed method for modelling uncertainty in black-box settings.
Moreover, in the last chapters of the thesis, we showcase the method applied to different domains, including NLP and computer vision problems. Finally, we include two real-life applications of the method: classification of overqualification in job descriptions and readability assessment of texts. [spa] The thesis proposes a method for calculating the uncertainty associated with the predictions of APIs or external libraries for classification systems.