19 research outputs found

    Self Supervision Does Not Help Natural Language Supervision at Scale

    Self supervision and natural language supervision have emerged as two exciting ways to train general purpose image encoders which excel at a variety of downstream tasks. Recent works such as M3AE and SLIP have suggested that these approaches can be effectively combined, but notably their results use small pre-training datasets (<50M samples) and don't effectively reflect the large-scale regime (>100M examples) that is commonly used for these approaches. Here we investigate whether a similar approach can be effective when trained with a much larger amount of data. We find that a combination of two state-of-the-art approaches, masked auto-encoders (MAE) and contrastive language-image pre-training (CLIP), provides a benefit over CLIP when trained on a corpus of 11.3M image-text pairs, but little to no benefit (as evaluated on a suite of common vision tasks) over CLIP when trained on a large corpus of 1.4B images. Our work provides some much needed clarity into the effectiveness (or lack thereof) of self supervision for large-scale image-text training.

    Stop Wasting my FLOPS: Improving the Efficiency of Deep Learning Models

    No full text
    Deep neural networks have completely revolutionized the field of machine learning by achieving state-of-the-art results on various tasks ranging from computer vision to protein folding. However, their application is hindered by their large computational and memory requirements. In this thesis, we propose methods for improving the efficiency of deep neural networks. Firstly, we tackle the sample inefficiency of neural network training with an importance sampling algorithm suitable for deep neural networks. This algorithm allows us to focus computation on datapoints that are going to provide useful gradients for training our models and ignore the ones that will have negligible gradients. We show that our algorithm can improve the performance of various neural networks when compared to uniform sampling under a fixed computational budget. Secondly, we design a model that is suitable for processing large input images with a fraction of the computational and memory requirements of traditional approaches. We achieve this by sampling from a data-dependent attention distribution in order to only process a portion of the input in high resolution. We demonstrate that our model can learn both the attention and the features in an end-to-end fashion using only single image-wise labels for supervision. Subsequently, we shift our attention to transformer architectures and introduce a kernelized formulation for self-attention that reduces its quadratic complexity to linear with respect to the input sequence's length. Furthermore, we uncover the relationship between autoregressive transformers and recurrent neural networks and show that our formulation enables up to 3 orders of magnitude faster autoregressive inference. Finally, we develop clustered attention, a method that can approximate softmax transformers with reduced computation. This is achieved by grouping elements of the input using clustering. We showcase that our formulation provides a better trade-off between performance and computation in comparison to the original transformer architecture. In addition, we demonstrate that clustered attention can approximate pretrained transformer models without any fine-tuning and with minimal loss in performance.
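
The kernelized self-attention mentioned in the abstract above can be illustrated with a minimal sketch. It assumes a non-causal setting and an `elu(x) + 1` feature map; the abstract specifies neither, and all function names here are hypothetical. The sketch only shows why summarizing keys and values once gives the same result as comparing every query with every key under the same kernel.

```python
import math


def feature_map(x):
    """elu(x) + 1, a positive feature map (an assumption, not stated in the abstract)."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def quadratic_attention(queries, keys, values):
    """Reference version: one similarity per (query, key) pair, O(N^2) in length."""
    val_dim = len(values[0])
    out = []
    for q in queries:
        fq = feature_map(q)
        sims = [dot(fq, feature_map(k)) for k in keys]
        norm = sum(sims)
        out.append([sum(s * v[d] for s, v in zip(sims, values)) / norm
                    for d in range(val_dim)])
    return out


def linear_attention(queries, keys, values):
    """Same output, but keys and values are summarized once, O(N) in length."""
    feat_dim = len(keys[0])
    val_dim = len(values[0])
    kv = [[0.0] * val_dim for _ in range(feat_dim)]  # sum_j phi(k_j) v_j^T
    k_sum = [0.0] * feat_dim                         # sum_j phi(k_j)
    for k, v in zip(keys, values):
        fk = feature_map(k)
        for a in range(feat_dim):
            k_sum[a] += fk[a]
            for d in range(val_dim):
                kv[a][d] += fk[a] * v[d]
    out = []
    for q in queries:
        fq = feature_map(q)
        norm = dot(fq, k_sum)
        out.append([sum(fq[a] * kv[a][d] for a in range(feat_dim)) / norm
                    for d in range(val_dim)])
    return out
```

Because the feature map is strictly positive, the normalizer never vanishes, and both functions agree up to floating-point rounding.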

    Improving the Convergence Speed of Deep Neural Networks with Biased Sampling

    No full text

    The Dictatorship of the Character Puan Tirana, the Blind Ruler, in the Novel Negeri Senja by Seno Gumira Ajidarma

    Existing deep architectures cannot operate on very large signals such as megapixel images due to computational and memory constraints. To tackle this limitation, we propose a fully differentiable end-to-end trainable model that samples and processes only a fraction of the full resolution input image. The locations to process are sampled from an attention distribution computed from a low resolution view of the input. We refer to our method as attention sampling and it can process images of several megapixels with a standard single GPU setup. We show that sampling from the attention distribution results in an unbiased estimator of the full model with minimal variance, and we derive an unbiased estimator of the gradient that we use to train our model end-to-end with a normal SGD procedure. This new method is evaluated on three classification tasks, where we show that it reduces computation and memory footprint by an order of magnitude for the same accuracy as classical architectures. We also show that the sampling is consistent and indeed focuses on informative parts of the input images.
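
The core estimator described in this abstract can be sketched in a few lines. This is an illustration under simplifying assumptions: features are scalars, the attention distribution is given rather than learned, and `feature_fn` is a hypothetical stand-in for the expensive high-resolution processing of one location. It shows only that averaging features at locations drawn from the attention distribution estimates the attention-weighted sum over all locations.

```python
import random


def attention_weighted_mean(attention, features):
    """Exact value: sum_i a_i * f_i, which requires processing every location."""
    return sum(a * f for a, f in zip(attention, features))


def sampled_estimate(attention, feature_fn, n_samples, rng):
    """Monte Carlo estimate: process only locations sampled from the attention."""
    locations = rng.choices(range(len(attention)), weights=attention, k=n_samples)
    return sum(feature_fn(i) for i in locations) / n_samples


# Toy data: four locations, attention concentrated on the third one.
attention = [0.05, 0.05, 0.8, 0.1]
features = [0.2, 0.9, 0.6, 0.1]

exact = attention_weighted_mean(attention, features)
estimate = sampled_estimate(attention, lambda i: features[i], 20000, random.Random(0))
```

With few samples the estimate is noisy; the abstract's claim concerns its expectation, which equals the exact weighted sum.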

    Not All Samples Are Created Equal: Deep Learning with Importance Sampling

    No full text
    Deep neural network training spends most of the computation on examples that are properly handled, and could be ignored. We propose to mitigate this phenomenon with a principled importance sampling scheme that focuses computation on "informative" examples, and reduces the variance of the stochastic gradients during training. Our contribution is twofold: first, we derive a tractable upper bound to the per-sample gradient norm, and second, we derive an estimator of the variance reduction achieved with importance sampling, which enables us to switch it on when it will result in an actual speedup. The resulting scheme can be used by changing a few lines of code in a standard SGD procedure, and we demonstrate experimentally, on image classification, CNN fine-tuning, and RNN training, that for a fixed wall-clock time budget, it provides a reduction of the train losses of up to an order of magnitude and a relative improvement of test errors between 5% and 17%.
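
The reweighting that keeps importance-sampled SGD unbiased can be sketched as follows. The per-sample scores are a hypothetical stand-in for the gradient-norm upper bound mentioned in the abstract, scalar values stand in for per-sample gradients, and the function names are illustrative only. The variance-reduction estimator that decides when to switch sampling on is not reproduced here.

```python
import random


def sampling_probabilities(scores):
    """Normalize strictly positive per-sample scores into a distribution."""
    if any(s <= 0 for s in scores):
        raise ValueError("every score must be strictly positive")
    total = sum(scores)
    return [s / total for s in scores]


def importance_weights(probs):
    """Weights 1 / (N * p_i) that compensate for non-uniform sampling."""
    n = len(probs)
    return [1.0 / (n * p) for p in probs]


def sample_batch(probs, batch_size, rng):
    """Draw sample indices with replacement according to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=batch_size)


def expected_estimate(values, probs):
    """Exact expectation of the reweighted single-sample estimator."""
    weights = importance_weights(probs)
    return sum(p * w * v for p, w, v in zip(probs, weights, values))
```

Sampling index `i` with probability `p_i` and scaling its contribution by `1 / (N * p_i)` leaves the expected value equal to the uniform average, so high-score examples can be visited more often without biasing the update.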