Self Supervision Does Not Help Natural Language Supervision at Scale
Self-supervision and natural language supervision have emerged as two
exciting ways to train general-purpose image encoders that excel at a variety
of downstream tasks. Recent works such as M3AE and SLIP have suggested that
these approaches can be effectively combined, but, notably, their results
use small pre-training datasets (&lt;50M samples) and do not reflect the
large-scale regime (&gt;100M examples) in which these approaches are commonly
used. Here we investigate whether a similar approach can be effective when
trained with a much larger amount of data. We find that a combination of two
state-of-the-art approaches, masked auto-encoders (MAE) and contrastive
language-image pre-training (CLIP), provides a benefit over CLIP when trained
on a corpus of 11.3M image-text pairs, but little to no benefit (as evaluated
on a suite of common vision tasks) when trained on a large corpus of 1.4B
images. Our work provides some much-needed clarity into the effectiveness (or
lack thereof) of self-supervision for large-scale image-text training.
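The combined objective pairs a CLIP-style contrastive loss with an MAE-style reconstruction loss on masked patches. A minimal NumPy sketch, assuming pre-computed (L2-normalized) embeddings and patch reconstructions; the trade-off weight `lam` and the exact loss form are illustrative assumptions, not the paper's precise formulation:

```python
import numpy as np

def clip_loss(img_emb, txt_emb, t=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings.
    Matched pairs sit on the diagonal of the similarity matrix."""
    logits = img_emb @ txt_emb.T / t
    labels = np.arange(len(logits))
    def ce(z):
        z = z - z.max(axis=1, keepdims=True)          # numerical stability
        logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()           # diagonal = positives
    return 0.5 * (ce(logits) + ce(logits.T))

def combined_loss(img_emb, txt_emb, recon, target, mask, lam=1.0):
    """CLIP objective plus MAE reconstruction MSE on masked patches.
    lam is a hypothetical trade-off weight, not taken from the paper."""
    mae = ((recon - target) ** 2)[mask].mean()
    return clip_loss(img_emb, txt_emb) + lam * mae
```

With perfectly aligned embeddings the contrastive term is near zero, and with perfect reconstruction the MAE term vanishes, so the combined loss reduces to the CLIP loss alone.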
Stop Wasting my FLOPS: Improving the Efficiency of Deep Learning Models
Deep neural networks have completely revolutionized the field of machine
learning by achieving state-of-the-art results on various tasks ranging from
computer vision to protein folding. However, their application is hindered by
their large computational and memory requirements. In this thesis, we propose
methods for improving the efficiency of deep neural networks.
Firstly, we tackle the sample inefficiency of neural network training with an
importance sampling algorithm suitable for deep neural networks. This algorithm
allows us to focus computation on datapoints that are going to provide useful
gradients for training our models and ignore the ones that will have negligible
gradients. We show that our algorithm can improve the performance of various
neural networks when compared to uniform sampling under a fixed computational
budget.
Secondly, we design a model that is suitable for processing large input images
with a fraction of the computational and memory requirements of traditional
approaches. We achieve this by sampling from a data-dependent attention
distribution in order to only process a portion of the input in high
resolution. We demonstrate that our model can learn both the attention and the
features in an end-to-end fashion using only image-level labels for
supervision.
Subsequently, we shift our attention to transformer architectures and introduce
a kernelized formulation for self-attention that reduces its quadratic
complexity to linear with respect to the input sequence's length. Furthermore,
we uncover the relationship between autoregressive transformers and recurrent
neural networks and show that our formulation enables up to 3 orders of
magnitude faster autoregressive inference.
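The kernelized formulation replaces softmax(QKᵀ)V with φ(Q)(φ(K)ᵀV) for a non-negative feature map φ, so the φ(K)ᵀV product is computed once and the cost becomes linear in sequence length. A minimal single-head NumPy sketch, using the φ(x) = elu(x) + 1 feature map:

```python
import numpy as np

def elu_feature_map(x):
    # phi(x) = elu(x) + 1: a positive feature map, so attention weights >= 0
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V, eps=1e-6):
    """O(N) attention: softmax(Q K^T) V replaced by phi(Q) (phi(K)^T V)."""
    Qp, Kp = elu_feature_map(Q), elu_feature_map(K)  # (N, d) each
    KV = Kp.T @ V                                    # (d, d_v), computed once
    Z = Qp @ Kp.sum(axis=0)                          # per-query normalizer, (N,)
    return (Qp @ KV) / (Z[:, None] + eps)

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(8, 4)) for _ in range(3))
out = linear_attention(Q, K, V)
assert out.shape == V.shape
```

Because φ(K)ᵀV and the normalizer are running sums over positions, the same computation can be maintained incrementally, which is what yields the recurrent view and the fast autoregressive inference mentioned above.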
Finally, we develop clustered attention, a method that can approximate softmax
transformers with reduced computation. This is achieved by grouping elements of
the input using clustering. We showcase that our formulation provides a better
trade-off between performance and computation in comparison to the original
transformer architecture. In addition, we demonstrate that clustered attention
can approximate pretrained transformer models without any fine-tuning and with
minimal loss in performance.
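The grouping idea can be sketched by clustering the queries with k-means and computing one attention row per centroid, which is then shared by all cluster members. This is a simplified illustration of the approach under those assumptions, not the full method (refinements such as correcting the top attended keys are omitted):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def kmeans(X, C, iters=10, seed=0):
    """Tiny k-means for illustration: returns cluster labels and centroids."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), C, replace=False)]
    for _ in range(iters):
        dist = ((X[:, None, :] - centroids[None]) ** 2).sum(-1)  # (N, C)
        labels = dist.argmin(axis=1)
        for c in range(C):
            if (labels == c).any():
                centroids[c] = X[labels == c].mean(axis=0)
    return labels, centroids

def clustered_attention(Q, K, V, C=4):
    """Approximate softmax attention with one attention row per query cluster."""
    labels, centroids = kmeans(Q, C)
    A = softmax(centroids @ K.T)   # (C, N): attention from each centroid
    out_c = A @ V                  # (C, d_v): one output per cluster
    return out_c[labels]           # broadcast centroid output to its members
```

Full attention costs O(N²); with C clusters this sketch costs O(C·N) for the attention itself, which is the performance/computation trade-off the abstract describes.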
The dictatorship of the character Puan Tirana, the blind ruler, in the novel Negeri Senja by Seno Gumira Ajidarma
Existing deep architectures cannot operate on very large signals such as megapixel images due to computational and memory constraints. To tackle this limitation, we propose a fully differentiable end-to-end trainable model that samples and processes only a fraction of the full resolution input image. The locations to process are sampled from an attention distribution computed from a low resolution view of the input. We refer to our method as attention sampling and it can process images of several megapixels with a standard single GPU setup. We show that sampling from the attention distribution results in an unbiased estimator of the full model with minimal variance, and we derive an unbiased estimator of the gradient that we use to train our model end-to-end with a normal SGD procedure. This new method is evaluated on three classification tasks, where we show that it allows us to reduce the computation and memory footprint by an order of magnitude for the same accuracy as classical architectures. We also show the consistency of the sampling, which indeed focuses on informative parts of the input images.
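The unbiased-estimator idea can be sketched in a few lines: drawing locations i ~ a from the attention distribution and averaging their features is a Monte Carlo estimate of the full weighted sum Σᵢ aᵢ f(xᵢ), without evaluating every location. A minimal NumPy sketch, sampling with replacement for simplicity (the paper's scheme also covers sampling without replacement):

```python
import numpy as np

def attention_sampling(attention, features, K, rng):
    """Unbiased Monte Carlo estimate of sum_i a_i * f_i.

    attention: (N,) probabilities over locations (sums to 1)
    features:  (N, d) per-location features (in practice, computed only
               for the K sampled locations, which is where the savings come from)
    """
    idx = rng.choice(len(attention), size=K, p=attention)  # i ~ a
    # E[f_idx] = sum_i a_i * f_i, so the sample mean is unbiased.
    return features[idx].mean(axis=0)
```

In the actual model only the K sampled high-resolution patches are ever run through the feature network; here `features` is precomputed just to keep the sketch self-contained.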
Not All Samples Are Created Equal: Deep Learning with Importance Sampling
Deep neural network training spends most of the computation on
examples that are already properly handled and could therefore be ignored.
We propose to mitigate this phenomenon with a principled importance
sampling scheme that focuses computation on "informative" examples,
and reduces the variance of the stochastic gradients during
training. Our contribution is twofold: first, we derive a tractable
upper bound to the per-sample gradient norm and, second, we derive an
estimator of the variance reduction achieved with importance sampling,
which enables us to switch it on when it will result in an actual
speedup.
The resulting scheme can be used by changing a few lines of code in a standard
SGD procedure, and we demonstrate experimentally, on image classification, CNN
fine-tuning, and RNN training, that for a fixed wall-clock time budget, it
provides a reduction of the training loss of up to an order of magnitude and a
relative improvement of test errors between 5% and 17%.
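The scheme can be sketched as: score each example by a bound on its gradient norm, sample proportionally to the scores, and reweight each sampled gradient by 1/(N·pᵢ) so the estimate stays unbiased. A minimal NumPy sketch on linear least squares, where a residual-based score stands in for the thesis's tractable upper bound:

```python
import numpy as np

def importance_sgd_step(w, X, y, lr, batch, rng):
    """One importance-sampled SGD step for least squares.

    Samples points proportionally to a gradient-norm proxy and reweights
    by 1 / (N * p_i) so the stochastic gradient remains unbiased."""
    # For the squared loss, the per-sample gradient norm is proportional
    # to |residual| * ||x||, so this score is an exact proxy here.
    scores = np.abs(X @ w - y) * np.linalg.norm(X, axis=1) + 1e-8
    p = scores / scores.sum()
    idx = rng.choice(len(X), size=batch, p=p)
    iw = 1.0 / (len(X) * p[idx])                 # importance weights
    err = X[idx] @ w - y[idx]
    grad = 2.0 * (iw * err) @ X[idx] / batch     # unbiased estimate of grad MSE
    return w - lr * grad
```

Informative (high-score) examples are sampled more often but down-weighted, while easy examples are rarely touched, which is the mechanism behind the fixed-budget speedups reported above.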
