Efficient Bayesian active learning and matrix modelling
With the advent of the Internet and growth of storage capabilities, large collections of unlabelled data are now available. However, collecting supervised labels can be costly. Active learning addresses this by selecting, sequentially, only the most useful data in light of the information collected so far. The online nature of such algorithms often necessitates efficient computations. Thus, we present a framework for information theoretic Bayesian active learning, named Bayesian Active Learning by Disagreement, that permits efficient and accurate computations of data utility. Using this framework we develop new techniques for active Gaussian process modelling and adaptive quantum tomography. The latter has been shown, in both simulation and laboratory experiments, to yield faster learning rates than any non-adaptive design.
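As a rough illustration of the information-theoretic utility that Bayesian Active Learning by Disagreement computes, the BALD score for a classifier is the mutual information between a candidate point's label and the model parameters, which can be estimated from Monte Carlo posterior samples. The sketch below uses synthetic "posterior" predictions as stand-ins, not the thesis' Gaussian process machinery:

```python
import numpy as np

def bald_scores(probs):
    """BALD acquisition: mutual information between the label y and the
    model parameters, estimated from posterior samples.

    probs: array of shape (S, N, C) -- predicted class probabilities for
    N candidate points under S posterior parameter samples.
    Returns N scores; higher = more informative to label next.
    """
    mean_p = probs.mean(axis=0)                                    # (N, C)
    entropy_of_mean = -np.sum(mean_p * np.log(mean_p + 1e-12), axis=-1)
    mean_entropy = -np.mean(
        np.sum(probs * np.log(probs + 1e-12), axis=-1), axis=0)
    return entropy_of_mean - mean_entropy       # I[y; theta | x] >= 0

# Example: pick the next point to query from synthetic posterior predictions.
rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(3), size=(20, 5))  # (S=20, N=5, C=3)
next_query = int(np.argmax(bald_scores(probs)))
```

The score is the gap between the entropy of the averaged prediction and the average entropy of individual predictions, so it is high exactly where posterior samples disagree, which is the "disagreement" in the method's name.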
Numerous datasets can be represented as matrices. Bayesian models of matrices are becoming increasingly popular because they can handle noisy or missing elements, and are extensible to different data types. However, efficient inference is crucial to allow these flexible probabilistic models to scale to large real-world datasets. Binary matrices are a ubiquitous data type, so we present a stochastic inference algorithm for fast learning in this domain. Preference judgements are a common, implicit source of binary data. We present a hybrid matrix factorization/Gaussian process model for collaborative learning from multiple users' preferences. This model exploits the structure of the matrix and can incorporate additional covariate information to make accurate predictions.
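A minimal sketch of what fitting a binary matrix model can look like — here a generic logistic matrix factorization trained by gradient steps on the observed entries only, as a stand-in for (not a reproduction of) the stochastic inference algorithm the thesis develops:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_logistic_mf(X, mask, rank=2, lr=0.1, epochs=200, seed=0):
    """Logistic matrix factorization for a binary matrix X, fitted by
    gradient descent on the Bernoulli log-loss over observed entries
    (mask == 1). Missing entries (mask == 0) contribute no gradient."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    U = 0.1 * rng.standard_normal((n, rank))
    V = 0.1 * rng.standard_normal((m, rank))
    for _ in range(epochs):
        P = sigmoid(U @ V.T)
        G = mask * (P - X)                 # gradient of the log-loss
        U -= lr * (G @ V) / max(mask.sum(), 1)
        V -= lr * (G.T @ U) / max(mask.sum(), 1)
    return U, V

# Recover a held-out entry of a small block-structured binary matrix.
X = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]], dtype=float)
mask = np.ones_like(X)
mask[0, 3] = 0                             # pretend this entry is unobserved
U, V = fit_logistic_mf(X, mask)
pred = sigmoid(U @ V.T)[0, 3]              # predicted P(entry == 1)
```

The Bayesian versions discussed in the thesis place priors on U and V and infer posteriors rather than point estimates, which is what makes them robust to noise and missingness.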
We then combine matrix modelling with active learning and propose a new algorithm for cold-start learning with ordinal data, such as ratings. This algorithm couples Bayesian Active Learning by Disagreement with a heteroscedastic model to handle varying levels of noise. The ordinal matrix model is also used to analyze psychometric questionnaires; we examine classical assumptions made in psychometrics and show that active learning methods can reduce questionnaire lengths substantially.

This PhD was supported by the Google European Doctoral Fellowship.
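In the Gaussian case the BALD score has a closed form that makes the role of heteroscedastic noise explicit: a point observed under high noise carries less information about the underlying function, so the algorithm avoids querying it. The actual model here is ordinal, so the following is only the illustrative Gaussian analogue:

```python
import numpy as np

def gaussian_bald(f_var, noise_var):
    """Closed-form BALD score for Gaussian predictions:
        I[y; f] = 0.5 * log(1 + f_var / noise_var)
    f_var: epistemic (model) variance at each candidate point;
    noise_var: observation-noise variance, which a heteroscedastic
    model lets vary across users and items."""
    return 0.5 * np.log1p(np.asarray(f_var) / np.asarray(noise_var))

# Two items with equal model uncertainty but very different user noise:
# the noisier item is far less informative to query.
scores = gaussian_bald(f_var=[1.0, 1.0], noise_var=[0.1, 10.0])
```

This is why coupling BALD with a homoscedastic model would be misleading for ratings data: without per-user noise estimates, every equally uncertain item would look equally worth asking about.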
CLIPPO: Image-and-Language Understanding from Pixels Only
Multimodal models are becoming increasingly effective, in part due to unified
components, such as the Transformer architecture. However, multimodal models
still often consist of many task- and modality-specific pieces and training
procedures. For example, CLIP (Radford et al., 2021) trains independent text
and image towers via a contrastive loss. We explore an additional unification:
the use of a pure pixel-based model to perform image, text, and multimodal
tasks. Our model is trained with contrastive loss alone, so we call it
CLIP-Pixels Only (CLIPPO). CLIPPO uses a single encoder that processes both
regular images and text rendered as images. CLIPPO performs image-based tasks
such as retrieval and zero-shot image classification almost as well as
CLIP-style models, with half the number of parameters and no text-specific
tower or embedding. When trained jointly via image-text contrastive learning
and next-sentence contrastive learning, CLIPPO can perform well on natural
language understanding tasks, without any word-level loss (language modelling
or masked language modelling), outperforming pixel-based prior work.
Surprisingly, CLIPPO can obtain good accuracy in visual question answering,
simply by rendering the question and image together. Finally, we exploit the
fact that CLIPPO does not require a tokenizer to show that it can achieve
strong performance on multilingual multimodal retrieval without modifications.

Comment: CVPR 2023. Code and pretrained models are available at
https://github.com/google-research/big_vision/blob/main/big_vision/configs/proj/clippo/README.m
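The core unification — one encoder for pixels, whether they come from a photograph or from text rendered as an image, trained with a symmetric contrastive (InfoNCE-style) loss — can be sketched as follows. The linear "encoder" and the synthetic "rendered text" below are toy stand-ins for the paper's ViT and text renderer:

```python
import numpy as np

def encode(pixels, W):
    """One shared tower for both modalities: in CLIPPO, text is first
    rendered to pixels, so the same encoder embeds images and text.
    Here the encoder is a single linear map W (a stand-in for a ViT)."""
    z = pixels.reshape(pixels.shape[0], -1) @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)  # unit-norm embeddings

def contrastive_loss(z_img, z_txt, temperature=0.07):
    """Symmetric InfoNCE over a batch of (image, rendered-text) pairs;
    matching pairs sit on the diagonal of the similarity matrix."""
    logits = z_img @ z_txt.T / temperature
    idx = np.arange(len(logits))
    ls_i2t = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    ls_t2i = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    return 0.5 * (-ls_i2t[idx, idx].mean() - ls_t2i[idx, idx].mean())

rng = np.random.default_rng(0)
W = rng.standard_normal((16 * 16, 8))
images = rng.standard_normal((4, 16, 16))                 # batch of "images"
texts = images + 0.1 * rng.standard_normal((4, 16, 16))   # "rendered captions"
loss = contrastive_loss(encode(images, W), encode(texts, W))
```

Because both modalities pass through the same tower, there is no text-specific embedding table or tokenizer — which is also what makes the multilingual result in the abstract possible.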
Scaling Open-Vocabulary Object Detection
Open-vocabulary object detection has benefited greatly from pretrained
vision-language models, but is still limited by the amount of available
detection training data. While detection training data can be expanded by using
Web image-text pairs as weak supervision, this has not been done at scales
comparable to image-level pretraining. Here, we scale up detection data with
self-training, which uses an existing detector to generate pseudo-box
annotations on image-text pairs. Major challenges in scaling self-training are
the choice of label space, pseudo-annotation filtering, and training
efficiency. We present the OWLv2 model and OWL-ST self-training recipe, which
address these challenges. OWLv2 surpasses the performance of previous
state-of-the-art open-vocabulary detectors already at comparable training
scales (~10M examples). However, with OWL-ST, we can scale to over 1B examples,
yielding further large improvements: with an L/14 architecture, OWL-ST improves
AP on LVIS rare classes, for which the model has seen no human box annotations,
from 31.2% to 44.6% (43% relative improvement). OWL-ST unlocks Web-scale
training for open-world localization, similar to what has been seen for image
classification and language modelling.
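The self-training recipe described above — pseudo-annotate web image-text pairs with an existing detector, filter low-confidence boxes, and train on the survivors — can be sketched schematically. The detector and label-space function below are toy stand-ins, not OWL-ST's actual components:

```python
from dataclasses import dataclass

@dataclass
class PseudoBox:
    label: str      # drawn from the chosen label space (e.g. caption n-grams)
    box: tuple      # (x0, y0, x1, y1) in relative coordinates
    score: float    # detector confidence, used for filtering

def self_train_dataset(pairs, detector, label_space_fn, min_score=0.3):
    """Schematic self-training data pipeline: an existing open-vocabulary
    detector pseudo-annotates web image-text pairs; low-confidence boxes
    are filtered out; surviving annotations become training data."""
    dataset = []
    for image, caption in pairs:
        queries = label_space_fn(caption)                 # label-space choice
        boxes = detector(image, queries)                  # pseudo-annotation
        kept = [b for b in boxes if b.score >= min_score] # filtering
        if kept:
            dataset.append((image, kept))
    return dataset

# Toy run with a fake detector that is only confident about "cat".
def toy_detector(image, queries):
    return [PseudoBox(q, (0.1, 0.1, 0.5, 0.5), 0.9 if q == "cat" else 0.1)
            for q in queries]

pairs = [("img0", "a cat on a mat"), ("img1", "blue sky")]
data = self_train_dataset(pairs, toy_detector, lambda c: c.split())
```

The abstract's three scaling challenges map directly onto the three commented steps: which queries to ask (label space), which boxes to keep (filtering), and how to make the loop cheap enough to run a billion times (training efficiency, not modelled here).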