105 research outputs found

Scaling Open-Vocabulary Object Detection

Author: Gritsenko Alexey
Houlsby Neil
Minderer Matthias
Publication date: 20/07/2023

Open-vocabulary object detection has benefited greatly from pretrained vision-language models, but is still limited by the amount of available detection training data. While detection training data can be expanded by using Web image-text pairs as weak supervision, this has not been done at scales comparable to image-level pretraining. Here, we scale up detection data with self-training, which uses an existing detector to generate pseudo-box annotations on image-text pairs. Major challenges in scaling self-training are the choice of label space, pseudo-annotation filtering, and training efficiency. We present the OWLv2 model and OWL-ST self-training recipe, which address these challenges. OWLv2 surpasses the performance of previous state-of-the-art open-vocabulary detectors already at comparable training scales (~10M examples). However, with OWL-ST, we can scale to over 1B examples, yielding a further large improvement: with an L/14 architecture, OWL-ST improves AP on LVIS rare classes, for which the model has seen no human box annotations, from 31.2% to 44.6% (43% relative improvement). OWL-ST unlocks Web-scale training for open-world localization, similar to what has been seen for image classification and language modelling.
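
The relative improvement quoted in the abstract can be checked from the two AP figures it gives (31.2% before and 44.6% after); nothing beyond those two numbers is assumed here:

```latex
\[
\frac{44.6 - 31.2}{31.2} = \frac{13.4}{31.2} \approx 0.43
\]
```

That is, the 13.4-point absolute gain corresponds to roughly a 43% gain relative to the 31.2% starting point.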

arXiv.org e-Print Archive

Video OWL-ViT: Temporally-consistent open-world localization in video

Author: Bewley Alex
Gritsenko Alexey
Heigold Georg
Keysers Daniel
Kipf Thomas
Lučić Mario
Minderer Matthias
Yu Fisher
Publication date: 21/08/2023

We present an architecture and a training recipe that adapts pre-trained open-world image models to localization in videos. Understanding the open visual world (without being constrained by fixed label spaces) is crucial for many real-world vision tasks. Contrastive pre-training on large image-text datasets has recently led to significant improvements for image-level tasks. For more structured tasks involving object localization, applying pre-trained models is more challenging. This is particularly true for video tasks, where task-specific data is limited. We show successful transfer of open-world models by building on the OWL-ViT open-vocabulary detection model and adapting it to video by adding a transformer decoder. The decoder propagates object representations recurrently through time by using the output tokens for one frame as the object queries for the next. Our model is end-to-end trainable on video data and enjoys improved temporal consistency compared to tracking-by-detection baselines, while retaining the open-world capabilities of the backbone detector. We evaluate our model on the challenging TAO-OW benchmark and demonstrate that open-world capabilities, learned from large-scale image-text pre-training, can be transferred successfully to open-world localization across diverse videos. Comment: ICCV 202

arXiv.org e-Print Archive

Adaptation of short-term plasticity parameters via error-driven learning may explain the correlation between activity-dependent synaptic properties, connectivity motifs and target specificity.

Author: Asaad
Barak
Binzegger
Blackman
Blatow
Bozdagi
Buchanan
Buonomano
Carvalho
Chklovskii
Chklovskii
Clopath
Costa
Dean
Deng
Denk
Douglas
Douglas
Eleni Vasilaki
Esposito
Fiorillo
Friston
Fuhrmann
Fusi
Grillner
Gütig
Hai
Hennig
Hertz
Izhikevich
Kandell
Klyachko
Le Bé
Legenstein
Lichtman
Maass
Markram
Markram
Markram
Markram
Markram
Matveev
Michele Giugliano
Minderer
Natschläger
Perin
Pfister
Pfister
Pignatelli
Reyes
Richmond
Rinaldi
Romani
Rotman
Schultz
Seung
Silberberg
Silberberg
Sjöström
Sjöström
Sjöström
Song
Song
Testa-Silva
Thomson
Tobler
Tsodyks
Umberto Esposito
Urbanczik
Varela
Vasilaki
Vasilaki
Vasilaki
Vasilaki
Wang
Wedeen
Wickersham
Zhang
Zucker
Publication venue: 'Frontiers Media SA'
Publication date: 01/01/2015

The anatomical connectivity among neurons has been experimentally found to be largely non-random across brain areas. This means that certain connectivity motifs occur at a higher frequency than would be expected by chance. Of particular interest, short-term synaptic plasticity properties were found to colocalize with specific motifs: an over-expression of bidirectional motifs has been found in neuronal pairs where short-term facilitation dominates synaptic transmission among the neurons, whereas an over-expression of unidirectional motifs has been observed in neuronal pairs where short-term depression dominates. In previous work we found that, given a network with fixed short-term properties, the interaction between short- and long-term plasticity of synaptic transmission is sufficient for the emergence of specific motifs. Here, we introduce an error-driven learning mechanism for short-term plasticity that may explain how such observed correspondences develop from randomly initialized dynamic synapses. By allowing synapses to change their properties, neurons are able to adapt their own activity depending on an error signal. This results in richer dynamics and also, provided that the learning mechanism is target-specific, leads to specialized groups of synapses projecting onto functionally different targets, qualitatively replicating the experimental results of Wang and collaborators.

Infoscience - École polytechnique fédérale de Lausanne

Crossref

Directory of Open Access Journals

Frontiers - Publisher Connector

PubMed Central

Institutional Repository Universiteit Antwerpen

Sissa Digital Library

White Rose Research Online

Archivio istituzionale della ricerca - Università di Modena e Reggio Emilia

FlexiViT: One Model for All Patch Sizes

Author: Alabdulmohsin Ibrahim
Beyer Lucas
Caron Mathilde
Izmailov Pavel
Kolesnikov Alexander
Kornblith Simon
Minderer Matthias
Pavetic Filip
Tschannen Michael
Zhai Xiaohua
Publication date: 23/03/2023

Vision Transformers convert images to sequences by slicing them into patches. The size of these patches controls a speed/accuracy tradeoff, with smaller patches leading to higher accuracy at greater computational cost, but changing the patch size typically requires retraining the model. In this paper, we demonstrate that simply randomizing the patch size at training time leads to a single set of weights that performs well across a wide range of patch sizes, making it possible to tailor the model to different compute budgets at deployment time. We extensively evaluate the resulting model, which we call FlexiViT, on a wide range of tasks, including classification, image-text retrieval, open-world detection, panoptic segmentation, and semantic segmentation, concluding that it usually matches, and sometimes outperforms, standard ViT models trained at a single patch size in an otherwise identical setup. Hence, FlexiViT training is a simple drop-in improvement for ViT that makes it easy to add compute-adaptive capabilities to most models relying on a ViT backbone architecture. Code and pre-trained models are available at https://github.com/google-research/big_vision. Comment: All authors made significant technical contributions. CVPR 202
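
To make the speed/accuracy tradeoff described in this abstract concrete, the following is a minimal sketch of how the token count depends on patch size, and of drawing a random patch size per training step. The image size, the list of patch sizes, and the function names are hypothetical values chosen for illustration; the abstract does not specify them, and the sketch does not cover how model weights are adapted to each patch size.

```python
import random


def num_tokens(image_size: int, patch_size: int) -> int:
    """Sequence length for a square image cut into square, non-overlapping patches."""
    if image_size % patch_size != 0:
        raise ValueError("patch_size must divide image_size")
    side = image_size // patch_size
    return side * side


def sample_patch_size(rng: random.Random, patch_sizes: list[int]) -> int:
    """Draw one patch size uniformly at random, as one might do per training step."""
    return rng.choice(patch_sizes)


# Hypothetical settings for illustration only (not taken from the abstract).
IMAGE_SIZE = 240
PATCH_SIZES = [8, 12, 16, 24, 48]

rng = random.Random(0)
for step in range(3):
    p = sample_patch_size(rng, PATCH_SIZES)
    # Smaller patches give longer sequences, hence more compute per image.
    print(f"step {step}: patch size {p}, {num_tokens(IMAGE_SIZE, p)} tokens")
```

Under these assumed settings, halving the patch size quadruples the sequence length, which is why a single set of weights that tolerates many patch sizes allows the compute budget to be chosen at deployment time.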

arXiv.org e-Print Archive

Improving fine-grained understanding in image-text pre-training

Author: Bauer Matthias
Bica Ioana
Blundell Charles
Bošnjak Matko
Erdogan Goker
Gritsenko Alexey A.
Ilić Anastasija
Kaplanis Christos
Minderer Matthias
Mitrović Jovana
Pascanu Razvan
Publication date: 18/01/2024

We introduce SPARse Fine-grained Contrastive Alignment (SPARC), a simple method for pretraining more fine-grained multimodal representations from image-text pairs. Given that multiple image patches often correspond to single words, we propose to learn a grouping of image patches for every token in the caption. To achieve this, we use a sparse similarity metric between image patches and language tokens and compute for each token a language-grouped vision embedding as the weighted average of patches. The token and language-grouped vision embeddings are then contrasted through a fine-grained sequence-wise loss that only depends on individual samples and does not require other batch samples as negatives. This enables more detailed information to be learned in a computationally inexpensive manner. SPARC combines this fine-grained loss with a contrastive loss between global image and text embeddings to learn representations that simultaneously encode global and local information. We thoroughly evaluate our proposed method and show improved performance over competing approaches both on image-level tasks relying on coarse-grained information, e.g. classification, as well as region-level tasks relying on fine-grained information, e.g. retrieval, object detection, and segmentation. Moreover, SPARC improves model faithfulness and captioning in foundational vision-language models. Comment: 26 page
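
The "language-grouped vision embedding" step in this abstract can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the min-max normalisation, the 1/P sparsity threshold, and the function names are assumptions made here, since the abstract only states that a sparse similarity is used and that patches are averaged with weights.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def language_grouped_embedding(token, patches):
    """Weighted average of patch embeddings for one caption token.

    Returns (embedding, weights). Weights are sparse and sum to 1.
    """
    sims = [dot(token, p) for p in patches]
    lo, hi = min(sims), max(sims)
    span = hi - lo
    # Assumption: min-max normalise similarities to [0, 1];
    # if every patch ties, treat them all equally.
    norm = [(s - lo) / span for s in sims] if span > 0 else [1.0] * len(sims)
    # Assumption: zero out weights below 1 / (number of patches).
    threshold = 1.0 / len(patches)
    sparse = [w if w >= threshold else 0.0 for w in norm]
    total = sum(sparse)  # at least 1.0, since the best patch keeps weight 1.0
    weights = [w / total for w in sparse]
    dim = len(patches[0])
    embedding = [sum(w * p[d] for w, p in zip(weights, patches)) for d in range(dim)]
    return embedding, weights


# Toy example: one 2-D token embedding and four 2-D patch embeddings.
token = [1.0, 0.0]
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
embedding, weights = language_grouped_embedding(token, patches)
print(weights)    # the least similar patch receives weight 0
print(embedding)
```

Because the weights for a token are computed only from that sample's own patches, a loss built on these grouped embeddings needs no other batch samples as negatives, which matches the property the abstract highlights.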

arXiv.org e-Print Archive

Simple Open-Vocabulary Object Detection with Vision Transformers

Author: Arnab Anurag
Dehghani Mostafa
Dosovitskiy Alexey
Gritsenko Alexey
Houlsby Neil
Kipf Thomas
Mahendran Aravindh
Minderer Matthias
Neumann Maxim
Shen Zhuoran
Stone Austin
Wang Xiao
Weissenborn Dirk
Zhai Xiaohua
Publication date: 20/07/2022

Combining simple architectures with large-scale pre-training has led to massive improvements in image classification. For object detection, pre-training and scaling approaches are less well established, especially in the long-tailed and open-vocabulary setting, where training data is relatively scarce. In this paper, we propose a strong recipe for transferring image-text models to open-vocabulary object detection. We use a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning. Our analysis of the scaling properties of this setup shows that increasing image-level pre-training and model size yields consistent improvements on the downstream detection task. We provide the adaptation strategies and regularizations needed to attain very strong performance on zero-shot text-conditioned and one-shot image-conditioned object detection. Code and models are available on GitHub. Comment: ECCV 2022 camera-ready versio

arXiv.org e-Print Archive

Spatial cell firing during virtual navigation of open arenas by head-restrained mice

Author: Acharya
Aghajan
Aronov
Burgess
Chen
Climer
Cohen
Cushman
Danielson
Dombeck
Domnisoru
Fyhn
Hafting
Harvey
Heys
Hölscher
Jeewajee
Kadir
Kropff
Low
McFarland
McNaughton
McNaughton
Minderer
Morris
Muller
O'Keefe
Ravassard
Rivas
Royer
Russell
Sargolini
Schmidt-Hieber
Sławińska
Tcheang
Towse
Villette
Voigts
Publication venue: 'eLife Sciences Publications, Ltd'
Publication date: 18/06/2018

We present a mouse virtual reality (VR) system which restrains head-movements to horizontal rotations, compatible with multi-photon imaging. This system allows expression of the spatial navigation and neuronal firing patterns characteristic of real open arenas (R). Comparing VR to R: place and grid, but not head-direction, cell firing had broader spatial tuning; place, but not grid, cell firing was more directional; theta frequency increased less with running speed; whereas increases in firing rates with running speed and place and grid cells' theta phase precession were similar. These results suggest that the omni-directional place cell firing in R may require local-cues unavailable in VR, and that the scale of grid and place cell firing patterns, and theta frequency, reflect translational motion inferred from both virtual (visual and proprioceptive) and real (vestibular translation and extra-maze) cues. By contrast, firing rates and theta phase precession appear to reflect visual and proprioceptive cues alone.

Crossref

UCL Discovery

Queen Mary Research Online