Identifying Abusive Videos Inserted In A Video
Disclosed herein is an improved mechanism for identifying abusive videos that have been inserted into a video. The mechanism can identify a video to be analyzed. The mechanism can then identify, for each frame of the video, a group of candidate locations within the frame at which a two-dimensional video is likely to be embedded. The mechanism can cluster the identified candidate locations and identify a subset of the candidate locations based on the clusters. The mechanism can then identify a two-dimensional video embedded at one of the locations included in the subset of candidate locations.
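The disclosure does not specify the clustering method, so the following is only a minimal Python sketch of the pipeline's shape, assuming candidate boxes come from some upstream per-frame detector and using simple grid quantization as a stand-in for the clustering step; all names are hypothetical.

```python
# Illustrative sketch only: grid quantization stands in for the
# unspecified clustering, and `frame_candidates` is assumed to come
# from an upstream per-frame detector of rectangular regions.
from collections import defaultdict

def select_embed_locations(frame_candidates, grid=16, min_frames=30):
    """frame_candidates: per-frame lists of (x, y, w, h) integer boxes."""
    clusters = defaultdict(set)            # quantized box -> frames seen in
    for frame_idx, boxes in enumerate(frame_candidates):
        for (x, y, w, h) in boxes:
            key = (x // grid, y // grid, w // grid, h // grid)
            clusters[key].add(frame_idx)
    # A region hosting an inserted video should recur across many frames;
    # keep only persistent clusters and map keys back to pixel boxes.
    return [(kx * grid, ky * grid, kw * grid, kh * grid)
            for (kx, ky, kw, kh), frames in clusters.items()
            if len(frames) >= min_frames]
```

Each surviving region would then be checked for an embedded two-dimensional video, the final identification step the abstract describes.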
Budgeted sensor placement for source localization on trees
We address the problem of choosing a fixed number of sensor vertices in a graph in order to detect the source of a partially observed diffusion process on the graph itself. Building on the definition of double resolvability, we introduce a notion of vertex resolvability. For the case of tree graphs, we give polynomial-time algorithms both for finding the sensors that maximize the probability of correct detection of the source and for identifying the sensor set that minimizes the expected distance between the true source and the estimated one.
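The polynomial-time tree algorithms themselves are not given in the abstract; the sketch below is only a brute-force baseline under a strong simplification, namely a deterministic diffusion in which each sensor observes its graph distance to the source, so that a sensor set is informative exactly when distance signatures are distinct, in the spirit of resolvability. All names are illustrative.

```python
# Brute-force baseline, exponential in the budget: assumes deterministic
# diffusion, so each sensor observes its exact distance to the source.
from itertools import combinations
from collections import deque

def bfs_dist(adj, src):
    dist, q = {src: 0}, deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def detection_prob(adj, sensors):
    # Sources with identical distance signatures are indistinguishable;
    # with a uniform prior and uniform guessing within a signature class,
    # the expected success probability is (#classes) / n.
    sig = set()
    for v in adj:
        d = bfs_dist(adj, v)
        sig.add(tuple(d[s] for s in sensors))
    return len(sig) / len(adj)

def best_placement(adj, budget):
    return max(combinations(sorted(adj), budget),
               key=lambda s: detection_prob(adj, s))

tree = {0: [1, 2], 1: [0, 3, 4], 2: [0], 3: [1], 4: [1]}
print(best_placement(tree, budget=2))  # prints an optimal pair for this toy tree
```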
FlexiViT: One Model for All Patch Sizes
Vision Transformers convert images to sequences by slicing them into patches.
The size of these patches controls a speed/accuracy tradeoff, with smaller
patches leading to higher accuracy at greater computational cost, but changing
the patch size typically requires retraining the model. In this paper, we
demonstrate that simply randomizing the patch size at training time leads to a
single set of weights that performs well across a wide range of patch sizes,
making it possible to tailor the model to different compute budgets at
deployment time. We extensively evaluate the resulting model, which we call
FlexiViT, on a wide range of tasks, including classification, image-text
retrieval, open-world detection, panoptic segmentation, and semantic
segmentation, concluding that it usually matches, and sometimes outperforms,
standard ViT models trained at a single patch size in an otherwise identical
setup. Hence, FlexiViT training is a simple drop-in improvement for ViT that
makes it easy to add compute-adaptive capabilities to most models relying on a
ViT backbone architecture. Code and pre-trained models are available at
https://github.com/google-research/big_vision (CVPR 2023).
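The key mechanism is resampling the patch-embedding kernel to whatever patch size is drawn at each training step. Below is a minimal NumPy/SciPy sketch of that idea; plain bilinear resizing stands in for the paper's pseudo-inverse "PI-resize", positional-embedding resizing and the transformer body are omitted, and the image side is assumed divisible by every candidate patch size.

```python
# Sketch of FlexiViT-style patch-size randomization. Bilinear `zoom`
# replaces the paper's PI-resize; positional embeddings, which must be
# resized to the new token grid as well, are omitted for brevity.
import numpy as np
from scipy.ndimage import zoom

def resize_patch_kernel(kernel, new_p):
    """kernel: (p, p, c, d) patch-embedding weights -> (new_p, new_p, c, d)."""
    s = new_p / kernel.shape[0]
    return zoom(kernel, (s, s, 1, 1), order=1)

def patchify(images, p):
    """images: (n, h, w, c), h and w divisible by p -> (n, tokens, p*p*c)."""
    n, h, w, c = images.shape
    x = images.reshape(n, h // p, p, w // p, p, c)
    return x.transpose(0, 1, 3, 2, 4, 5).reshape(n, -1, p * p * c)

def flexi_tokens(images, kernel, rng, patch_sizes=(8, 16, 32)):
    p = int(rng.choice(patch_sizes))     # randomize patch size per step
    k = resize_patch_kernel(kernel, p).reshape(p * p * kernel.shape[2], -1)
    return patchify(images, p) @ k       # (n, tokens, d); tokens varies with p
```

At deployment time one simply fixes `p` to match the available compute budget, which is exactly the compute-adaptive capability the abstract advertises.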
PaLI-3 Vision Language Models: Smaller, Faster, Stronger
This paper presents PaLI-3, a smaller, faster, and stronger vision language
model (VLM) that compares favorably to similar models that are 10x larger. As
part of arriving at this strong performance, we compare Vision Transformer
(ViT) models pretrained using classification objectives to contrastively
(SigLIP) pretrained ones. We find that, while slightly underperforming on
standard image classification benchmarks, SigLIP-based PaLI shows superior
performance across various multimodal benchmarks, especially on localization
and visually-situated text understanding. We scale the SigLIP image encoder up
to 2 billion parameters and achieve a new state of the art on multilingual
cross-modal retrieval. We hope that PaLI-3, at only 5B parameters, rekindles
research on the fundamental pieces of complex VLMs and fuels a new generation
of scaled-up models.
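For readers unfamiliar with SigLIP, the contrastive objective PaLI-3 builds on is the pairwise sigmoid loss of Zhai et al. (2023); a compact NumPy sketch follows, with toy shapes and hypothetical variable names.

```python
# Pairwise sigmoid loss as in the SigLIP paper: every image-text pair in
# the batch is an independent binary classification (+1 on the diagonal).
import numpy as np

def siglip_loss(img_emb, txt_emb, t, b):
    """img_emb, txt_emb: (n, d) L2-normalized embeddings; t, b: scalars."""
    logits = t * img_emb @ txt_emb.T + b          # (n, n) pairwise logits
    labels = 2.0 * np.eye(len(img_emb)) - 1.0     # +1 matched, -1 unmatched
    # -log sigmoid(label * logit) == logaddexp(0, -label * logit)
    return np.mean(np.sum(np.logaddexp(0.0, -labels * logits), axis=1))
```

In the SigLIP paper, `t` and `b` are learned, with the temperature parameterized as an exponential and the bias initialized to a large negative value to offset the heavy imbalance of negative pairs in the batch.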
PaLI-X: On Scaling up a Multilingual Vision and Language Model
We present the training recipe and results of scaling up PaLI-X, a
multilingual vision and language model, both in terms of size of the components
and the breadth of its training task mixture. Our model achieves new levels of
performance on a wide range of varied and complex tasks, including multiple
image-based captioning and question-answering tasks, image-based document
understanding and few-shot (in-context) learning, as well as object detection,
video question answering, and video captioning. PaLI-X advances the
state-of-the-art on most vision-and-language benchmarks considered (25+ of
them). Finally, we observe emerging capabilities, such as complex counting and
multilingual object detection, tasks that are not explicitly in the training
mix.
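The "training task mixture" mentioned above is, mechanically, weighted sampling across per-task data streams; a toy sketch follows, with task names and weights invented for illustration rather than taken from the paper.

```python
# Hypothetical task mixture: the tasks and weights below are invented
# for illustration; PaLI-X's actual mixture is described in the paper.
import random

MIXTURE = {
    "captioning": 0.3,
    "vqa": 0.3,
    "document_understanding": 0.2,
    "object_detection": 0.1,
    "video_qa": 0.1,
}

def mixed_stream(loaders, rng=random.Random(0)):
    """loaders: task name -> infinite iterator of examples."""
    tasks, weights = zip(*MIXTURE.items())
    while True:
        task = rng.choices(tasks, weights=weights, k=1)[0]
        yield task, next(loaders[task])
```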