Identifying Abusive Videos Inserted In A Video
Disclosed herein is an improved mechanism for identifying abusive videos that have been inserted into a video. The mechanism can identify a video to be analyzed. The mechanism can then identify, for each frame of the video, a group of candidate locations within the frame at which a two-dimensional video is likely to be embedded. The mechanism can cluster the identified candidate locations and identify a subset of the candidate locations based on the clusters. The mechanism can then identify a two-dimensional video embedded at one of the locations included in the subset of candidate locations.
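The disclosure does not specify the clustering method, so the following is only a minimal Python sketch of the pipeline's shape, assuming candidate boxes come from some upstream per-frame detector and using simple grid quantization as a stand-in for the clustering step; all names are hypothetical.

```python
# Illustrative sketch only: grid quantization stands in for the
# unspecified clustering, and `frame_candidates` is assumed to come
# from an upstream per-frame detector of rectangular regions.
from collections import defaultdict

def select_embed_locations(frame_candidates, grid=16, min_frames=30):
    """frame_candidates: per-frame lists of (x, y, w, h) integer boxes."""
    clusters = defaultdict(set)            # quantized box -> frames seen in
    for frame_idx, boxes in enumerate(frame_candidates):
        for (x, y, w, h) in boxes:
            key = (x // grid, y // grid, w // grid, h // grid)
            clusters[key].add(frame_idx)
    # A region hosting an inserted video should recur across many frames;
    # keep only persistent clusters and map keys back to pixel boxes.
    return [(kx * grid, ky * grid, kw * grid, kh * grid)
            for (kx, ky, kw, kh), frames in clusters.items()
            if len(frames) >= min_frames]
```

Each surviving region would then be checked for an embedded two-dimensional video, the final identification step the abstract describes.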
Budgeted sensor placement for source localization on trees
We address the problem of choosing a fixed number of sensor vertices in a graph in order to detect the source of a partially observed diffusion process on the graph itself. Building on the definition of double resolvability, we introduce a notion of vertex resolvability. For the case of tree graphs, we give polynomial-time algorithms both for finding the sensors that maximize the probability of correct detection of the source and for identifying the sensor set that minimizes the expected distance between the true source and the estimated one.
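The polynomial-time tree algorithms themselves are not given in the abstract; the sketch below is only a brute-force baseline under a strong simplification, namely a deterministic diffusion in which each sensor observes its graph distance to the source, so that a sensor set is informative exactly when distance signatures are distinct, in the spirit of resolvability. All names are illustrative.

```python
# Brute-force baseline, exponential in the budget: assumes deterministic
# diffusion, so each sensor observes its exact distance to the source.
from itertools import combinations
from collections import deque

def bfs_dist(adj, src):
    dist, q = {src: 0}, deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def detection_prob(adj, sensors):
    # Sources with identical distance signatures are indistinguishable;
    # with a uniform prior and uniform guessing within a signature class,
    # the expected success probability is (#classes) / n.
    sig = set()
    for v in adj:
        d = bfs_dist(adj, v)
        sig.add(tuple(d[s] for s in sensors))
    return len(sig) / len(adj)

def best_placement(adj, budget):
    return max(combinations(sorted(adj), budget),
               key=lambda s: detection_prob(adj, s))

tree = {0: [1, 2], 1: [0, 3, 4], 2: [0], 3: [1], 4: [1]}
print(best_placement(tree, budget=2))  # prints an optimal pair for this toy tree
```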
FlexiViT: One Model for All Patch Sizes
Vision Transformers convert images to sequences by slicing them into patches.
The size of these patches controls a speed/accuracy tradeoff, with smaller
patches leading to higher accuracy at greater computational cost, but changing
the patch size typically requires retraining the model. In this paper, we
demonstrate that simply randomizing the patch size at training time leads to a
single set of weights that performs well across a wide range of patch sizes,
making it possible to tailor the model to different compute budgets at
deployment time. We extensively evaluate the resulting model, which we call
FlexiViT, on a wide range of tasks, including classification, image-text
retrieval, open-world detection, panoptic segmentation, and semantic
segmentation, concluding that it usually matches, and sometimes outperforms,
standard ViT models trained at a single patch size in an otherwise identical
setup. Hence, FlexiViT training is a simple drop-in improvement for ViT that
makes it easy to add compute-adaptive capabilities to most models relying on a
ViT backbone architecture. Code and pre-trained models are available at
https://github.com/google-research/big_vision (CVPR 2023).
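The key mechanism is resampling the patch-embedding kernel to whatever patch size is drawn at each training step. Below is a minimal NumPy/SciPy sketch of that idea; plain bilinear resizing stands in for the paper's pseudo-inverse "PI-resize", positional-embedding resizing and the transformer body are omitted, and the image side is assumed divisible by every candidate patch size.

```python
# Sketch of FlexiViT-style patch-size randomization. Bilinear `zoom`
# replaces the paper's PI-resize; positional embeddings, which must be
# resized to the new token grid as well, are omitted for brevity.
import numpy as np
from scipy.ndimage import zoom

def resize_patch_kernel(kernel, new_p):
    """kernel: (p, p, c, d) patch-embedding weights -> (new_p, new_p, c, d)."""
    s = new_p / kernel.shape[0]
    return zoom(kernel, (s, s, 1, 1), order=1)

def patchify(images, p):
    """images: (n, h, w, c), h and w divisible by p -> (n, tokens, p*p*c)."""
    n, h, w, c = images.shape
    x = images.reshape(n, h // p, p, w // p, p, c)
    return x.transpose(0, 1, 3, 2, 4, 5).reshape(n, -1, p * p * c)

def flexi_tokens(images, kernel, rng, patch_sizes=(8, 16, 32)):
    p = int(rng.choice(patch_sizes))     # randomize patch size per step
    k = resize_patch_kernel(kernel, p).reshape(p * p * kernel.shape[2], -1)
    return patchify(images, p) @ k       # (n, tokens, d); tokens varies with p
```

At deployment time one simply fixes `p` to match the available compute budget, which is exactly the compute-adaptive capability the abstract advertises.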
PaLI-3 Vision Language Models: Smaller, Faster, Stronger
This paper presents PaLI-3, a smaller, faster, and stronger vision language
model (VLM) that compares favorably to similar models that are 10x larger. As
part of arriving at this strong performance, we compare Vision Transformer
(ViT) models pretrained using classification objectives to contrastively
(SigLIP) pretrained ones. We find that, while slightly underperforming on
standard image classification benchmarks, SigLIP-based PaLI shows superior
performance across various multimodal benchmarks, especially on localization
and visually-situated text understanding. We scale the SigLIP image encoder up
to 2 billion parameters and achieve a new state of the art on multilingual
cross-modal retrieval. We hope that PaLI-3, at only 5B parameters, rekindles
research on the fundamental pieces of complex VLMs and fuels a new generation
of scaled-up models.
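For readers unfamiliar with SigLIP, the contrastive objective PaLI-3 builds on is the pairwise sigmoid loss of Zhai et al. (2023); a compact NumPy sketch follows, with toy shapes and hypothetical variable names.

```python
# Pairwise sigmoid loss as in the SigLIP paper: every image-text pair in
# the batch is an independent binary classification (+1 on the diagonal).
import numpy as np

def siglip_loss(img_emb, txt_emb, t, b):
    """img_emb, txt_emb: (n, d) L2-normalized embeddings; t, b: scalars."""
    logits = t * img_emb @ txt_emb.T + b          # (n, n) pairwise logits
    labels = 2.0 * np.eye(len(img_emb)) - 1.0     # +1 matched, -1 unmatched
    # -log sigmoid(label * logit) == logaddexp(0, -label * logit)
    return np.mean(np.sum(np.logaddexp(0.0, -labels * logits), axis=1))
```

In the SigLIP paper, `t` and `b` are learned, with the temperature parameterized as an exponential and the bias initialized to a large negative value to offset the heavy imbalance of negative pairs in the batch.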
PaLI-X: On Scaling up a Multilingual Vision and Language Model
We present the training recipe and results of scaling up PaLI-X, a
multilingual vision and language model, both in terms of size of the components
and the breadth of its training task mixture. Our model achieves new levels of
performance on a wide range of varied and complex tasks, including multiple
image-based captioning and question-answering tasks, image-based document
understanding and few-shot (in-context) learning, as well as object detection,
video question answering, and video captioning. PaLI-X advances the
state-of-the-art on most vision-and-language benchmarks considered (25+ of
them). Finally, we observe emerging capabilities, such as complex counting and
multilingual object detection, tasks that are not explicitly in the training
mix.
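The "training task mixture" mentioned above is, mechanically, weighted sampling across per-task data streams; a toy sketch follows, with task names and weights invented for illustration rather than taken from the paper.

```python
# Hypothetical task mixture: the tasks and weights below are invented
# for illustration; PaLI-X's actual mixture is described in the paper.
import random

MIXTURE = {
    "captioning": 0.3,
    "vqa": 0.3,
    "document_understanding": 0.2,
    "object_detection": 0.1,
    "video_qa": 0.1,
}

def mixed_stream(loaders, rng=random.Random(0)):
    """loaders: task name -> infinite iterator of examples."""
    tasks, weights = zip(*MIXTURE.items())
    while True:
        task = rng.choices(tasks, weights=weights, k=1)[0]
        yield task, next(loaders[task])
```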