A contrario patch matching, with an application to keypoint matches validation
We describe a simple metric for image patch similarity, together with a robust criterion for unsupervised patch matching. The gradient orientations at corresponding positions in the two patches are compared and the normalized errors are accumulated. Based on the a contrario framework, the matching criterion validates a match between two patches when this cumulative error is too small to have occurred as the result of an accidental agreement. The method is illustrated in the validation of keypoint matches. This is the author accepted manuscript. The final version is available from IEEE via http://dx.doi.org/10.1109/ICIP.2015.735093
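As a rough illustration of the criterion described above (not the authors' exact implementation; the helper names, the Gaussian tail approximation, and the epsilon threshold are assumptions), the following sketch accumulates normalized gradient-orientation errors over a patch pair and accepts the match only when that sum is unlikely under an a contrario background model of independent, uniform orientations.

```python
import numpy as np
from scipy.stats import norm


def orientation_error_sum(patch_a, patch_b):
    """Sum of normalised gradient-orientation errors between two grey patches.

    Hypothetical helper: each per-pixel angular error is folded to [0, pi]
    and divided by pi so that it lies in [0, 1].
    """
    gy_a, gx_a = np.gradient(patch_a.astype(float))
    gy_b, gx_b = np.gradient(patch_b.astype(float))
    theta_a = np.arctan2(gy_a, gx_a)
    theta_b = np.arctan2(gy_b, gx_b)
    diff = np.abs(np.angle(np.exp(1j * (theta_a - theta_b))))  # in [0, pi]
    errors = diff / np.pi                                      # in [0, 1]
    return errors.sum(), errors.size


def a_contrario_match(patch_a, patch_b, n_tests, epsilon=1.0):
    """Accept the match when the cumulative error is unlikely by accident.

    Under the background model the per-pixel errors are i.i.d. uniform on
    [0, 1]; the lower tail of their sum is approximated here with a Gaussian
    of mean n/2 and variance n/12.  NFA = n_tests * P(S <= s); the match is
    validated when NFA < epsilon.
    """
    s, n = orientation_error_sum(patch_a, patch_b)
    p_small = norm.cdf(s, loc=n / 2.0, scale=np.sqrt(n / 12.0))
    nfa = n_tests * p_small
    return nfa < epsilon, nfa


# toy usage: a patch matched against a slightly perturbed copy of itself
a = np.random.rand(16, 16)
accepted, nfa = a_contrario_match(a, a + 0.01 * np.random.rand(16, 16), n_tests=1e5)
```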
Affine invariant visual phrases for object instance recognition
Object instance recognition approaches based on the bag-of-words model are severely affected by the loss of spatial consistency during retrieval. As a result, costly RANSAC verification is needed to ensure geometric consistency between the query and the retrieved images. A common alternative is to inject geometric information directly into the retrieval procedure, by endowing the visual words with additional information. Most of the existing approaches in this category can efficiently handle only restricted classes of geometric transformations, including scale and translation. In this paper, we propose a simple and efficient scheme that can cover the more complex class of full affine transformations. We demonstrate the usefulness of our approach in the case of planar object instance recognition, such as recognition of books, logos, traffic signs, etc. This work was funded by a Google Faculty Research Award, the Marie Curie grant CIG-334283-HRGP, and a CNRS chaire d'excellence. This is the author accepted manuscript. The final version is available at http://dx.doi.org/10.1109/MVA.2015.715312
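The abstract does not spell out how the visual words are augmented, but one generic way to obtain affine-invariant geometric information for a pair of affine-covariant features is to express one keypoint's position in the local affine frame of the other: the quantity inv(A_i)(x_j - x_i) is unchanged by any full affine map of the image. The sketch below, with hypothetical names and bin sizes, quantizes that quantity into a "phrase" key; it should be read as an illustration of the general idea rather than the paper's exact scheme.

```python
import numpy as np


def phrase_key(word_i, word_j, A_i, x_i, x_j, n_bins=8, r_max=5.0):
    """Hypothetical affine-invariant phrase key for a pair of local features.

    x_i, x_j  : 2D keypoint centres
    A_i       : 2x2 local affine frame of feature i (affine-covariant shape)
    word_i/j  : visual-word ids of the two descriptors

    If the image is transformed by x -> M x + t and the frames are
    affine-covariant (A -> M A), then inv(A_i) @ (x_j - x_i) is unchanged,
    so the quantised key survives any full affine transformation.
    """
    rel = np.linalg.solve(np.asarray(A_i, float),
                          np.asarray(x_j, float) - np.asarray(x_i, float))
    r = min(np.linalg.norm(rel), r_max - 1e-9)
    angle = np.arctan2(rel[1], rel[0]) % (2 * np.pi)
    r_bin = int(n_bins * r / r_max)
    a_bin = int(n_bins * angle / (2 * np.pi))
    return (word_i, word_j, r_bin, a_bin)


# toy usage: the key is identical before and after a random affine map
A_i, x_i, x_j = np.array([[2.0, 0.3], [0.1, 1.5]]), np.array([10.0, 20.0]), np.array([14.0, 23.0])
M, t = np.array([[1.2, -0.4], [0.5, 0.9]]), np.array([3.0, -7.0])
assert phrase_key(5, 9, A_i, x_i, x_j) == phrase_key(5, 9, M @ A_i, M @ x_i + t, M @ x_j + t)
```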
Spatio-temporal video autoencoder with differentiable memory
We describe a new spatio-temporal video autoencoder, based on a classic spatial image autoencoder and a novel nested temporal autoencoder. The temporal encoder is represented by a differentiable visual memory composed of convolutional long short-term memory (LSTM) cells that integrate changes over time. Here we target motion changes and use as temporal decoder a robust optical flow prediction module together with an image sampler serving as a built-in feedback loop. The architecture is end-to-end differentiable. At each time step, the system receives a video frame as input, predicts the optical flow based on the current observation and the LSTM memory state as a dense transformation map, and applies it to the current frame to generate the next frame. By minimising the reconstruction error between the predicted next frame and the corresponding ground truth next frame, we train the whole system to extract features useful for motion estimation without any supervision effort. We present one direct application of the proposed framework in weakly-supervised semantic segmentation of videos through label propagation using optical flow.
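The loop described above maps naturally onto a short PyTorch sketch. Everything below is a simplified stand-in (the single-convolution encoder, layer widths, and the L1 photometric loss are assumptions, not the published configuration), but it follows the abstract: encode the frame, update a convolutional LSTM memory, predict a dense flow map, warp the current frame with a differentiable sampler, and minimise the error against the true next frame.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvLSTMCell(nn.Module):
    """Minimal convolutional LSTM cell: all four gates from one convolution."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        c = f.sigmoid() * c + i.sigmoid() * g.tanh()
        h = o.sigmoid() * c.tanh()
        return h, c


class FlowPredictor(nn.Module):
    """Toy stand-in for the pipeline: spatial encoder -> ConvLSTM memory ->
    dense flow map -> differentiable warping of the current frame."""
    def __init__(self, hid_ch=32):
        super().__init__()
        self.enc = nn.Conv2d(3, hid_ch, 3, padding=1)
        self.lstm = ConvLSTMCell(hid_ch, hid_ch)
        self.flow = nn.Conv2d(hid_ch, 2, 3, padding=1)

    def forward(self, frame, state):
        h, c = self.lstm(F.relu(self.enc(frame)), state)
        flow = self.flow(h)                       # (B, 2, H, W) displacements
        return self.warp(frame, flow), (h, c)

    @staticmethod
    def warp(frame, flow):
        b, _, hgt, wid = frame.shape
        ys, xs = torch.meshgrid(torch.arange(hgt), torch.arange(wid), indexing="ij")
        grid = torch.stack([xs, ys], dim=-1).to(frame) + flow.permute(0, 2, 3, 1)
        gx = 2 * grid[..., 0] / (wid - 1) - 1     # normalise to [-1, 1]
        gy = 2 * grid[..., 1] / (hgt - 1) - 1
        return F.grid_sample(frame, torch.stack([gx, gy], dim=-1), align_corners=True)


# unsupervised training step on a dummy clip (B, T, C, H, W)
model, frames = FlowPredictor(), torch.rand(1, 4, 3, 64, 64)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
h = c = torch.zeros(1, 32, 64, 64)
loss = 0.0
for t in range(frames.shape[1] - 1):
    pred, (h, c) = model(frames[:, t], (h, c))
    loss = loss + F.l1_loss(pred, frames[:, t + 1])   # photometric reconstruction error
loss.backward()
opt.step()
```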
Joint A Contrario Ellipse and Line Detection.
We propose a line segment and elliptical arc detector that produces a reduced number of false detections on various types of images without any parameter tuning. For a given region of pixels in a grey-scale image, the detector decides whether a line segment or an elliptical arc is present (model validation). If both interpretations are possible for the same region, the detector chooses the one that best explains the data (model selection). We describe a statistical criterion based on the a contrario theory, which serves for both validation and model selection. The experimental results highlight the performance of the proposed approach compared to state-of-the-art detectors, when applied to synthetic and real images. This work was partially funded by the Qualcomm postdoctoral program at École Polytechnique Palaiseau, a Google Faculty Research Award, the Marie Curie grant CIG-334283-HRGP, a CNRS chaire d'excellence and chaire Jean Marjoulet, and EPSRC grant EP/L010917/1. This is the author accepted manuscript. The final version is available from IEEE via http://dx.doi.org/10.1109/TPAMI.2016.2558150
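A minimal sketch of an a contrario validation and model-selection criterion of this kind is given below, assuming the usual formulation in which a candidate region is scored by a Number of False Alarms, NFA = (number of tests) × (binomial tail probability of observing that many gradient-aligned pixels by chance). The function names, the epsilon = 1 threshold, and the example counts are illustrative, not taken from the paper.

```python
from scipy.stats import binom


def nfa(n_aligned, n_total, precision, n_tests):
    """Number of False Alarms for a candidate primitive.

    Under the a contrario background model, gradient orientations are i.i.d.
    and each pixel agrees with the candidate (line or ellipse) with
    probability `precision`.  NFA = n_tests * P[Binomial(n_total, precision)
    >= n_aligned]; a candidate is epsilon-meaningful when NFA < epsilon.
    """
    tail = binom.sf(n_aligned - 1, n_total, precision)   # P(X >= n_aligned)
    return n_tests * tail


def select_model(candidates, n_tests, epsilon=1.0):
    """Validation plus model selection: keep the interpretation with the
    smallest NFA, provided it is meaningful.

    `candidates` maps a model name to (n_aligned, n_total, precision)."""
    scored = {name: nfa(k, n, p, n_tests) for name, (k, n, p) in candidates.items()}
    best = min(scored, key=scored.get)
    return (best, scored[best]) if scored[best] < epsilon else (None, None)


# toy usage: the same region interpreted as a line or as an elliptical arc
best, score = select_model({"line": (180, 200, 0.125),
                            "ellipse": (195, 200, 0.125)}, n_tests=1e8)
```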
Perception Test: A Diagnostic Benchmark for Multimodal Video Models
We propose a novel multimodal video benchmark - the Perception Test - to evaluate the perception and reasoning skills of pre-trained multimodal models (e.g. Flamingo, BEiT-3, or GPT-4). Compared to existing benchmarks that focus on computational tasks (e.g. classification, detection or tracking), the Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning (descriptive, explanatory, predictive, counterfactual) across video, audio, and text modalities, to provide a comprehensive and efficient evaluation tool. The benchmark probes pre-trained models for their transfer capabilities, in a zero-shot / few-shot or limited finetuning regime. For these purposes, the Perception Test introduces 11.6k real-world videos, 23s average length, designed to show perceptually interesting situations, filmed by around 100 participants worldwide. The videos are densely annotated with six types of labels (multiple-choice and grounded video question-answers, object and point tracks, temporal action and sound segments), enabling both language and non-language evaluations. The fine-tuning and validation splits of the benchmark are publicly available (CC-BY license), in addition to a challenge server with a held-out test split. Human baseline results compared to state-of-the-art video QA models show a significant gap in performance (91.4% vs 45.8%), suggesting that there is significant room for improvement in multimodal video understanding. Dataset, baseline code, and challenge server are available at https://github.com/deepmind/perception_tes
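For completeness, a toy scorer for a multiple-choice video-QA track of this kind might look as follows; the JSON field names are hypothetical placeholders, and the repository linked above should be consulted for the real annotation format and baseline code.

```python
import json


def multiple_choice_accuracy(annotation_path, predictions):
    """Toy scorer for a multiple-choice video-QA split.

    The field names below ('videos', 'questions', 'id', 'answer_id') are
    hypothetical placeholders, not the benchmark's actual schema.
    `predictions` maps a question id to the chosen option index."""
    with open(annotation_path) as f:
        data = json.load(f)
    correct = total = 0
    for video in data["videos"]:
        for q in video["questions"]:
            total += 1
            correct += int(predictions.get(q["id"]) == q["answer_id"])
    return correct / max(total, 1)
```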
SceneNet: Understanding Real World Indoor Scenes With Synthetic Data
Scene understanding is a prerequisite to many high level tasks for any automated intelligent machine operating in real world environments. Recent attempts with supervised learning have shown promise in this direction but have also highlighted the need for an enormous quantity of supervised data: performance increases in proportion to the amount of data used. However, this quickly becomes prohibitive when considering the manual labour needed to collect such data. In this work, we focus our attention on depth-based semantic per-pixel labelling as a scene understanding problem and show the potential of computer graphics to generate virtually unlimited labelled data from synthetic 3D scenes. By carefully synthesizing training data with appropriate noise models, we show comparable performance to state-of-the-art RGBD systems on the NYUv2 dataset despite using only depth data as input, and set a benchmark for depth-based segmentation on the SUN RGB-D dataset. Additionally, we offer a route to generating synthesized frame or video data, and an understanding of the different factors influencing performance gains.
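As an illustration of "synthesizing training data with appropriate noise models", the sketch below corrupts a clean synthetic depth map with a generic RGB-D sensor noise model (axial noise growing quadratically with depth, small lateral jitter, and disparity quantisation). The constants follow commonly published sensor models and are not necessarily those used for SceneNet.

```python
import numpy as np


def simulate_depth_noise(clean_depth_m, rng=None):
    """Corrupt a clean synthetic depth map (metres) with Kinect-style noise."""
    rng = np.random.default_rng() if rng is None else rng
    d = clean_depth_m.astype(np.float32)

    # axial noise: std ~ a + b * (z - 0.4)^2, a commonly cited RGB-D model
    sigma_axial = 0.0012 + 0.0019 * (d - 0.4) ** 2
    noisy = d + rng.normal(0.0, 1.0, d.shape) * sigma_axial

    # lateral jitter: displace each pixel by up to one pixel
    h, w = d.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    jy = np.clip(ys + rng.integers(-1, 2, d.shape), 0, h - 1)
    jx = np.clip(xs + rng.integers(-1, 2, d.shape), 0, w - 1)
    noisy = noisy[jy, jx]

    # disparity quantisation in 1/8-pixel steps (illustrative constant)
    baseline_focal = 35130.0
    disparity = np.round(baseline_focal / (noisy * 1000.0) * 8.0) / 8.0
    return baseline_focal / np.maximum(disparity, 1e-6) / 1000.0


# toy usage on a flat synthetic wall at 2 m
noisy_depth = simulate_depth_noise(np.full((120, 160), 2.0))
```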
SynthCam3D: Semantic Understanding With Synthetic Indoor Scenes
We are interested in automatic scene understanding from geometric cues. To this end, we aim to bring semantic segmentation into the loop of real-time reconstruction. Our semantic segmentation is built on a deep autoencoder stack trained exclusively on synthetic depth data generated from our novel 3D scene library, SynthCam3D. Importantly, our network is able to segment real world scenes without any noise modelling. We present encouraging preliminary results.
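A minimal depth-only encoder-decoder labeller in the spirit of the description above might look like the following sketch; the channel widths, stack depth, and 13-class output are illustrative placeholders, not the SynthCam3D network.

```python
import torch
import torch.nn as nn


class DepthSegNet(nn.Module):
    """Minimal encoder-decoder per-pixel labeller taking one depth channel."""
    def __init__(self, n_classes=13):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, n_classes, 4, stride=2, padding=1),
        )

    def forward(self, depth):                        # depth: (B, 1, H, W)
        return self.decoder(self.encoder(depth))     # logits: (B, n_classes, H, W)


# training step on dummy synthetic depth renders with known labels
net = DepthSegNet()
logits = net(torch.rand(2, 1, 64, 64))
loss = nn.functional.cross_entropy(logits, torch.randint(0, 13, (2, 64, 64)))
```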
A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames
Understanding long, real-world videos requires modeling of long-range visual dependencies. To this end, we explore video-first architectures, building on the common paradigm of transferring large-scale image-text models to video via shallow temporal fusion. However, we expose two limitations of the approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed. To mitigate the memory bottleneck, we systematically analyze the memory/accuracy trade-off of various efficient methods: factorized attention, parameter-efficient image-to-video adaptation, input masking, and multi-resolution patchification. Surprisingly, simply masking large portions of the video (up to 75%) during contrastive pre-training proves to be one of the most robust ways to scale encoders to videos up to 4.3 minutes long at 1 FPS. Our simple approach for training long video-to-text models, which scales to 1B parameters, does not add new architectural complexity and is able to outperform the popular paradigm of using much larger LLMs as an information aggregator over segment-based information on benchmarks with long-range temporal dependencies (YouCook2, EgoSchema).
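A compact sketch of the masking idea is below: roughly 75% of the video patch tokens are discarded before the encoder runs, and the surviving tokens are pooled into a video embedding trained with a symmetric contrastive (InfoNCE-style) loss against text embeddings. The module sizes, the 0.07 temperature, and the mean pooling are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MaskedVideoContrastive(nn.Module):
    """Contrastive video-text pre-training with heavy input masking (sketch).

    The patch embedding, the tiny transformer, and the text projection are
    placeholders; the point is that ~75% of video patch tokens are dropped
    *before* encoding, which is what keeps memory manageable for long clips."""
    def __init__(self, dim=256, keep_ratio=0.25):
        super().__init__()
        self.keep_ratio = keep_ratio
        self.patch = nn.Linear(3 * 16 * 16, dim)          # flattened 16x16 RGB patches
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.text_proj = nn.Linear(512, dim)              # dummy text features in

    def forward(self, patches, text_feats):
        b, n, d = patches.shape
        keep = max(1, int(n * self.keep_ratio))
        idx = torch.rand(b, n, device=patches.device).argsort(dim=1)[:, :keep]
        visible = torch.gather(patches, 1, idx.unsqueeze(-1).expand(-1, -1, d))
        v = self.encoder(self.patch(visible)).mean(dim=1)  # pooled video embedding
        t = self.text_proj(text_feats)
        v, t = F.normalize(v, dim=-1), F.normalize(t, dim=-1)
        logits = v @ t.T / 0.07                            # temperature-scaled similarities
        labels = torch.arange(b, device=logits.device)
        return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.T, labels))


# toy usage: a long clip produces many tokens, but only 25% reach the encoder
model = MaskedVideoContrastive()
loss = model(torch.rand(4, 2048, 3 * 16 * 16), torch.rand(4, 512))
```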