
    Affine invariant visual phrases for object instance recognition

    Object instance recognition approaches based on the bag-of-words model are severely affected by the loss of spatial consistency during retrieval. As a result, costly RANSAC verification is needed to ensure geometric consistency between the query and the retrieved images. A common alternative is to inject geometric information directly into the retrieval procedure, by endowing the visual words with additional information. Most of the existing approaches in this category can efficiently handle only restricted classes of geometric transformations, including scale and translation. In this paper, we propose a simple and efficient scheme that can cover the more complex class of full affine transformations. We demonstrate the usefulness of our approach in the case of planar object instance recognition, such as recognition of books, logos, traffic signs, etc. This work was funded by a Google Faculty Research Award, the Marie Curie grant CIG-334283-HRGP, and a CNRS chaire d'excellence. This is the author accepted manuscript. The final version is available at http://dx.doi.org/10.1109/MVA.2015.715312
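
    The useful property behind such affine-invariant phrases can be sketched in a few lines: if each affine-covariant feature carries a local affine frame, the relative transform between two frames is unchanged by any global affine map of the image, so a coarse quantisation of it can be attached to a pair of visual words. The NumPy sketch below (toy values; not the paper's exact encoding or quantisation) only illustrates this invariance.

```python
import numpy as np

def affine_frame(x, y, a, b, c, d):
    """3x3 homogeneous matrix of a local affine frame: the 2x2 block holds
    the frame axes, the last column the feature centre."""
    return np.array([[a, b, x],
                     [c, d, y],
                     [0.0, 0.0, 1.0]])

def relative_affine(Ai, Aj):
    """Relative transform between two local frames. Under a global affine map T
    the frames become T @ Ai and T @ Aj, and inv(T @ Ai) @ (T @ Aj) == inv(Ai) @ Aj,
    so this quantity is invariant to the global transformation."""
    return np.linalg.inv(Ai) @ Aj

def quantize(M, step=0.5):
    """Coarsely quantise the relative transform so it can be appended to the
    pair of visual word ids to form a discrete phrase."""
    return tuple(np.round(M[:2, :] / step).astype(int).ravel())

# Toy check: the quantised relative geometry of a feature pair is unchanged
# after applying an arbitrary global affine transformation T to the image.
Ai = affine_frame(10, 20, 1.2, 0.1, -0.2, 0.9)
Aj = affine_frame(40, 35, 0.8, 0.3, 0.0, 1.1)
T = np.array([[1.5, 0.4, 5.0], [-0.3, 1.1, 2.0], [0.0, 0.0, 1.0]])
assert quantize(relative_affine(Ai, Aj)) == quantize(relative_affine(T @ Ai, T @ Aj))
```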

    Spatio-temporal video autoencoder with differentiable memory

    We describe a new spatio-temporal video autoencoder, based on a classic spatial image autoencoder and a novel nested temporal autoencoder. The temporal encoder is represented by a differentiable visual memory composed of convolutional long short-term memory (LSTM) cells that integrate changes over time. Here we target motion changes and use as temporal decoder a robust optical flow prediction module together with an image sampler serving as a built-in feedback loop. The architecture is end-to-end differentiable. At each time step, the system receives as input a video frame, predicts the optical flow based on the current observation and the LSTM memory state as a dense transformation map, and applies it to the current frame to generate the next frame. By minimising the reconstruction error between the predicted next frame and the corresponding ground truth next frame, we train the whole system to extract features useful for motion estimation without any supervision effort. We present one direct application of the proposed framework in weakly-supervised semantic segmentation of videos through label propagation using optical flow.
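
    A minimal PyTorch sketch of the unsupervised training signal described above: predict a dense flow map from the current frame, warp the frame with a differentiable image sampler (grid_sample), and minimise the reconstruction error against the true next frame. The tiny FlowPredictor network is only an illustrative stand-in, and the ConvLSTM memory that conditions the flow prediction in the paper is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FlowPredictor(nn.Module):
    """Tiny stand-in for the temporal encoder: maps the current frame to a
    dense 2-channel flow field (the paper conditions this on a ConvLSTM
    memory state, which is omitted here for brevity)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 2, 3, padding=1),
        )

    def forward(self, frame):
        return self.net(frame)  # (B, 2, H, W) flow in pixels

def warp(frame, flow):
    """Differentiable image sampler: resample `frame` at locations shifted by `flow`."""
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(frame)   # (2, H, W), (x, y) order
    coords = grid.unsqueeze(0) + flow                        # (B, 2, H, W)
    gx = 2 * coords[:, 0] / (w - 1) - 1                      # normalise to [-1, 1]
    gy = 2 * coords[:, 1] / (h - 1) - 1
    return F.grid_sample(frame, torch.stack((gx, gy), dim=-1), align_corners=True)

# One unsupervised training step: reconstruct frame t+1 by warping frame t
# with the predicted flow and minimising the reconstruction error.
model = FlowPredictor()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
frame_t, frame_t1 = torch.rand(2, 1, 3, 64, 64)              # toy consecutive frames
opt.zero_grad()
loss = F.l1_loss(warp(frame_t, model(frame_t)), frame_t1)
loss.backward()
opt.step()
```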

    Joint A Contrario Ellipse and Line Detection.

    This is the author accepted manuscript. The final version is available from IEEE via http://dx.doi.org/10.1109/TPAMI.2016.2558150. We propose a line segment and elliptical arc detector that produces a reduced number of false detections on various types of images without any parameter tuning. For a given region of pixels in a grey-scale image, the detector decides whether a line segment or an elliptical arc is present (model validation). If both interpretations are possible for the same region, the detector chooses the one that best explains the data (model selection). We describe a statistical criterion based on the a contrario theory, which serves for both validation and model selection. The experimental results highlight the performance of the proposed approach compared to state-of-the-art detectors, when applied on synthetic and real images. This work was partially funded by the Qualcomm postdoctoral program at École Polytechnique Palaiseau, a Google Faculty Research Award, the Marie Curie grant CIG-334283-HRGP, a CNRS chaire d'excellence and chaire Jean Marjoulet, and EPSRC grant EP/L010917/1.
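
    The a contrario idea can be sketched with a binomial Number of False Alarms (NFA): a candidate region is accepted when the expected number of equally good regions arising by chance under a background model, over all tested regions, falls below a threshold (typically 1). The snippet below is a generic NFA computation of the kind used in LSD-style detectors, not the paper's exact criterion; for model selection, the interpretation (line segment vs. elliptical arc) with the smaller NFA would be kept.

```python
from math import comb

def nfa(n, k, p, n_tests):
    """Number of False Alarms: expected number of tested regions in which at
    least k of n pixels would agree with the candidate model purely by chance,
    when each pixel agrees independently with probability p."""
    tail = sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))
    return n_tests * tail

def is_meaningful(n, k, p, n_tests, eps=1.0):
    """Validate a candidate (line segment or elliptical arc) when the expected
    number of accidental detections is below eps."""
    return nfa(n, k, p, n_tests) <= eps

# Toy example: a 100-pixel candidate region, gradient orientation tolerance
# covering p = 1/8 of all angles, one million tested regions.
print(is_meaningful(n=100, k=60, p=1 / 8, n_tests=10**6))  # True: unlikely by chance
print(is_meaningful(n=100, k=20, p=1 / 8, n_tests=10**6))  # False: plausible by chance
```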

    Perception Test: A Diagnostic Benchmark for Multimodal Video Models

    We propose a novel multimodal video benchmark - the Perception Test - to evaluate the perception and reasoning skills of pre-trained multimodal models (e.g. Flamingo, BEiT-3, or GPT-4). Compared to existing benchmarks that focus on computational tasks (e.g. classification, detection or tracking), the Perception Test focuses on skills (Memory, Abstraction, Physics, Semantics) and types of reasoning (descriptive, explanatory, predictive, counterfactual) across video, audio, and text modalities, to provide a comprehensive and efficient evaluation tool. The benchmark probes pre-trained models for their transfer capabilities, in a zero-shot / few-shot or limited finetuning regime. For these purposes, the Perception Test introduces 11.6k real-world videos (23s average length) designed to show perceptually interesting situations, filmed by around 100 participants worldwide. The videos are densely annotated with six types of labels (multiple-choice and grounded video question-answers, object and point tracks, temporal action and sound segments), enabling both language and non-language evaluations. The fine-tuning and validation splits of the benchmark are publicly available (CC-BY license), in addition to a challenge server with a held-out test split. Human baseline results compared to state-of-the-art video QA models show a significant gap in performance (91.4% vs 45.8%), suggesting that there is substantial room for improvement in multimodal video understanding. Dataset, baseline code, and challenge server are available at https://github.com/deepmind/perception_tes
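
    As an illustration of the zero-shot multiple-choice protocol, the sketch below scores every answer option with an arbitrary pre-trained model and reports accuracy. The score_fn interface and the example dictionary keys are hypothetical and not prescribed by the benchmark.

```python
from typing import Callable, Dict, Sequence

def zero_shot_accuracy(
    examples: Sequence[Dict],
    score_fn: Callable[[str, str, str], float],
) -> float:
    """Multiple-choice evaluation: pick the highest-scoring option per question.
    `score_fn(video, question, option)` stands in for any pre-trained multimodal
    model; the dictionary keys used here are a hypothetical format, not the
    benchmark's actual annotation schema."""
    correct = 0
    for ex in examples:  # ex: {"video": str, "question": str, "options": [str, ...], "answer_id": int}
        scores = [score_fn(ex["video"], ex["question"], opt) for opt in ex["options"]]
        correct += int(max(range(len(scores)), key=scores.__getitem__) == ex["answer_id"])
    return correct / len(examples)

# Toy usage with a dummy scorer that always prefers the longest option.
dummy = [{"video": "v.mp4", "question": "What happens?", "options": ["a", "bb", "c"], "answer_id": 1}]
print(zero_shot_accuracy(dummy, lambda v, q, o: float(len(o))))  # 1.0
```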

    SceneNet: Understanding Real World Indoor Scenes With Synthetic Data

    Scene understanding is a prerequisite to many high-level tasks for any automated intelligent machine operating in real-world environments. Recent attempts with supervised learning have shown promise in this direction but also highlighted the need for an enormous quantity of supervised data: performance increases in proportion to the amount of data used. However, this quickly becomes prohibitive when considering the manual labour needed to collect such data. In this work, we focus our attention on depth-based semantic per-pixel labelling as a scene understanding problem and show the potential of computer graphics to generate virtually unlimited labelled data from synthetic 3D scenes. By carefully synthesizing training data with appropriate noise models, we show comparable performance to state-of-the-art RGBD systems on the NYUv2 dataset despite using only depth data as input, and set a benchmark on depth-based segmentation on the SUN RGB-D dataset. Additionally, we offer a route to generating synthesized frame or video data, and an understanding of the different factors influencing performance gains.
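
    The abstract does not spell out the noise model, but the idea of carefully synthesizing training data with appropriate noise can be illustrated by corrupting clean synthetic depth with a simple Kinect-style model (depth-dependent Gaussian noise, quantisation, and missing pixels). The constants below follow common structured-light noise fits and are only an assumption, not the paper's model.

```python
import numpy as np

def corrupt_depth(depth_m, rng=None):
    """Corrupt a clean synthetic depth map (in metres) with a simple
    Kinect-style model: depth-dependent Gaussian noise, coarse quantisation,
    and randomly missing pixels. Illustrative only; the paper's actual noise
    model may differ."""
    rng = np.random.default_rng() if rng is None else rng
    d = depth_m.astype(np.float64)
    # Axial noise grows roughly quadratically with depth for structured-light sensors.
    d = d + rng.normal(0.0, 0.0012 + 0.0019 * (d - 0.4) ** 2)
    # Crude stand-in for disparity quantisation: round depth to 1 cm.
    d = np.round(d * 100.0) / 100.0
    # Drop a small fraction of pixels to mimic missing returns.
    d[rng.random(d.shape) < 0.02] = 0.0
    return d.astype(np.float32)

noisy = corrupt_depth(np.full((480, 640), 2.5, dtype=np.float32))  # toy 2.5 m plane
```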

    SynthCam3D: Semantic Understanding With Synthetic Indoor Scenes

    We are interested in automatic scene understanding from geometric cues. To this end, we aim to bring semantic segmentation into the loop of real-time reconstruction. Our semantic segmentation is built on a deep autoencoder stack trained exclusively on synthetic depth data generated from our novel 3D scene library, SynthCam3D. Importantly, our network is able to segment real-world scenes without any noise modelling. We present encouraging preliminary results.
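
    For illustration only, a per-pixel labelling network operating on a single depth channel can be as small as the encoder-decoder below, trained with per-pixel cross-entropy on synthetic depth/label pairs; it is a generic stand-in, not the paper's autoencoder stack.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthSegNet(nn.Module):
    """Minimal encoder-decoder for per-pixel labelling from a single depth
    channel. A generic illustration, not the paper's architecture; the
    default of 14 classes is arbitrary."""
    def __init__(self, num_classes=14):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, num_classes, 4, stride=2, padding=1),
        )

    def forward(self, depth):                      # depth: (B, 1, H, W)
        return self.decoder(self.encoder(depth))   # logits: (B, C, H, W)

# Training on synthetic depth/label pairs reduces to per-pixel cross-entropy.
net = DepthSegNet()
depth = torch.rand(2, 1, 64, 64)                   # toy synthetic depth
labels = torch.randint(0, 14, (2, 64, 64))         # toy per-pixel class labels
loss = F.cross_entropy(net(depth), labels)
loss.backward()
```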

    A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames

    Understanding long, real-world videos requires modeling of long-range visual dependencies. To this end, we explore video-first architectures, building on the common paradigm of transferring large-scale, image-text models to video via shallow temporal fusion. However, we expose two limitations of this approach: (1) decreased spatial capabilities, likely due to poor video-language alignment in standard video datasets, and (2) higher memory consumption, bottlenecking the number of frames that can be processed. To mitigate the memory bottleneck, we systematically analyze the memory/accuracy trade-off of various efficient methods: factorized attention, parameter-efficient image-to-video adaptation, input masking, and multi-resolution patchification. Surprisingly, simply masking large portions of the video (up to 75%) during contrastive pre-training proves to be one of the most robust ways to scale encoders to videos up to 4.3 minutes at 1 FPS. Our simple approach for training long video-to-text models, which scales to 1B parameters, does not add new architectural complexity and is able to outperform the popular paradigm of using much larger LLMs as an information aggregator over segment-based information on benchmarks with long-range temporal dependencies (YouCook2, EgoSchema).
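
    The masking strategy can be sketched in a few lines: drop a random 75% of the video patch tokens per sample before the encoder, so attention and activation memory shrink roughly in proportion to the kept fraction. The snippet below is a generic token-masking sketch, not the paper's implementation.

```python
import torch

def random_token_mask(tokens, keep_ratio=0.25):
    """Keep a random `keep_ratio` of the patch tokens per sample (i.e. mask 75%
    by default); the encoder then only attends over the kept tokens, cutting
    memory roughly in proportion. tokens: (B, N, D) -> (B, int(N*keep_ratio), D)."""
    b, n, d = tokens.shape
    n_keep = max(1, int(n * keep_ratio))
    keep_idx = torch.rand(b, n, device=tokens.device).argsort(dim=1)[:, :n_keep]
    return torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))

# Toy example: 32 frames x 196 patches per frame, 768-d tokens, keep 25%.
video_tokens = torch.randn(1, 32 * 196, 768)
kept = random_token_mask(video_tokens)             # shape (1, 1568, 768)
```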