Learning how to be robust: Deep polynomial regression
Polynomial regression is a recurrent problem with a large number of
applications. In computer vision it often appears in motion analysis. Whatever
the application, standard methods for regression of polynomial models tend to
deliver biased results when the input data is heavily contaminated by outliers.
Moreover, the problem is even harder when outliers have strong structure.
Departing from problem-tailored heuristics for robust estimation of parametric
models, we explore deep convolutional neural networks. Our work aims to find a
generic approach for training deep regression models without the explicit need
for supervised annotation. We bypass the need for a tailored loss function on
the regression parameters by attaching to our model a differentiable hard-wired
decoder corresponding to the polynomial operation at hand. We demonstrate the
value of our findings by comparing with standard robust regression methods.
Furthermore, we demonstrate how to use such models for a real computer vision
problem, i.e., video stabilization. The qualitative and quantitative
experiments show that neural networks are able to learn robustness for general
polynomial regression, with results that clearly surpass those of traditional
robust estimation methods.
Comment: 18 pages, conference
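Below is a minimal, hypothetical PyTorch sketch of the idea described in this abstract: a learned regressor predicts polynomial coefficients, and a fixed, differentiable polynomial decoder reconstructs the observed samples, so the loss is computed on reconstructions rather than on annotated coefficients. The architecture, degree, and Huber loss are illustrative assumptions, not the paper's exact design.

```python
# Hypothetical sketch: regressor + hard-wired differentiable polynomial decoder.
import torch
import torch.nn as nn

DEGREE = 2      # assumed polynomial degree
N_POINTS = 64   # assumed number of (x, y) samples per instance


class CoefficientRegressor(nn.Module):
    """Maps a set of noisy (x, y) samples to polynomial coefficients."""

    def __init__(self, n_points=N_POINTS, degree=DEGREE):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * n_points, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, degree + 1),   # one output per coefficient
        )

    def forward(self, xy):                # xy: (B, N, 2)
        return self.net(xy.flatten(1))    # (B, degree + 1)


def polynomial_decoder(coeffs, x):
    """Hard-wired, differentiable decoder: evaluates y = sum_k c_k * x^k."""
    powers = torch.stack([x ** k for k in range(coeffs.shape[1])], dim=-1)  # (B, N, K)
    return (powers * coeffs.unsqueeze(1)).sum(-1)                           # (B, N)


model = CoefficientRegressor()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Toy batch: quadratic data contaminated with a structured outlier block.
x = torch.linspace(-1, 1, N_POINTS).repeat(8, 1)
y = 0.5 * x ** 2 - 0.3 * x + 0.1 + 0.01 * torch.randn_like(x)
y[:, :10] += 2.0

coeffs = model(torch.stack([x, y], dim=-1))
y_hat = polynomial_decoder(coeffs, x)
loss = nn.functional.huber_loss(y_hat, y)   # loss on reconstructions, not coefficients
loss.backward()
opt.step()
```

Because the decoder is fixed and differentiable, gradients flow through it into the regressor, which is what removes the need for a tailored loss on the coefficients themselves.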
Incremental Few-Shot Object Detection
Most existing object detection methods rely on the availability of abundant
labelled training samples per class and offline model training in a batch mode.
These requirements substantially limit their scalability to open-ended
accommodation of novel classes with limited labelled training data. We present
a study aiming to go beyond these limitations by considering the Incremental
Few-Shot Detection (iFSD) problem setting, where new classes must be registered
incrementally (without revisiting base classes) and with few examples. To this
end we propose OpeN-ended Centre nEt (ONCE), a detector designed for
incrementally learning to detect novel class objects with few examples. This is
achieved by an elegant adaptation of the CentreNet detector to the few-shot
learning scenario, and meta-learning a class-specific code generator model for
registering novel classes. ONCE fully respects the incremental learning
paradigm, with novel class registration requiring only a single forward pass of
few-shot training samples, and no access to base classes -- thus making it
suitable for deployment on embedded devices. Extensive experiments conducted on
both the standard object detection and fashion landmark detection tasks show
the feasibility of iFSD for the first time, opening an interesting and very
important line of research.
Comment: CVPR 2020
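A hedged PyTorch sketch of the registration mechanism described above: a meta-learned class code generator maps a few support-image features to class-specific weights for a CentreNet-style heatmap head, so a novel class is added with a single forward pass and no base-class data. Module names, feature sizes, and the averaging over shots are assumptions for illustration, not the ONCE implementation.

```python
# Hypothetical sketch: class code generator + dynamic per-class centre heatmap.
import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT_DIM = 64   # assumed backbone feature channels


class ClassCodeGenerator(nn.Module):
    """Produces a 1x1-conv 'class code' from a few support feature maps."""

    def __init__(self, feat_dim=FEAT_DIM):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, support_feats):          # (K_shots, C, H, W)
        codes = self.encoder(support_feats)    # (K_shots, C, 1, 1)
        return codes.mean(0, keepdim=True)     # (1, C, 1, 1): one code per class


def class_heatmap(query_feat, class_code):
    """CentreNet-style per-class centre heatmap via a dynamic 1x1 convolution."""
    return torch.sigmoid(F.conv2d(query_feat, class_code))   # (B, 1, H, W)


backbone = nn.Conv2d(3, FEAT_DIM, 3, padding=1)   # stand-in feature extractor
generator = ClassCodeGenerator()

# "Registration": one forward pass over K=5 support crops of the novel class.
support = backbone(torch.randn(5, 3, 32, 32))
code = generator(support)

# Inference on a query image reuses the frozen backbone plus the new code.
query = backbone(torch.randn(1, 3, 128, 128))
heatmap = class_heatmap(query, code)
print(heatmap.shape)   # torch.Size([1, 1, 128, 128])
```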
Hyper-VolTran: Fast and Generalizable One-Shot Image to 3D Object Structure via HyperNetworks
Solving image-to-3D from a single view is an ill-posed problem, and current
neural reconstruction methods addressing it through diffusion models still rely
on scene-specific optimization, constraining their generalization capability.
To overcome the limitations of existing approaches regarding generalization and
consistency, we introduce a novel neural rendering technique. Our approach
employs the signed distance function as the surface representation and
incorporates generalizable priors through geometry-encoding volumes and
HyperNetworks. Specifically, our method builds neural encoding volumes from
generated multi-view inputs. At test time, HyperNetworks adjust the weights of
the SDF network conditioned on the input image, enabling feed-forward adaptation
to novel scenes. To mitigate artifacts
derived from the synthesized views, we propose the use of a volume transformer
module to improve the aggregation of image features instead of processing each
viewpoint separately. Through our proposed method, dubbed Hyper-VolTran, we
avoid the bottleneck of scene-specific optimization and maintain consistency
across the images generated from multiple viewpoints. Our experiments show the
advantages of our proposed approach, with consistent results and rapid
generation.
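A hedged PyTorch sketch of the HyperNetwork component described above: an image embedding is mapped to the weights of part of an SDF MLP, so adapting to a new scene is a single feed-forward pass rather than per-scene optimization. The encoder, dimensions, and the choice of which layer is hyper-predicted are illustrative assumptions, not the Hyper-VolTran architecture.

```python
# Hypothetical sketch: a HyperNetwork predicts the final SDF-layer weights
# from an image embedding, giving feed-forward adaptation to a new scene.
import torch
import torch.nn as nn

EMB_DIM, HID = 128, 64   # assumed embedding and SDF hidden sizes


class HyperSDF(nn.Module):
    def __init__(self):
        super().__init__()
        # Shared (scene-agnostic) part of the SDF network.
        self.sdf_in = nn.Linear(3, HID)
        # HyperNetwork: predicts the weights and bias of the final SDF layer.
        self.hyper = nn.Linear(EMB_DIM, HID + 1)

    def forward(self, points, image_emb):
        # points: (N, 3) query locations, image_emb: (EMB_DIM,) per scene.
        params = self.hyper(image_emb)         # (HID + 1,)
        w, b = params[:HID], params[HID:]      # dynamic last-layer weights
        h = torch.relu(self.sdf_in(points))    # (N, HID)
        return h @ w + b                       # signed distances, (N,)


model = HyperSDF()
image_emb = torch.randn(EMB_DIM)        # stand-in for an image encoder output
points = torch.rand(1024, 3) * 2 - 1    # query points in [-1, 1]^3
sdf = model(points, image_emb)
print(sdf.shape)                        # torch.Size([1024])
```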
Negative Frames Matter in Egocentric Visual Query 2D Localization
The recently released Ego4D dataset and benchmark significantly scale and
diversify first-person visual perception data. In Ego4D, the Visual
Queries 2D Localization task aims to retrieve objects that appeared in the past from
the recording in the first-person view. This task requires a system to
spatially and temporally localize the most recent appearance of a given object
query, where the query is registered by a single tight visual crop of the object in
a different scene.
Our study is based on the three-stage baseline introduced in the Episodic
Memory benchmark. The baseline solves the problem by detection and tracking:
detect the similar objects in all the frames, then run a tracker from the most
confident detection result. In the VQ2D challenge, we identified two
limitations of the current baseline. (1) The training configuration has
redundant computation. Although the training set has millions of instances,
most of them are repetitive and the number of unique objects is only around
14.6k. Repeated gradient computation on the same objects leads to inefficient
training. (2) The false positive rate is high on background frames.
This is due to the distribution gap between training and evaluation. During
training, the model only sees clean, stable, and labeled frames, whereas the
egocentric videos also contain noisy, blurry, or unlabeled background
frames. To this end, we developed a more efficient and effective solution.
Concretely, we reduce training time from ~15 days to less than 24 hours, and
we achieve 0.17% spatial-temporal AP, which is 31% higher than the baseline.
Our solution ranked first on the public leaderboard. Our code is publicly
available at https://github.com/facebookresearch/vq2d_cvpr.
Comment: First-place winning solution for the VQ2D task in the CVPR 2022 Ego4D
Challenge. Code is publicly available at
https://github.com/facebookresearch/vq2d_cvpr
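A hedged PyTorch sketch of the two fixes described above: training on unique object queries rather than every repeated instance, and mixing unlabeled background ("negative") frames into each batch so the matching head also learns when the query object is absent. The encoder, scoring head, and toy tensors are stand-ins, not the released implementation.

```python
# Hypothetical sketch: deduplicated queries + negative background frames.
import torch
import torch.nn as nn

embed = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))   # stand-in encoder
score = nn.Linear(256, 1)                                          # query-frame match score
opt = torch.optim.Adam(list(embed.parameters()) + list(score.parameters()), lr=1e-4)

# Toy data: unique query crops (deduplicated), positive frames that contain
# the query object, and background frames that do not.
unique_queries = torch.randn(16, 3, 32, 32)
positive_frames = torch.randn(16, 3, 32, 32)
background_frames = torch.randn(16, 3, 32, 32)

q = embed(unique_queries)
pos = embed(positive_frames)
neg = embed(background_frames)

# Positive pairs get target 1, query/background pairs get target 0, so the
# model also sees frames where the query object is absent.
logits = torch.cat([
    score(torch.cat([q, pos], dim=1)),
    score(torch.cat([q, neg], dim=1)),
]).squeeze(1)
targets = torch.cat([torch.ones(16), torch.zeros(16)])

loss = nn.functional.binary_cross_entropy_with_logits(logits, targets)
loss.backward()
opt.step()
```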
Where is my Wallet? Modeling Object Proposal Sets for Egocentric Visual Query Localization
This paper deals with the problem of localizing objects in image and video
datasets from visual exemplars. In particular, we focus on the challenging
problem of egocentric visual query localization. We first identify grave
implicit biases in current query-conditioned model design and visual query
datasets. Then, we directly tackle such biases at both frame and object set
levels. Concretely, our method solves these issues by expanding limited
annotations and dynamically dropping object proposals during training.
Additionally, we propose a novel transformer-based module that allows for
object-proposal set context to be considered while incorporating query
information. We name our module Conditioned Contextual Transformer or
CocoFormer. Our experiments show the proposed adaptations improve egocentric
query detection, leading to a better visual query localization system in both
2D and 3D configurations. Thus, we are able to improve frame-level detection
performance from 26.28% to 31.26% AP, which correspondingly improves the VQ2D
and VQ3D localization scores by significant margins. Our improved context-aware
query object detector ranked first and second in the VQ2D and VQ3D tasks in the
2nd Ego4D challenge. In addition to this, we showcase the relevance of our
proposed model in the Few-Shot Detection (FSD) task, where we also achieve SOTA
results. Our code is available at
https://github.com/facebookresearch/vq2d_cvpr.
Comment: We ranked first and second in the VQ2D and VQ3D tasks in the 2nd
Ego4D challenge
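A hedged PyTorch sketch of the two ingredients described above: randomly dropping object proposals during training, and a transformer that scores each proposal in the context of the full proposal set while conditioning on the query embedding. The fusion choice (prepending the query as a token) and all sizes are illustrative assumptions, not the CocoFormer design.

```python
# Hypothetical sketch: proposal dropping + query-conditioned set transformer.
import torch
import torch.nn as nn

DIM = 128   # assumed proposal / query embedding size


class ProposalSetScorer(nn.Module):
    def __init__(self, dim=DIM):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, 1)

    def forward(self, proposals, query):
        # proposals: (B, P, DIM), query: (B, DIM) -> prepend as a context token.
        tokens = torch.cat([query.unsqueeze(1), proposals], dim=1)
        ctx = self.encoder(tokens)[:, 1:]     # contextualized proposal features
        return self.head(ctx).squeeze(-1)     # (B, P) match scores


def drop_proposals(proposals, keep_ratio=0.7):
    """Training-time augmentation: keep a random subset of proposals."""
    n_keep = max(1, int(proposals.shape[1] * keep_ratio))
    idx = torch.randperm(proposals.shape[1])[:n_keep]
    return proposals[:, idx]


model = ProposalSetScorer()
proposals = torch.randn(2, 50, DIM)   # stand-in proposal embeddings
query = torch.randn(2, DIM)           # stand-in visual-query embedding
scores = model(drop_proposals(proposals), query)
print(scores.shape)                   # torch.Size([2, 35])
```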