Hyper-VolTran: Fast and Generalizable One-Shot Image to 3D Object Structure via HyperNetworks
Solving image-to-3D from a single view is an ill-posed problem, and current
neural reconstruction methods addressing it through diffusion models still rely
on scene-specific optimization, constraining their generalization capability.
To overcome the limitations of existing approaches regarding generalization and
consistency, we introduce a novel neural rendering technique. Our approach
employs the signed distance function as the surface representation and
incorporates generalizable priors through geometry-encoding volumes and
HyperNetworks. Specifically, our method builds neural encoding volumes from
generated multi-view inputs. We adjust the weights of the SDF network
conditioned on an input image at test-time to allow model adaptation to novel
scenes in a feed-forward manner via HyperNetworks. To mitigate artifacts
derived from the synthesized views, we propose the use of a volume transformer
module to improve the aggregation of image features instead of processing each
viewpoint separately. Through our proposed method, dubbed Hyper-VolTran, we
avoid the bottleneck of scene-specific optimization and maintain consistency
across the images generated from multiple viewpoints. Our experiments show the
advantages of our proposed approach with consistent results and rapid
generation.
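To make the hypernetwork idea concrete, the sketch below shows one way a network could map an image embedding to the weights of a small SDF MLP, so that adaptation to a novel scene is a single feed-forward pass. This is an illustrative PyTorch sketch, not the authors' implementation; the module name (SDFHyperNetwork), layer sizes, and the softplus activation are assumptions.

```python
# Illustrative sketch (assumed architecture, not the authors' code): a hypernetwork
# predicts the weights of a one-hidden-layer SDF MLP from an image embedding.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SDFHyperNetwork(nn.Module):
    """Maps an image embedding to the parameters of a small SDF network."""

    def __init__(self, embed_dim=256, hidden_dim=64, point_dim=3):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.point_dim = point_dim
        # Heads emitting the flattened parameters of the target SDF MLP.
        self.w1 = nn.Linear(embed_dim, hidden_dim * point_dim)
        self.b1 = nn.Linear(embed_dim, hidden_dim)
        self.w2 = nn.Linear(embed_dim, hidden_dim)  # hidden -> scalar SDF
        self.b2 = nn.Linear(embed_dim, 1)

    def forward(self, image_embedding, points):
        # image_embedding: (B, embed_dim); points: (B, N, 3) 3D query locations.
        B = points.shape[0]
        W1 = self.w1(image_embedding).view(B, self.hidden_dim, self.point_dim)
        b1 = self.b1(image_embedding).view(B, 1, self.hidden_dim)
        W2 = self.w2(image_embedding).view(B, self.hidden_dim, 1)
        b2 = self.b2(image_embedding).view(B, 1, 1)
        # Evaluate the generated SDF MLP at the query points.
        h = F.softplus(points @ W1.transpose(1, 2) + b1)
        return (h @ W2 + b2).squeeze(-1)  # (B, N) signed distances


# One feed-forward adaptation: no per-scene optimization loop.
hyper = SDFHyperNetwork()
embedding = torch.randn(2, 256)        # stand-in for an image encoder output
query_points = torch.rand(2, 1024, 3)  # points to evaluate
print(hyper(embedding, query_points).shape)  # torch.Size([2, 1024])
```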
Negative Frames Matter in Egocentric Visual Query 2D Localization
The recently released Ego4D dataset and benchmark significantly scale and
diversify first-person visual perception data. In Ego4D, the Visual Queries 2D
Localization task aims to retrieve objects that appeared in the past from a
recording in the first-person view. This task requires a system to spatially
and temporally localize the most recent appearance of a given object query,
where the query is registered by a single tight visual crop of the object in a
different scene.
Our study is based on the three-stage baseline introduced in the Episodic
Memory benchmark. The baseline solves the problem by detection and tracking: it
detects similar objects in all frames, then runs a tracker from the most
confident detection. In the VQ2D challenge, we identified two
limitations of the current baseline. (1) The training configuration involves
redundant computation. Although the training set has millions of instances,
most of them are repetitive and the number of unique objects is only around
14.6k; repeated gradient computation on the same objects leads to inefficient
training. (2) The false positive rate is high on background frames.
This is due to the distribution gap between training and evaluation: during
training, the model only sees clean, stable, labeled frames, whereas egocentric
videos also contain noisy, blurry, or unlabeled background frames. To this end,
we developed a more efficient and effective solution. Concretely, we reduce the
training time from ~15 days to less than 24 hours, and we achieve 0.17%
spatial-temporal AP, which is 31% higher than the baseline. Our solution ranked
first on the public leaderboard. Our code is
publicly available at https://github.com/facebookresearch/vq2d_cvpr.
Comment: First-place winning solution for the VQ2D task in the CVPR 2022 Ego4D
Challenge. Our code is publicly available at
https://github.com/facebookresearch/vq2d_cvpr
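One plausible way to act on the paper's observation about background frames, suggested by its title rather than spelled out in the abstract, is to mix unannotated background frames into each training batch as negatives. The sketch below illustrates that sampling idea only; the function name, the 50% negative ratio, and the (frame, is_positive) output format are assumptions, not the released training code.

```python
# Illustrative sketch of mixing background (negative) frames into each batch so
# the detector also learns from frames that do not contain the query object.
# Sampling ratio and helper names are assumptions.
import random


def sample_training_frames(positive_frames, background_frames,
                           batch_size=8, negative_ratio=0.5, rng=random):
    """Build one batch with a fixed fraction of background-only frames.

    positive_frames:   frames with a visible, annotated query object.
    background_frames: frames with no annotation for the query (negatives).
    Returns a list of (frame, is_positive) pairs.
    """
    n_neg = int(batch_size * negative_ratio)
    n_pos = batch_size - n_neg
    batch = [(f, True) for f in rng.sample(positive_frames, n_pos)]
    batch += [(f, False) for f in rng.sample(background_frames, n_neg)]
    rng.shuffle(batch)
    return batch


# Background frames contribute only a "no match" target, which penalizes
# false positives that the baseline produces on unlabeled frames.
pos = [f"pos_{i}" for i in range(100)]
neg = [f"bg_{i}" for i in range(1000)]
print(sample_training_frames(pos, neg, batch_size=4))
```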
Where is my Wallet? Modeling Object Proposal Sets for Egocentric Visual Query Localization
This paper deals with the problem of localizing objects in image and video
datasets from visual exemplars. In particular, we focus on the challenging
problem of egocentric visual query localization. We first identify grave
implicit biases in current query-conditioned model design and visual query
datasets. Then, we directly tackle such biases at both frame and object set
levels. Concretely, our method solves these issues by expanding limited
annotations and dynamically dropping object proposals during training.
Additionally, we propose a novel transformer-based module that allows for
object-proposal set context to be considered while incorporating query
information. We name our module Conditioned Contextual Transformer or
CocoFormer. Our experiments show the proposed adaptations improve egocentric
query detection, leading to a better visual query localization system in both
2D and 3D configurations. Thus, we are able to improve frame-level detection
performance from 26.28% to 31.26% AP, which correspondingly improves the VQ2D
and VQ3D localization scores by significant margins. Our improved context-aware
query object detector ranked first and second in the VQ2D and VQ3D tasks in the
2nd Ego4D challenge. In addition to this, we showcase the relevance of our
proposed model in the Few-Shot Detection (FSD) task, where we also achieve SOTA
results. Our code is available at
https://github.com/facebookresearch/vq2d_cvpr.
Comment: We ranked first and second in the VQ2D and VQ3D tasks in the 2nd Ego4D
challenge
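The sketch below illustrates the general idea of a query-conditioned set transformer over object proposals: every proposal attends to the rest of the set, so each match score is made with set-level context. It is an assumed simplification, not the CocoFormer release; the fusion-by-addition of the query feature, the layer count, and the dimensions are illustrative.

```python
# Illustrative sketch of a query-conditioned transformer over a proposal set
# (assumed simplification of the idea, not the CocoFormer implementation).
import torch
import torch.nn as nn


class ConditionedProposalTransformer(nn.Module):
    def __init__(self, dim=256, heads=8, layers=2):
        super().__init__()
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.set_encoder = nn.TransformerEncoder(encoder_layer, num_layers=layers)
        self.query_proj = nn.Linear(dim, dim)
        self.score_head = nn.Linear(dim, 1)

    def forward(self, proposal_feats, query_feat):
        # proposal_feats: (B, P, dim) features of P proposals in a frame.
        # query_feat:     (B, dim)    feature of the visual query crop.
        q = self.query_proj(query_feat).unsqueeze(1)   # (B, 1, dim)
        conditioned = proposal_feats + q               # inject query into every proposal
        context = self.set_encoder(conditioned)        # proposals attend to each other
        return self.score_head(context).squeeze(-1)    # (B, P) match logits


model = ConditionedProposalTransformer()
scores = model(torch.randn(2, 50, 256), torch.randn(2, 256))
print(scores.shape)  # torch.Size([2, 50])
```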
Multi-Modal Few-Shot Temporal Action Detection via Vision-Language Meta-Adaptation
Few-shot (FS) and zero-shot (ZS) learning are two different approaches for
scaling temporal action detection (TAD) to new classes. The former adapts a
pretrained vision model to a new task represented by as few as a single video
per class, whilst the latter requires no training examples by exploiting a
semantic description of the new class. In this work, we introduce a new
multi-modality few-shot (MMFS) TAD problem, which can be considered a
marriage of FS-TAD and ZS-TAD by leveraging few-shot support videos and new
class names jointly. To tackle this problem, we further introduce a novel
MUlti-modality PromPt mETa-learning (MUPPET) method. This is enabled by
efficiently bridging pretrained vision and language models whilst maximally
reusing already learned capacity. Concretely, we construct multi-modal prompts
by mapping support videos into the textual token space of a vision-language
model using a meta-learned adapter-equipped visual semantics tokenizer. To
tackle large intra-class variation, we further design a query feature
regulation scheme. Extensive experiments on ActivityNetv1.3 and THUMOS14
demonstrate that our MUPPET outperforms state-of-the-art alternative methods,
often by a large margin. We also show that our MUPPET can be easily extended to
tackle the few-shot object detection problem, again achieving state-of-the-art
performance on the MS-COCO dataset. The code will be available at
https://github.com/sauradip/MUPPET
Comment: Technical Report
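The sketch below illustrates the multi-modal prompt construction described in the abstract: an adapter maps support-video features into the text token embedding space of a vision-language model, and the resulting visual tokens are concatenated with class-name token embeddings to form a prompt. The adapter architecture, the mean pooling, and the dimensions are assumptions; MUPPET meta-learns this tokenizer, and the abstract does not specify these details.

```python
# Illustrative sketch of an adapter-equipped visual semantics tokenizer
# (assumed architecture): support-video features are mapped into the text
# token space and prepended to class-name token embeddings as a prompt.
import torch
import torch.nn as nn


class VisualSemanticsTokenizer(nn.Module):
    def __init__(self, video_dim=768, text_dim=512, num_prompt_tokens=4):
        super().__init__()
        self.num_prompt_tokens = num_prompt_tokens
        self.text_dim = text_dim
        # Lightweight bottleneck adapter emitting one embedding per prompt token.
        self.adapter = nn.Sequential(
            nn.Linear(video_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_prompt_tokens * text_dim),
        )

    def forward(self, support_video_feats):
        # support_video_feats: (B, T, video_dim) clip features of a support video.
        pooled = support_video_feats.mean(dim=1)  # temporal mean pooling (assumption)
        tokens = self.adapter(pooled)             # (B, K * text_dim)
        return tokens.view(-1, self.num_prompt_tokens, self.text_dim)


def build_multimodal_prompt(visual_tokens, class_name_embeds):
    # Concatenate visual prompt tokens with the class-name token embeddings.
    return torch.cat([visual_tokens, class_name_embeds], dim=1)


tokenizer = VisualSemanticsTokenizer()
visual = tokenizer(torch.randn(2, 16, 768))                   # 16 clips per support video
prompt = build_multimodal_prompt(visual, torch.randn(2, 3, 512))
print(prompt.shape)  # torch.Size([2, 7, 512])
```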