EV-FlowNet: Self-Supervised Optical Flow Estimation for Event-based Cameras
Event-based cameras have shown great promise in a variety of situations where frame-based cameras suffer, such as high-speed motions and high-dynamic-range scenes. However, developing algorithms for event measurements requires a new class of hand-crafted algorithms. Deep learning has shown great success in providing model-free solutions to many problems in the vision community, but existing networks have been developed with frame-based images in mind, and the wealth of labeled data that exists for images is not available for events for supervised training. To address these points, we present EV-FlowNet, a novel self-supervised deep learning pipeline for optical flow estimation for event-based cameras. In particular, we introduce an image-based representation of a given event stream, which is fed into a self-supervised neural network as the sole input. The corresponding grayscale images captured from the same camera at the same time as the events are then used as a supervisory signal to provide a loss function at training time, given the estimated flow from the network. We show that the resulting network is able to accurately predict optical flow from events alone in a variety of different scenes, with performance competitive to image-based networks. This method not only allows for accurate estimation of dense optical flow, but also provides a framework for the transfer of other self-supervised methods to the event-based domain.
Comment: 9 pages, 5 figures, 1 table. Accompanying video: https://youtu.be/eMHZBSoq0sE. Dataset: https://daniilidis-group.github.io/mvsec/. Robotics: Science and Systems 201
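To make the idea of an image-based event representation concrete, below is a minimal numpy sketch of one common encoding consistent with the description above: per-pixel event counts and most-recent timestamps, one pair of channels per polarity. The function name, channel layout, and normalization here are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def events_to_image(xs, ys, ts, ps, height, width):
    """Accumulate an event stream into a 4-channel image:
    per-polarity event counts plus most-recent timestamps.
    A sketch of the kind of image-based representation described
    above; the exact channel layout in the paper may differ."""
    img = np.zeros((4, height, width), dtype=np.float32)
    # Normalize timestamps to [0, 1] so the timestamp channels are scale-free.
    t_norm = (ts - ts.min()) / max(ts.max() - ts.min(), 1e-9)
    for x, y, t, p in zip(xs, ys, t_norm, ps):
        c = 0 if p > 0 else 1          # count-channel index chosen by polarity
        img[c, y, x] += 1.0            # event count
        img[2 + c, y, x] = t           # most recent (normalized) timestamp
    return img

# Toy usage: a stream of 5 events given as (x, y, t, polarity).
xs = np.array([3, 3, 4, 5, 3]); ys = np.array([2, 2, 2, 6, 2])
ts = np.array([0.00, 0.01, 0.02, 0.03, 0.04]); ps = np.array([1, -1, 1, 1, -1])
frame = events_to_image(xs, ys, ts, ps, height=8, width=8)
print(frame.shape)  # (4, 8, 8)
```

A frame like this can be fed to an ordinary convolutional network, which is what makes the representation attractive: standard image architectures apply unchanged.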
Unsupervised Event-based Learning of Optical Flow, Depth, and Egomotion
In this work, we propose a novel framework for unsupervised learning for event cameras that learns motion information from only the event stream. In particular, we propose an input representation of the events in the form of a discretized volume that maintains the temporal distribution of the events, which we pass through a neural network to predict the motion of the events. This motion is used to attempt to remove any motion blur in the event image. We then propose a loss function applied to the motion-compensated event image that measures the motion blur in this image. We train two networks with this framework, one to predict optical flow and one to predict egomotion and depths, and evaluate these networks on the Multi Vehicle Stereo Event Camera dataset, along with qualitative results from a variety of different scenes.
Comment: 9 pages, 7 figures
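As a rough illustration of the discretized volume described above, the sketch below accumulates events into a fixed number of temporal bins, splitting each event's polarity between the two nearest bins with linear weights so that the temporal distribution is approximately preserved. Function and variable names are assumptions for illustration, not the paper's actual code.

```python
import numpy as np

def events_to_voxel(xs, ys, ts, ps, bins, height, width):
    """Discretized event volume: each event's polarity is shared between
    its two nearest temporal bins with linear interpolation weights,
    keeping the timing information that a flat event image discards."""
    vol = np.zeros((bins, height, width), dtype=np.float32)
    # Map timestamps onto the continuous bin axis [0, bins-1].
    t = (ts - ts.min()) / max(ts.max() - ts.min(), 1e-9) * (bins - 1)
    t0 = np.floor(t).astype(int)
    w1 = t - t0                        # weight assigned to the upper bin
    for i in range(len(xs)):
        b = t0[i]
        vol[b, ys[i], xs[i]] += ps[i] * (1.0 - w1[i])
        if b + 1 < bins:
            vol[b + 1, ys[i], xs[i]] += ps[i] * w1[i]
    return vol

# Toy usage: two events of opposite polarity.
xs = np.array([3, 4]); ys = np.array([2, 2])
ts = np.array([0.0, 0.5]); ps = np.array([1, -1])
print(events_to_voxel(xs, ys, ts, ps, bins=5, height=8, width=8).shape)  # (5, 8, 8)
```

A motion-compensation loss of the kind the abstract describes would then warp events by the predicted motion and score how sharp (unblurred) the resulting event image is; sharper warped images indicate better motion estimates.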
Unified Visual Relationship Detection with Vision and Language Models
This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets. Merging labels spanning different datasets can be challenging due to inconsistent taxonomies. The issue is exacerbated in visual relationship detection when second-order visual semantics are introduced between pairs of objects. To address this challenge, we propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection that leverages vision and language models (VLMs). VLMs provide well-aligned image and text embeddings, in which similar relationships are optimized to be close to each other for semantic unification. Our bottom-up design enables the model to enjoy the benefit of training with both object detection and visual relationship datasets. Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model. UniVRD achieves 38.07 mAP on HICO-DET, outperforming the current best bottom-up HOI detector by 14.26 mAP. More importantly, we show that our unified detector performs as well as dataset-specific models in mAP, and achieves further improvements when we scale up the model. Our code will be made publicly available on GitHub.
Comment: Accepted to ICCV 2023. Code is available at https://github.com/google-research/scenic/tree/main/scenic/projects/univr
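As a hedged sketch of why aligned VLM embeddings help unify relationship labels across datasets, the toy code below scores a detected subject-object pair against free-form relationship phrases by cosine similarity in a shared embedding space. Everything here (`score_relationships`, `text_encoder`, the 512-dim embeddings) is a hypothetical stand-in, not UniVRD's actual interface.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

def score_relationships(pair_embed, rel_phrases, text_encoder):
    """Score one detected (subject, object) pair against a unified label
    space of relationship phrases via embedding similarity. Because labels
    from different datasets are embedded by the same text tower, phrases
    with similar meaning land close together, which is the unification idea."""
    scores = {p: cosine(pair_embed, text_encoder(p)) for p in rel_phrases}
    return max(scores, key=scores.get), scores

# Toy usage with random vectors standing in for a real VLM's towers.
rng = np.random.default_rng(0)
fake_text_encoder = lambda s: rng.standard_normal(512)
best, _ = score_relationships(rng.standard_normal(512),
                              ["person riding horse", "person feeding horse"],
                              fake_text_encoder)
print(best)
```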
Spatiotemporally Discriminative Video-Language Pre-Training with Text Grounding
Most existing video-language pre-training methods focus on instance-level alignment between video clips and captions via global contrastive learning, but neglect rich fine-grained local information, which is important for downstream tasks requiring temporal localization and semantic reasoning. In this work, we propose a simple yet effective video-language pre-training framework, namely G-ViLM, to learn discriminative spatiotemporal features. Two novel designs involving spatiotemporal grounding and temporal grouping promote learning local region-noun alignment and temporal-aware features simultaneously. Specifically, spatiotemporal grounding aggregates semantically similar video tokens and aligns them with noun phrases extracted from the caption to promote local region-noun correspondences. Moreover, temporal grouping leverages cut-and-paste to manually create temporal scene changes and then learns distinguishable features from different scenes. Comprehensive evaluations demonstrate that G-ViLM performs favorably against existing approaches on four representative downstream tasks, covering text-video retrieval, video question answering, video action recognition, and temporal action localization. G-ViLM performs competitively on all evaluated tasks and in particular achieves R@10 of 65.1 on zero-shot MSR-VTT retrieval, over 9% higher than the state-of-the-art method.
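The cut-and-paste idea behind temporal grouping can be illustrated with a small sketch: splice a clip from one video into another to manufacture a temporal scene change, and keep per-frame source labels so a model can be trained to tell the two scenes apart. The function below is an illustrative assumption about the mechanics, not G-ViLM's actual augmentation code.

```python
import numpy as np

def cut_and_paste(video_a, video_b, start, length):
    """Create an artificial temporal scene change by pasting a clip from
    video_b into video_a. Returns the mixed video and per-frame source
    labels (0 = from A, 1 = from B), which supervise scene discrimination.
    Both videos are (T, H, W, C) arrays with the same shape."""
    mixed = video_a.copy()
    mixed[start:start + length] = video_b[start:start + length]
    labels = np.zeros(len(video_a), dtype=np.int64)
    labels[start:start + length] = 1
    return mixed, labels

# Toy usage: two 16-frame clips; frames 6..9 come from clip B.
a = np.zeros((16, 32, 32, 3)); b = np.ones((16, 32, 32, 3))
mixed, labels = cut_and_paste(a, b, start=6, length=4)
print(labels)  # [0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0]
```

Because the scene boundary is known by construction, no human temporal annotation is needed, which is what makes the trick usable at pre-training scale.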
PolyMaX: General Dense Prediction with Mask Transformer
Dense prediction tasks, such as semantic segmentation, depth estimation, and surface normal prediction, can be easily formulated as per-pixel classification (discrete outputs) or regression (continuous outputs). This per-pixel prediction paradigm has remained popular due to the prevalence of fully convolutional networks. However, on the recent frontier of segmentation tasks, the community has been witnessing a paradigm shift from per-pixel prediction to cluster prediction with the emergence of transformer architectures, particularly the mask transformers, which directly predict a label for a mask instead of a pixel. Despite this shift, methods based on the per-pixel prediction paradigm still dominate the benchmarks on the other dense prediction tasks that require continuous outputs, such as depth estimation and surface normal prediction. Motivated by the success of DORN and AdaBins in depth estimation, achieved by discretizing the continuous output space, we propose to generalize the cluster-prediction based method to general dense prediction tasks. This allows us to unify dense prediction tasks with the mask transformer framework. Remarkably, the resulting model, PolyMaX, demonstrates state-of-the-art performance on three benchmarks of the NYUD-v2 dataset. We hope our simple yet effective design can inspire more research on exploiting mask transformers for additional dense prediction tasks. Code and model will be made available.
Comment: WACV 202
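To illustrate how cluster prediction can yield a continuous output such as depth, the sketch below softmaxes K mask logits per pixel and takes the expected depth over the K clusters, in the spirit of the DORN/AdaBins-style output-space discretization the abstract cites. The function name and shapes are illustrative assumptions, not PolyMaX's actual architecture.

```python
import numpy as np

def compose_depth(mask_logits, cluster_depths):
    """Turn cluster predictions into a dense continuous output: softmax the
    K mask logits at each pixel, then take the expected depth over clusters.
    mask_logits: (K, H, W) one logit map per predicted mask.
    cluster_depths: (K,) one scalar depth value per mask."""
    e = np.exp(mask_logits - mask_logits.max(axis=0, keepdims=True))
    probs = e / e.sum(axis=0, keepdims=True)            # per-pixel softmax over K
    return np.tensordot(cluster_depths, probs, axes=1)  # (H, W) expected depth

# Toy usage: 3 clusters over a 4x4 image.
rng = np.random.default_rng(1)
depth = compose_depth(rng.standard_normal((3, 4, 4)),
                      np.array([1.0, 2.5, 4.0]))
print(depth.shape)  # (4, 4)
```

The appeal of this formulation is that the same mask-plus-value machinery serves segmentation (discrete labels per mask) and depth or normals (continuous values per mask), which is what unifying dense prediction under one framework amounts to.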