Forecasting Hands and Objects in Future Frames
This paper presents an approach to forecast future presence and location of
human hands and objects. Given an image frame, the goal is to predict what
objects will appear in a future frame (e.g., 5 seconds later) and where they
will be located, even when they are not visible in the current frame. The
key idea is twofold: (1) an intermediate representation of a convolutional
object recognition model abstracts the scene information in its frame, and (2)
such representations for future frames can be predicted (i.e., regressed)
from that of the current frame. We design a new two-stream convolutional
neural network (CNN) architecture for videos by extending the state-of-the-art
convolutional object detection network, and present a new fully convolutional
regression network for predicting future scene representations. Our experiments
confirm that combining the regressed future representation with our detection
network allows reliable estimation of future hands and objects in videos. We
obtain substantially higher accuracy than the state-of-the-art future object
presence forecasting method on a public dataset.
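The core idea above, regressing the intermediate representation of a future frame from that of the current frame and then decoding it with a detection model, can be sketched in a few lines. This is a minimal illustration assuming a toy linear regressor and synthetic feature vectors; the paper's actual model is a fully convolutional regression network, not a least-squares fit.

```python
import numpy as np

# Toy "intermediate representations" for paired current/future frames.
# Shapes and the linear ground-truth mapping are illustrative assumptions.
rng = np.random.default_rng(0)
n_pairs, feat_dim = 200, 16
current_feats = rng.normal(size=(n_pairs, feat_dim))
true_transition = rng.normal(size=(feat_dim, feat_dim)) * 0.3
future_feats = current_feats @ true_transition  # pretend ground truth

# Fit a linear "regression network" by least squares (a stand-in for
# the paper's fully convolutional regression network).
W, *_ = np.linalg.lstsq(current_feats, future_feats, rcond=None)

# The regressed future representation would then be fed to the
# (frozen) detection head to localize future hands and objects.
predicted_future = current_feats @ W
mse = float(np.mean((predicted_future - future_feats) ** 2))
print(round(mse, 6))  # near 0: the regressor recovers the mapping
```

The point of the sketch is the training signal: pairs of (current, future) representations supervise the regressor, so no future-frame pixels are needed at test time.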
Identifying First-person Camera Wearers in Third-person Videos
We consider scenarios in which we wish to perform joint scene understanding,
object tracking, activity recognition, and other tasks in environments in which
multiple people are wearing body-worn cameras while a third-person static
camera also captures the scene. To do this, we need to establish person-level
correspondences across first- and third-person videos, which is challenging
because the camera wearer is not visible from his/her own egocentric video,
preventing the use of direct feature matching. In this paper, we propose a new
semi-Siamese Convolutional Neural Network architecture to address this novel
challenge. We formulate the problem as learning a joint embedding space for
first- and third-person videos that considers both spatial- and motion-domain
cues. A new triplet loss function is designed to minimize the distance between
correct first- and third-person matches while maximizing the distance between
incorrect ones. This end-to-end approach significantly outperforms several
baselines, in part by learning first- and third-person features optimized for
matching jointly with the distance measure itself.
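The triplet objective described above can be sketched concretely: pull a first-person embedding toward its matching third-person embedding (the positive) and push it away from a non-matching one (the negative). The margin value, the 2-D embeddings, and the plain Euclidean distance are illustrative assumptions, not the paper's learned semi-Siamese setup.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge on the gap between positive and negative distances."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

anchor = np.array([1.0, 0.0])     # first-person video embedding
positive = np.array([0.9, 0.1])   # correct third-person match

# A far-away incorrect match is already separated: zero loss.
loss_easy = triplet_loss(anchor, positive, np.array([-1.0, 0.5]))

# A near-miss incorrect match still violates the margin: positive loss.
loss_hard = triplet_loss(anchor, positive, np.array([0.8, 0.2]))

print(loss_easy)            # 0.0
print(round(loss_hard, 4))  # 0.8586
```

Training on many such triplets shapes the joint embedding space so that matching first-/third-person pairs end up closer than any mismatched pair by at least the margin.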
Multi-Task Spatiotemporal Neural Networks for Structured Surface Reconstruction
Deep learning methods have surpassed the performance of traditional
techniques on a wide range of problems in computer vision, but nearly all of
this work has studied consumer photos, where precisely correct output is often
not critical. It is less clear how well these techniques may apply on
structured prediction problems where fine-grained output with high precision is
required, such as in scientific imaging domains. Here we consider the problem
of segmenting echogram radar data collected from the polar ice sheets, which is
challenging because segmentation boundaries are often very weak and there is a
high degree of noise. We propose a multi-task spatiotemporal neural network
that combines 3D ConvNets and Recurrent Neural Networks (RNNs) to estimate ice
surface boundaries from sequences of tomographic radar images. We show that our
model outperforms the state-of-the-art on this problem by (1) avoiding the need
for hand-tuned parameters, (2) extracting multiple surfaces (ice-air and
ice-bed) simultaneously, (3) requiring less non-visual metadata, and (4) being
about 6 times faster.
Comment: 10 pages, 7 figures, published in WACV 201
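The multi-surface idea, extracting the ice-air and ice-bed boundaries from each echogram column simultaneously while enforcing their ordering, can be illustrated on synthetic data. A simple per-column two-peak picker here stands in for the paper's 3D ConvNet + RNN model; the synthetic echogram, row offsets, and the 10-row separation constraint are all illustrative assumptions.

```python
import numpy as np

# Synthetic echogram (depth x width) with two bright layers in noise.
depth, width = 100, 8
rng = np.random.default_rng(1)
echogram = rng.normal(scale=0.1, size=(depth, width))

air_row, bed_row = 20, 70
echogram[air_row, :] += 3.0   # bright ice-air interface
echogram[bed_row, :] += 2.0   # weaker ice-bed interface

# Extract both surfaces at once: the top peak, then the strongest
# response at least 10 rows below it (ice-air must lie above ice-bed).
air_est = echogram.argmax(axis=0)
bed_est = np.array([
    col[a + 10:].argmax() + a + 10
    for col, a in zip(echogram.T, air_est)
])

print(air_est.tolist())  # all 20
print(bed_est.tolist())  # all 70
```

A learned spatiotemporal model replaces the hand-set peak rules (the paper's point about avoiding hand-tuned parameters) and propagates context across adjacent tomographic slices, which matters when boundaries are weak and noisy.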
Projection Robust Wasserstein Distance and Riemannian Optimization
Projection robust Wasserstein (PRW) distance, or Wasserstein projection
pursuit (WPP), is a robust variant of the Wasserstein distance. Recent work
suggests that this quantity is more robust than the standard Wasserstein
distance, in particular when comparing probability measures in high-dimensions.
However, it has been ruled out for practical applications because the
optimization model is essentially non-convex and non-smooth, which makes the
computation intractable. Our contribution in this paper is to revisit the original
motivation behind WPP/PRW, but take the hard route of showing that, despite its
non-convexity and non-smoothness, and even despite some hardness results
proved by~\citet{Niles-2019-Estimation} in a minimax sense, the original
formulation for PRW/WPP \textit{can} be efficiently computed in practice using
Riemannian optimization, yielding in relevant cases better behavior than its
convex relaxation. More specifically, we provide three simple algorithms with
solid theoretical guarantees on their complexity bounds (one in the appendix),
and demonstrate their effectiveness and efficiency by conducting extensive
experiments on synthetic and real data. This paper provides a first step into a
computational theory of the PRW distance and provides the links between optimal
transport and Riemannian optimization.
Comment: Accepted by NeurIPS 2020; the first two authors contributed equally;
fix the confusing parts in the proof and refine the algorithms and complexity
bound
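The projection idea behind PRW/WPP can be made concrete: project two high-dimensional samples onto a single direction, compute the (easy) 1-D Wasserstein distance there, and seek the direction that maximizes it. The sketch below compares a handful of random directions against the informative one instead of running Riemannian optimization, so it is an illustration of the objective, not the paper's algorithm; the dimensions, sample sizes, and the mean-shift construction are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n = 20, 300

# Two measures that differ only along the first coordinate.
x = rng.normal(size=(n, dim))
y = rng.normal(size=(n, dim))
y[:, 0] += 2.0

def w1_1d(a, b):
    """1-D Wasserstein-1 distance between equal-size samples,
    computed via sorted order statistics."""
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))

def projected_w1(direction):
    d = direction / np.linalg.norm(direction)
    return w1_1d(x @ d, y @ d)

# The informative direction e_1 beats random ones, which mostly
# average the shift away -- the point of projection *robustness*.
e1 = np.eye(dim)[0]
random_dirs = rng.normal(size=(5, dim))
best_random = max(projected_w1(d) for d in random_dirs)
print(projected_w1(e1) > best_random)  # True
```

Maximizing `projected_w1` over the sphere (or, for k-dimensional projections, over the Stiefel manifold) is exactly the non-convex problem the paper attacks with Riemannian optimization.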
Lifelong-MonoDepth: Lifelong Learning for Multi-Domain Monocular Metric Depth Estimation
With the rapid advancements in autonomous driving and robot navigation, there
is a growing demand for lifelong learning models capable of estimating metric
(absolute) depth. Lifelong learning approaches potentially offer significant
cost savings in terms of model training, data storage, and collection. However,
the quality of RGB images and depth maps is sensor-dependent, and depth maps in
the real world exhibit domain-specific characteristics, leading to variations
in depth ranges. These challenges limit existing methods to lifelong learning
scenarios with small domain gaps and relative depth map estimation. To
facilitate lifelong metric depth learning, we identify three crucial technical
challenges that require attention: i) developing a model capable of addressing
the depth scale variation through scale-aware depth learning, ii) devising an
effective learning strategy to handle significant domain gaps, and iii)
creating an automated solution for domain-aware depth inference in practical
applications. Based on the aforementioned considerations, in this paper, we
present i) a lightweight multi-head framework that effectively tackles the
depth scale imbalance, ii) an uncertainty-aware lifelong learning solution that
adeptly handles significant domain gaps, and iii) an online domain-specific
predictor selection method for real-time inference. Through extensive numerical
studies, we show that the proposed method can achieve good efficiency,
stability, and plasticity, leading the benchmarks by 8% to 15%.
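The third component above, online domain-aware predictor selection, can be sketched as routing each input to the per-domain head with the lowest predictive uncertainty. The heads, the mean-intensity uncertainty proxy, and all numbers below are illustrative assumptions, not the paper's uncertainty-aware model.

```python
import numpy as np

def head_predict(head_id, image):
    """Toy per-domain depth head: returns (depth_map, uncertainty)."""
    scale = [1.0, 10.0, 80.0][head_id]  # per-domain metric depth range
    depth = np.full_like(image, scale)
    # Pretend uncertainty is high when the input does not match the
    # head's domain statistics (here, crudely: mean intensity).
    uncertainty = abs(float(image.mean()) - scale)
    return depth, uncertainty

def route(image, n_heads=3):
    """Pick the head whose predictive uncertainty is lowest."""
    preds = [head_predict(h, image) for h in range(n_heads)]
    best = min(range(n_heads), key=lambda h: preds[h][1])
    return best, preds[best][0]

indoor_like = np.full((4, 4), 1.2)  # matches head 0's domain statistics
chosen, depth = route(indoor_like)
print(chosen)  # 0: the small-depth-range head is selected
```

Because each head keeps its own metric scale, this kind of routing addresses the depth-range variation across domains without a manual domain label at inference time.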
Learning Latent Subevents in Activity Videos Using Temporal Attention Filters
In this paper, we introduce the concept of temporal attention filters, and describe how they can be used for human activity recognition from videos. Many high-level activities are composed of multiple temporal parts (e.g., sub-events) with different durations/speeds, and our objective is to make the model explicitly learn such temporal structure using multiple attention filters and benefit from it. Our temporal filters are designed to be fully differentiable, allowing end-to-end training of the temporal filters together with the underlying frame-based or segment-based convolutional neural network architectures. This paper presents an approach for learning a set of optimal static temporal attention filters to be shared across different videos, and extends it to dynamically adjust the attention filters per testing video using recurrent long short-term memory networks (LSTMs). This allows our temporal attention filters to learn latent sub-events specific to each activity. We experimentally confirm that the proposed temporal attention filters benefit activity recognition, and we visualize the learned latent sub-events.
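A temporal attention filter of the kind described above can be sketched as a differentiable Gaussian weighting over frame indices, parameterized by a center and a width, that pools per-frame features into one sub-event descriptor. The parameter values and the single-filter setup are illustrative assumptions; the paper learns banks of such filters (statically, or dynamically via LSTMs) end-to-end.

```python
import numpy as np

def gaussian_attention(num_frames, center, width):
    """Soft weights over frame indices, normalized to sum to 1.
    Differentiable in both `center` and `width`, which is what lets
    such filters be trained end-to-end with the underlying CNN."""
    t = np.arange(num_frames, dtype=float)
    w = np.exp(-0.5 * ((t - center) / width) ** 2)
    return w / w.sum()

# Per-frame CNN features for a 10-frame video (feature dim 3).
rng = np.random.default_rng(0)
frame_feats = rng.normal(size=(10, 3))

# One filter focused on the middle of the video: a latent "sub-event".
attn = gaussian_attention(num_frames=10, center=4.5, width=1.0)
subevent_feat = attn @ frame_feats  # weighted temporal pooling

print(attn.argmax())        # 4: weights peak near the chosen center
print(subevent_feat.shape)  # (3,)
```

Stacking several such filters with different learned centers/widths yields one descriptor per latent sub-event, which is the temporal structure the recognition model exploits.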