Minimum Latency Deep Online Video Stabilization
We present a novel camera path optimization framework for the task of online
video stabilization. Typically, a stabilization pipeline consists of three
steps: motion estimation, path smoothing, and novel view rendering. Most
previous methods concentrate on motion estimation, proposing various global or
local motion models. In contrast, path optimization receives relatively less
attention, especially in the important online setting, where no future frames
are available. In this work, we adopt recent off-the-shelf high-quality deep
motion models for motion estimation to recover the camera trajectory and focus
on the latter two steps. Our network takes a short 2D camera path in a sliding
window as input and outputs the stabilizing warp field of the last frame in the
window, which warps the incoming frame to its stabilized position. A hybrid loss
is carefully designed to enforce spatial and temporal consistency. In addition,
we build a motion dataset that contains stable and unstable motion pairs for
training. Extensive experiments demonstrate that our approach significantly
outperforms state-of-the-art online methods both qualitatively and
quantitatively and achieves comparable performance to offline methods. Our code
and dataset are available at https://github.com/liuzhen03/NNDVS
Comment: Accepted by ICCV 2023
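Though the abstract does not give the network design, the stated interface can be sketched in a few lines: a sliding window of 2D camera-path samples goes in, and a dense warp field for the window's last frame comes out. Everything below (the MLP, the 16-frame window, the 16x16 warp grid, and the name PathSmoother) is an illustrative assumption, not the authors' architecture.

```python
# A minimal sketch of a sliding-window path-to-warp network in the spirit
# of the paper; all sizes and layer choices are assumptions.
import torch
import torch.nn as nn

class PathSmoother(nn.Module):
    def __init__(self, window=16, grid_h=16, grid_w=16):
        super().__init__()
        # Input: a 2D camera path (x, y per frame) over a sliding window.
        self.mlp = nn.Sequential(
            nn.Linear(window * 2, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            # Output: a dense 2D warp offset field for the last frame.
            nn.Linear(256, grid_h * grid_w * 2),
        )
        self.grid_h, self.grid_w = grid_h, grid_w

    def forward(self, path):                 # path: (B, window, 2)
        offsets = self.mlp(path.flatten(1))  # (B, grid_h * grid_w * 2)
        return offsets.view(-1, self.grid_h, self.grid_w, 2)

path = torch.randn(1, 16, 2)                # toy unstable camera path
warp = PathSmoother()(path)                 # (1, 16, 16, 2) warp field
# The warp field could then be added to an identity grid and applied to
# the incoming frame with torch.nn.functional.grid_sample.
```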
GlobalFlowNet: Video Stabilization using Deep Distilled Global Motion Estimates
Videos shot by laymen using hand-held cameras contain undesirable shaky
motion. Estimating the global motion between successive frames, in a manner not
influenced by moving objects, is central to many video stabilization
techniques, but poses significant challenges. A large body of work uses 2D
affine transformations or homography for the global motion. However, in this
work, we introduce a more general representation scheme, which adapts any
existing optical flow network to ignore the moving objects and obtain a
spatially smooth approximation of the global motion between video frames. We
achieve this by a knowledge distillation approach, where we first introduce a
low pass filter module into the optical flow network to constrain the predicted
optical flow to be spatially smooth. This becomes our student network, named
\textsc{GlobalFlowNet}. Then, using the original optical flow network as the
teacher network, we train the student network using a robust loss function.
Given a trained \textsc{GlobalFlowNet}, we stabilize videos using a two-stage
process. In the first stage, we correct the instability in affine parameters
using a quadratic programming approach constrained by a user-specified cropping
limit to control loss of field of view. In the second stage, we stabilize the
video further by smoothing global motion parameters, expressed using a small
number of discrete cosine transform coefficients. In extensive experiments on a
variety of videos, our technique outperforms state-of-the-art
techniques in terms of subjective quality and different quantitative measures
of video stability. The source code is publicly available at
\href{https://github.com/GlobalFlowNet/GlobalFlowNet}{https://github.com/GlobalFlowNet/GlobalFlowNet}
Comment: Accepted in WACV 2023
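The second stage described above admits a compact illustration: smooth a per-frame global motion parameter by keeping only a few low-frequency DCT coefficients. The sketch below uses a toy random-walk trajectory and an assumed cutoff of 8 coefficients; it is not the paper's constrained optimization.

```python
# A minimal sketch of DCT-based motion-parameter smoothing; the trajectory
# and cutoff are toy assumptions, not the paper's formulation.
import numpy as np
from scipy.fft import dct, idct

def dct_smooth(trajectory, keep=8):
    """Low-pass a 1D motion-parameter trajectory via truncated DCT."""
    coeffs = dct(trajectory, norm='ortho')
    coeffs[keep:] = 0.0            # discard high-frequency (shaky) terms
    return idct(coeffs, norm='ortho')

rng = np.random.default_rng(0)
raw = np.cumsum(rng.normal(size=200))   # toy shaky translation, 200 frames
smooth = dct_smooth(raw, keep=8)        # stabilized parameter trajectory
```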
Adaptive Sampling-based Particle Filter for Visual-inertial Gimbal in the Wild
In this paper, we present a Computer Vision (CV) based tracking and fusion
algorithm, dedicated to a 3D-printed gimbal system on drones operating in
natural environments. The gimbal system can robustly stabilize the camera
orientation in challenging natural scenes by using the skyline and ground plane as
references. Our main contributions are the following: a) a lightweight
ResNet-18 backbone network is trained from scratch and deployed on the
Jetson Nano platform to segment the image into two binary parts (ground and
sky); b) our geometric assumption based on natural cues enables
robust visual tracking by using the skyline and ground plane as references; c)
a spherical-surface-based adaptive particle sampling scheme flexibly fuses
orientation estimates from multiple sensor sources. The whole algorithm pipeline is tested on our
customized gimbal module including Jetson and other hardware components. The
experiments were performed on top of a building in a real landscape.
Comment: 6 pages, 9 figures, 2 pseudocode listings, one table; accepted by
ICRA 2023
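As a rough sketch of contribution (c) only: the snippet below fuses two orientation cues with particles on the unit sphere, using an assumed von Mises-Fisher-style weighting, toy cue vectors, and fixed noise scales. It does not reproduce the paper's adaptive sampling strategy.

```python
# A toy spherical particle-fusion sketch; weighting scheme, noise scales,
# and cue values are illustrative assumptions.
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def fuse_orientation(particles, cues, kappa=50.0):
    # Particles aligned with every cue receive the largest weights.
    weights = np.ones(len(particles))
    for cue in cues:
        weights *= np.exp(kappa * particles @ cue)
    weights /= weights.sum()
    # Resample; an adaptive scheme would vary particle count/spread here.
    idx = np.random.choice(len(particles), size=len(particles), p=weights)
    jitter = np.random.normal(scale=0.02, size=particles.shape)
    return normalize(particles[idx] + jitter)

particles = normalize(np.random.normal(size=(500, 3)))
vision_cue = normalize(np.array([0.05, 0.02, 1.0]))  # e.g. from skyline
imu_cue = normalize(np.array([0.0, 0.04, 1.0]))      # e.g. from the IMU
particles = fuse_orientation(particles, [vision_cue, imu_cue])
estimate = normalize(particles.mean(axis=0))         # fused direction
```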
Beyond the Camera: Neural Networks in World Coordinates
Eye movement and strategic placement of the visual field onto the retina
give animals increased resolution of the scene and suppress distracting
information. This fundamental system has been missing from video understanding
with deep networks, typically limited to 224 by 224 pixel content locked to the
camera frame. We propose a simple idea, WorldFeatures, where each feature at
every layer has a spatial transformation, and the feature map is only
transformed as needed. We show that a network built with these WorldFeatures,
can be used to model eye movements, such as saccades, fixation, and smooth
pursuit, even in a batch setting on pre-recorded video. That is, the network
can for example use all 224 by 224 pixels to look at a small detail one moment,
and the whole scene the next. We show that typical building blocks, such as
convolutions and pooling, can be adapted to support WorldFeatures using
available tools. Experiments are presented on the Charades, Olympic Sports, and
Caltech-UCSD Birds-200-2011 datasets, exploring action recognition,
fine-grained recognition, and video stabilization.
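The core mechanic, resampling a feature map under an attached spatial transform, can be sketched with standard grid sampling; the transform values and feature shape below are toy assumptions rather than the paper's implementation.

```python
# A minimal sketch of viewing a "world-coordinate" feature map under an
# affine transform; values are toy assumptions.
import torch
import torch.nn.functional as F

def view_features(feat, theta):
    """Resample a feature map under a 2x3 affine world-to-view transform."""
    grid = F.affine_grid(theta, feat.shape, align_corners=False)
    return F.grid_sample(feat, grid, align_corners=False)

feat = torch.randn(1, 64, 56, 56)            # feature map in world coords
zoom = torch.tensor([[[0.5, 0.0, 0.2],       # zoom into a small detail...
                      [0.0, 0.5, -0.1]]])
wide = torch.tensor([[[1.0, 0.0, 0.0],       # ...or view the whole scene
                      [0.0, 1.0, 0.0]]])
detail_view = view_features(feat, zoom)
scene_view = view_features(feat, wide)
```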
Supervised Homography Learning with Realistic Dataset Generation
In this paper, we propose an iterative framework, which consists of two
phases: a generation phase and a training phase, to generate realistic training
data and yield a supervised homography network. In the generation phase, given
an unlabeled image pair, we utilize the pre-estimated dominant plane masks and
homography of the pair, along with another sampled homography that serves as
ground truth to generate a new labeled training pair with realistic motion. In
the training phase, the generated data is used to train the supervised
homography network, in which the training data is refined via a content
consistency module and a quality assessment module. Once an iteration is
finished, the trained network is used in the next data generation phase to
update the pre-estimated homography. Through such an iterative strategy, the
quality of the dataset and the performance of the network can be gradually and
simultaneously improved. Experimental results show that our method achieves
state-of-the-art performance, and existing supervised methods can also be
improved with the generated dataset. Code and dataset are available at
https://github.com/megvii-research/RealSH.
Comment: Accepted by ICCV 2023
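The heart of the generation phase, warping an image with a sampled homography that then serves as the ground-truth label, can be sketched as below. The perturbation magnitude and stand-in image are assumptions, and the real pipeline additionally uses the dominant plane masks and pre-estimated homography described above.

```python
# A toy sketch of labeled-pair generation via a sampled homography;
# jitter magnitude and image are illustrative assumptions.
import numpy as np
import cv2

def sample_homography(h, w, jitter=16.0):
    src = np.float32([[0, 0], [w, 0], [w, h], [0, h]])
    dst = src + np.random.uniform(-jitter, jitter, src.shape).astype(np.float32)
    return cv2.getPerspectiveTransform(src, dst)

img = np.random.randint(0, 255, (240, 320, 3), dtype=np.uint8)  # stand-in image
H_gt = sample_homography(*img.shape[:2])
warped = cv2.warpPerspective(img, H_gt, (img.shape[1], img.shape[0]))
# (img, warped, H_gt) now forms one labeled training triple.
```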