Efficient Multi-level Correlating for Visual Tracking
Correlation filter (CF) based tracking algorithms have recently demonstrated
favorable performance. Nevertheless, top-performing trackers usually employ
complicated optimization methods, which constrains their real-time application.
How to accelerate the tracking speed while retaining the tracking accuracy is a
significant issue. In this paper, we propose a multi-level CF-based tracking
approach named MLCFT which further explores the potential capacity of CF with
two-stage detection: primal detection and oriented re-detection. The cascaded
detection scheme is simple but competent to prevent model drift and accelerate
the speed. An effective fusion method based on relative entropy is introduced
to combine the complementary features extracted from deep and shallow layers of
convolutional neural networks (CNN). Moreover, a novel online model update
strategy is utilized in our tracker, which enhances the tracking performance
further. Experimental results demonstrate that our proposed approach
outperforms the most state-of-the-art trackers while tracking at speed of
exceeded 16 frames per second on challenging benchmarks.Comment: Accepted by ACCV'201
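The relative-entropy fusion of deep and shallow CNN response maps lends itself to a small sketch. The abstract does not spell out the exact rule, so the weighting below (KL divergence of each response map from their mean, mapped through `exp(-d)`) is a hypothetical stand-in, and `fuse_responses` is an illustrative name rather than the paper's method.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL divergence between two response maps, normalized to sum to 1.
    Assumes non-negative entries."""
    p = p / (p.sum() + eps)
    q = q / (q.sum() + eps)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

def fuse_responses(deep_resp, shallow_resp):
    """Weight each CF response map by how little it diverges from the
    mean response (hypothetical relative-entropy fusion rule)."""
    mean_resp = (deep_resp + shallow_resp) / 2.0
    w_deep = np.exp(-kl_divergence(deep_resp, mean_resp))
    w_shallow = np.exp(-kl_divergence(shallow_resp, mean_resp))
    return (w_deep * deep_resp + w_shallow * shallow_resp) / (w_deep + w_shallow)
```

A response map that agrees with the consensus gets a weight near one; a divergent (likely drifting) map is downweighted.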
A Computation Control Motion Estimation Method for Complexity-Scalable Video Coding
In this paper, a new Computation-Control Motion Estimation (CCME) method is
proposed which can perform Motion Estimation (ME) adaptively under different
computation or power budgets while keeping high coding performance. We first
propose a new class-based method to measure the Macroblock (MB) importance
where MBs are classified into different classes and their importance is
measured by combining their class information as well as their initial matching
cost information. Based on the new MB importance measure, a complete CCME
framework is then proposed to allocate computation for ME. The proposed method
performs ME in a one-pass flow. Experimental results demonstrate that the
proposed method can allocate computation more accurately than previous methods
and thus achieves better performance under the same computation budget.
Comment: This manuscript is the accepted version for TCSVT (IEEE Transactions
on Circuits and Systems for Video Technology).
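The allocation idea can be sketched in a few lines. The class weights and the product form importance = class weight × initial matching cost are simplifications of the paper's measure, and `allocate_search_points` is a hypothetical name.

```python
def allocate_search_points(mb_classes, init_costs, budget, class_weight=None):
    """Split a total ME search-point budget across macroblocks in
    proportion to importance = class weight * initial matching cost
    (a simplified stand-in for the paper's class-based measure)."""
    if class_weight is None:
        class_weight = {0: 1.0, 1: 2.0, 2: 4.0}  # hypothetical class weights
    importance = [class_weight[c] * cost
                  for c, cost in zip(mb_classes, init_costs)]
    total = sum(importance) or 1.0
    return [int(round(budget * imp / total)) for imp in importance]
```

Under a tighter budget the same proportional rule simply hands every macroblock fewer search points, which is what makes the scheme one-pass.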
Learning Spatial-Aware Regressions for Visual Tracking
In this paper, we analyze the spatial information of deep features, and
propose two complementary regressions for robust visual tracking. First, we
propose a kernelized ridge regression model wherein the kernel value is defined
as the weighted sum of similarity scores of all pairs of patches between two
samples. We show that this model can be formulated as a neural network and thus
can be efficiently solved. Second, we propose a fully convolutional neural
network with spatially regularized kernels, through which the filter kernel
corresponding to each output channel is forced to focus on a specific region of
the target. Distance transform pooling is further exploited to determine the
effectiveness of each output channel of the convolution layer. The outputs from
the kernelized ridge regression model and the fully convolutional neural
network are combined to obtain the final response. Experimental results on
two benchmark datasets validate the effectiveness of the proposed method.
Comment: To appear in CVPR201
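The first regression has a standard closed form. In the sketch below the patch similarity is a plain dot product and the names (`patch_kernel`, `train_krr`) are illustrative; the paper's similarity scores and weights may differ, but the structure (kernel = weighted sum over all patch pairs, then alpha = (K + lam*I)^-1 y) is the textbook kernelized ridge regression it describes.

```python
import numpy as np

def patch_kernel(x1, x2, weights):
    """Kernel value as the weighted sum of similarity scores over all
    pairs of patches (dot products here; the paper's similarity and
    weights may differ). x1, x2: (num_patches, patch_dim)."""
    sims = x1 @ x2.T            # (num_patches, num_patches) pairwise scores
    return float(np.sum(weights * sims))

def train_krr(samples, labels, weights, lam=0.1):
    """Closed-form kernelized ridge regression: alpha = (K + lam*I)^-1 y."""
    n = len(samples)
    K = np.array([[patch_kernel(samples[i], samples[j], weights)
                   for j in range(n)] for i in range(n)])
    return np.linalg.solve(K + lam * np.eye(n), np.asarray(labels, float))

def predict(x, samples, alpha, weights):
    """Response for a candidate sample under the learned coefficients."""
    return float(sum(a * patch_kernel(s, x, weights)
                     for a, s in zip(alpha, samples)))
```

Because the kernel decomposes over patch pairs, the same computation can be unrolled into a neural network layer, which is the reformulation the abstract refers to.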
Every Pixel Counts ++: Joint Learning of Geometry and Motion with 3D Holistic Understanding
Learning to estimate 3D geometry in a single frame and optical flow from
consecutive frames by watching unlabeled videos via deep convolutional network
has made significant progress recently. Current state-of-the-art (SoTA) methods
treat the two tasks independently. One typical assumption of existing depth
estimation methods is that the scenes contain no independently moving objects,
while object motion can be easily modeled using optical flow. In this paper,
we propose to address the two tasks as a whole, i.e. to jointly understand
per-pixel 3D geometry and motion. This eliminates the need of static scene
assumption and enforces the inherent geometrical consistency during the
learning process, yielding significantly improved results for both tasks. We
call our method "Every Pixel Counts++" or "EPC++". Specifically, during
training, given two consecutive frames from a video, we adopt three parallel
networks to predict the camera motion (MotionNet), dense depth map (DepthNet),
and per-pixel optical flow between two frames (OptFlowNet) respectively. The
three types of information are fed into a holistic 3D motion parser (HMP), and
per-pixel 3D motion of both the rigid background and moving objects is
disentangled and recovered. Comprehensive experiments were conducted on
datasets with different scenes, including driving scenario (KITTI 2012 and
KITTI 2015 datasets), mixed outdoor/indoor scenes (Make3D) and synthetic
animation (MPI Sintel dataset). Performance on the five tasks of depth
estimation, optical flow estimation, odometry, moving object segmentation and
scene flow estimation shows that our approach outperforms other SoTA methods.
Code will be available at: https://github.com/chenxuluo/EPC.
Comment: Chenxu Luo, Zhenheng Yang, and Peng Wang contributed equally; TPAMI
submission
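The core quantity the holistic 3D motion parser separates out, the rigid (camera-induced) part of per-pixel motion, can be sketched from DepthNet's depth and MotionNet's pose. The code below is standard pinhole geometry, not the paper's exact parser, and `rigid_flow` is an illustrative name.

```python
import numpy as np

def rigid_flow(depth, K, R, t):
    """Optical flow induced by camera motion (R, t) over a static scene:
    backproject each pixel with its depth, apply the rigid transform,
    and reproject. The residual w.r.t. the full flow (OptFlowNet's
    output) would correspond to independently moving objects."""
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)], -1).reshape(-1, 3).T.astype(float)
    cam = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)  # 3D points
    cam2 = R @ cam + t.reshape(3, 1)                       # rigid motion
    proj = K @ cam2                                        # reproject
    proj = proj[:2] / proj[2:]
    return (proj - pix[:2]).T.reshape(h, w, 2)
```

With an identity rotation and zero translation the rigid flow vanishes everywhere, which is the geometric consistency the joint training enforces.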
Object-Adaptive LSTM Network for Real-time Visual Tracking with Adversarial Data Augmentation
In recent years, deep learning based visual tracking methods have obtained
great success owing to the powerful feature representation ability of
Convolutional Neural Networks (CNNs). Among these methods, classification-based
tracking methods exhibit excellent performance while their speeds are heavily
limited by the expensive computation for massive proposal feature extraction.
In contrast, matching-based tracking methods (such as Siamese networks) possess
remarkable speed superiority. However, the absence of online updating renders
these methods unable to adapt to significant object appearance variations. In this
paper, we propose a novel real-time visual tracking method, which adopts an
object-adaptive LSTM network to effectively capture the video sequential
dependencies and adaptively learn the object appearance variations. For high
computational efficiency, we also present a fast proposal selection strategy,
which utilizes the matching-based tracking method to pre-estimate dense
proposals and selects high-quality ones to feed to the LSTM network for
classification. This strategy efficiently filters out some irrelevant proposals
and avoids the redundant computation for feature extraction, which enables our
method to operate faster than conventional classification-based tracking
methods. In addition, to handle the problems of sample inadequacy and class
imbalance during online tracking, we adopt a data augmentation technique based
on the Generative Adversarial Network (GAN) to facilitate the training of the
LSTM network. Extensive experiments on four visual tracking benchmarks
demonstrate the state-of-the-art performance of our method in terms of both
tracking accuracy and speed, which exhibits the great potential of recurrent
structures for visual tracking.
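The proposal pre-selection strategy reduces to ranking dense proposals by the fast matching score and keeping only a short list for the LSTM classifier. The sketch below, with hypothetical names, shows just that filtering step.

```python
def select_proposals(proposals, match_scores, k=8):
    """Keep the k proposals that the fast matching-based stage scores
    highest, so the expensive classifier runs on a short list only."""
    order = sorted(range(len(proposals)),
                   key=lambda i: match_scores[i], reverse=True)
    return [proposals[i] for i in order[:k]]
```

Since feature extraction dominates the classifier's cost, shrinking the list from hundreds of dense proposals to a handful is what buys the speed-up.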
Multi-resolution mapping and planning for UAV navigation in attitude-constrained environments
In this thesis we aim to bridge the gap between high quality map reconstruction and Unmanned Aerial Vehicle (UAV) SE(3) motion planning in challenging environments with narrow openings, such as disaster areas, which requires attitude to be considered. We propose an efficient system that leverages the concept of adaptive-resolution volumetric mapping, which naturally integrates with the hierarchical decomposition of space in an octree data structure. Instead of a Truncated Signed Distance Function (TSDF), we adopt mapping of occupancy probabilities in log-odds representation, which allows representation of both surfaces, as well as the entire free, i.e. observed, space, as opposed to unobserved space. We introduce a method for choosing resolution on the fly, in real time, by means of a multi-scale max-min pooling of the input depth image. The notion of explicit free space mapping paired with the spatial hierarchy in the data structure, as well as map resolution, allows for collision queries, as needed for robot motion planning, at unprecedented speed. Our mapping strategy supports pinhole cameras as well as spherical sensor models. Additionally, we introduce a first-of-a-kind global minimum cost path search method based on A* that considers attitude along the path. State-of-the-art methods incorporate attitude only in the refinement stage. To make the problem tractable, our method exploits an adaptive and coarse-to-fine approach using global and local A* runs, plus an efficient method to introduce the UAV attitude into the process. We integrate our method with an SE(3) trajectory optimisation method based on a safe-flight-corridor, yielding a complete path planning pipeline.
We quantitatively evaluate our mapping strategy in terms of mapping accuracy, memory, runtime performance, and planning performance showing improvements over the state-of-the-art, particularly in cases requiring high resolution maps. Furthermore, extensive evaluation is undertaken using the AirSim flight simulator under closed loop control in a set of randomised maps, allowing us to quantitatively assess our path initialisation method. We show that it achieves significantly higher success rates than the baselines, at a reduced computational burden.
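The log-odds occupancy representation mentioned above follows the standard recursive Bayesian update; a minimal sketch, where the sensor-model probabilities 0.7 (hit) and 0.4 (pass-through) are common defaults rather than values taken from the thesis:

```python
import math

def logodds(p):
    """Convert a probability to log-odds."""
    return math.log(p / (1.0 - p))

def update_cell(l_prior, hit, l_occ=logodds(0.7), l_free=logodds(0.4)):
    """Standard log-odds occupancy update: add the sensor-model log-odds
    for a hit or a pass-through ray, keeping free space explicit."""
    return l_prior + (l_occ if hit else l_free)

def probability(l):
    """Recover occupancy probability from log-odds."""
    return 1.0 - 1.0 / (1.0 + math.exp(l))
```

Because free space is updated explicitly (not left as "unknown"), a planner can distinguish observed-free cells, where collision queries may pass, from unobserved ones, which is the property the thesis exploits for fast collision checking.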
SceneFlowFields++: Multi-frame Matching, Visibility Prediction, and Robust Interpolation for Scene Flow Estimation
State-of-the-art scene flow algorithms pursue the conflicting targets of
accuracy, run time, and robustness. With the successful concept of pixel-wise
matching and sparse-to-dense interpolation, we push the limits of scene flow
estimation. Avoiding strong assumptions on the domain or the problem yields a
more robust algorithm. This algorithm is fast because we avoid explicit
regularization during matching, which allows an efficient computation. Using
image information from multiple time steps and explicit visibility prediction
based on previous results, we achieve competitive performance on different
data sets. Our contributions and results are evaluated in comparative
experiments. Overall, we present an accurate scene flow algorithm that is
faster and more generic than any individual benchmark leader.Comment: arXiv admin note: text overlap with arXiv:1710.1009
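The sparse-to-dense interpolation step can be illustrated compactly. The paper uses a robust, image-guided interpolation; the inverse-distance-weighted k-nearest-neighbour average below is a generic stand-in, and `densify` is a hypothetical name.

```python
import numpy as np

def densify(sparse_pts, sparse_vals, query_pts, k=3):
    """Sparse-to-dense interpolation: inverse-distance-weighted average
    of the k nearest sparse matches (a generic stand-in for the paper's
    robust interpolation)."""
    out = []
    for q in query_pts:
        d = np.linalg.norm(sparse_pts - q, axis=1)  # distances to matches
        idx = np.argsort(d)[:k]                     # k nearest matches
        w = 1.0 / (d[idx] + 1e-6)
        out.append(np.average(sparse_vals[idx], axis=0, weights=w))
    return np.array(out)
```

Because the matches themselves are computed without explicit regularization, smoothness enters only through this interpolation, which is what keeps the pipeline fast.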
Real-Time Area Coverage and Target Localization using Receding-Horizon Ergodic Exploration
Although a number of solutions exist for the problems of coverage, search,
and target localization, commonly addressed separately, whether there exists
a unified strategy that addresses these objectives in a coherent manner
without being application-specific remains a largely open research question. In this
paper, we develop a receding-horizon ergodic control approach, based on hybrid
systems theory, that has the potential to fill this gap. The nonlinear model
predictive control algorithm plans real-time motions that optimally improve
ergodicity with respect to a distribution defined by the expected information
density across the sensing domain. We establish a theoretical framework for
global stability guarantees with respect to a distribution. Moreover, the
approach is distributable across multiple agents, so that each agent can
independently compute its own control while sharing statistics of its coverage
across a communication network. We demonstrate the method in both simulation
and in experiment in the context of target localization, illustrating that the
algorithm is independent of the number of targets being tracked and can be run
in real-time on computationally limited hardware platforms.
Comment: 18 pages
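The ergodic objective compares time-averaged trajectory statistics with the expected-information density, mode by mode. The 1D spectral sketch below (cosine basis on [0, L], Sobolev-style mode weights) is a simplified illustration of that metric, not the paper's receding-horizon controller; all names are assumptions.

```python
import numpy as np

def ergodic_metric(traj, target_coeffs, L=1.0):
    """1D spectral ergodic metric: weighted distance between the
    time-averaged cosine coefficients of a trajectory and those of a
    target density, with higher modes downweighted."""
    traj = np.asarray(traj, float)
    ks = np.arange(len(target_coeffs))
    ck = np.array([np.mean(np.cos(np.pi * k * traj / L)) for k in ks])
    lam = (1.0 + ks ** 2) ** -1.0          # Sobolev-style weights
    return float(np.sum(lam * (ck - np.asarray(target_coeffs)) ** 2))
```

A trajectory that spends time in proportion to the target density drives the metric toward zero; a controller that descends this metric therefore covers the domain rather than merely visiting the peak, which is why the approach unifies coverage, search, and localization.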
Context-Aware Deep Spatio-Temporal Network for Hand Pose Estimation from Depth Images
As a fundamental and challenging problem in computer vision, hand pose
estimation aims to estimate the hand joint locations from depth images.
Typically, the problem is modeled as learning a mapping function from images to
hand joint coordinates in a data-driven manner. In this paper, we propose
Context-Aware Deep Spatio-Temporal Network (CADSTN), a novel method to jointly
model the spatio-temporal properties for hand pose estimation. Our proposed
network is able to learn the representations of the spatial information and the
temporal structure from the image sequences. Moreover, by adopting adaptive
fusion method, the model is capable of dynamically weighting different
predictions to lay emphasis on sufficient context. Our method is evaluated on
two common benchmarks; the experimental results demonstrate that our proposed
approach achieves the best or second-best performance compared with
state-of-the-art methods and runs at 60 fps.
Comment: IEEE Transactions on Cybernetics
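The adaptive fusion of multiple predictions amounts to a confidence-weighted combination; the softmax weighting below is a simple stand-in for CADSTN's learned fusion, and `adaptive_fuse` is an illustrative name.

```python
import numpy as np

def adaptive_fuse(predictions, confidences):
    """Softmax-weighted fusion of several joint-location predictions,
    dynamically emphasizing the most confident one (a simple stand-in
    for the paper's learned adaptive fusion)."""
    c = np.asarray(confidences, float)
    w = np.exp(c - c.max())     # stable softmax over confidences
    w /= w.sum()
    return np.tensordot(w, np.asarray(predictions, float), axes=1)
```

With equal confidences this degenerates to a plain average; a dominant confidence effectively selects that branch's prediction.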
Deep Learning-Based Video Coding: A Review and A Case Study
The past decade has witnessed great success of deep learning technology in
many disciplines, especially in computer vision and image processing. However,
deep learning-based video coding remains in its infancy. This paper reviews the
representative works on using deep learning for image/video coding, which
has been an actively developing research area since 2015. We divide
the related works into two categories: new coding schemes that are built
primarily upon deep networks (deep schemes), and deep network-based coding
tools (deep tools) that shall be used within traditional coding schemes or
together with traditional coding tools. For deep schemes, pixel probability
modeling and auto-encoders are the two main approaches, which can be viewed as
predictive coding and transform coding schemes, respectively. For deep
tools, there have been several proposed techniques using deep learning to
perform intra-picture prediction, inter-picture prediction, cross-channel
prediction, probability distribution prediction, transform, post- or in-loop
filtering, down- and up-sampling, as well as encoding optimizations. In the
hope of advocating the research of deep learning-based video coding, we present
a case study of our developed prototype video codec, namely Deep Learning Video
Coding (DLVC). DLVC features two deep tools that are both based on
convolutional neural network (CNN), namely CNN-based in-loop filter (CNN-ILF)
and CNN-based block adaptive resolution coding (CNN-BARC). Both tools help
improve the compression efficiency by a significant margin. With the two deep
tools as well as other non-deep coding tools, DLVC is able to achieve on
average 39.6% and 33.0% bit savings over HEVC, under random-access and
low-delay configurations, respectively. The source code of DLVC has been
released for future research.
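The reported savings are straightforward to interpret: a 39.6% bit saving means the test codec needs only 60.4% of the anchor's bits at equal quality. The one-line helper below (hypothetical name) makes that arithmetic explicit; the paper's figures are presumably averaged rate differences over a test set, which this deliberately simplifies.

```python
def bit_saving(bits_anchor, bits_test):
    """Percentage bit saving of a test codec versus an anchor at equal
    quality (e.g. DLVC versus HEVC)."""
    return 100.0 * (bits_anchor - bits_test) / bits_anchor
```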