Object-Adaptive LSTM Network for Real-time Visual Tracking with Adversarial Data Augmentation
In recent years, deep learning based visual tracking methods have obtained
great success owing to the powerful feature representation ability of
Convolutional Neural Networks (CNNs). Among these methods, classification-based
tracking methods exhibit excellent performance while their speeds are heavily
limited by the expensive computation for massive proposal feature extraction.
In contrast, matching-based tracking methods (such as Siamese networks) possess
remarkable speed superiority. However, the absence of online updating renders
these methods unadaptable to significant object appearance variations. In this
paper, we propose a novel real-time visual tracking method, which adopts an
object-adaptive LSTM network to effectively capture the video sequential
dependencies and adaptively learn the object appearance variations. For high
computational efficiency, we also present a fast proposal selection strategy,
which utilizes the matching-based tracking method to pre-estimate dense
proposals and selects high-quality ones to feed to the LSTM network for
classification. This strategy efficiently filters out some irrelevant proposals
and avoids the redundant computation for feature extraction, which enables our
method to operate faster than conventional classification-based tracking
methods. In addition, to handle the problems of sample inadequacy and class
imbalance during online tracking, we adopt a data augmentation technique based
on the Generative Adversarial Network (GAN) to facilitate the training of the
LSTM network. Extensive experiments on four visual tracking benchmarks
demonstrate the state-of-the-art performance of our method in terms of both
tracking accuracy and speed, which demonstrates the great potential of recurrent
structures for visual tracking.
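The fast proposal selection idea in this abstract (a cheap matching score prunes the dense proposals before the expensive classifier runs) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `select_proposals`, `track_step`, `match_score`, and `classify` are hypothetical, and both scoring functions are stand-ins for the matching-based tracker and the LSTM classifier.

```python
from typing import Callable, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height), an assumed layout


def select_proposals(
    proposals: Sequence[Box],
    match_score: Callable[[Box], float],
    keep: int,
) -> List[Box]:
    """Keep the `keep` proposals with the highest cheap matching score."""
    if keep <= 0:
        return []
    ranked = sorted(proposals, key=match_score, reverse=True)
    return ranked[:keep]


def track_step(
    proposals: Sequence[Box],
    match_score: Callable[[Box], float],
    classify: Callable[[Box], float],
    keep: int,
) -> Box:
    """Run the expensive classifier only on the pre-selected proposals."""
    candidates = select_proposals(proposals, match_score, keep)
    if not candidates:
        raise ValueError("no proposals left after selection")
    # The classifier is evaluated once per surviving candidate.
    return max(candidates, key=classify)
```

With `keep` much smaller than the number of dense proposals, the classifier cost per frame scales with `keep` rather than with the full proposal set, which is the source of the speed-up the abstract describes.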
Looking Fast and Slow: Memory-Guided Mobile Video Object Detection
With a single eye fixation lasting a fraction of a second, the human visual
system is capable of forming a rich representation of a complex environment,
reaching a holistic understanding which facilitates object recognition and
detection. This phenomenon is known as recognizing the "gist" of the scene and
is accomplished by relying on relevant prior knowledge. This paper addresses
the analogous question of whether using memory in computer vision systems can
not only improve the accuracy of object detection in video streams, but also
reduce the computation time. By interleaving conventional feature extractors
with extremely lightweight ones which only need to recognize the gist of the
scene, we show that minimal computation is required to produce accurate
detections when temporal memory is present. In addition, we show that the
memory contains enough information for deploying reinforcement learning
algorithms to learn an adaptive inference policy. Our model achieves
state-of-the-art performance among mobile methods on the Imagenet VID 2015
dataset, while running at speeds of up to 70+ FPS on a Pixel 3 phone.
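The interleaving of a conventional feature extractor with a lightweight one can be sketched with a fixed schedule. This is only an illustration under assumptions: the function `interleaved_features` and the `period` parameter are hypothetical, the two extractors are arbitrary callables, and the sketch omits both the temporal memory and the learned adaptive policy that the abstract mentions.

```python
from typing import Callable, Iterable, Iterator, TypeVar

Frame = TypeVar("Frame")
Feature = TypeVar("Feature")


def interleaved_features(
    frames: Iterable[Frame],
    heavy: Callable[[Frame], Feature],
    light: Callable[[Frame], Feature],
    period: int,
) -> Iterator[Feature]:
    """Yield features, running `heavy` on every `period`-th frame and `light` otherwise."""
    if period < 1:
        raise ValueError("period must be at least 1")
    for index, frame in enumerate(frames):
        # Frame 0, period, 2 * period, ... use the expensive extractor.
        extractor = heavy if index % period == 0 else light
        yield extractor(frame)
```

With `period = 3`, for example, two of every three frames pay only the lightweight cost; in the paper's setting a memory module would then fuse these features over time.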
Re3 : Real-Time Recurrent Regression Networks for Visual Tracking of Generic Objects
Robust object tracking requires knowledge and understanding of the object
being tracked: its appearance, its motion, and how it changes over time. A
tracker must be able to modify its underlying model and adapt to new
observations. We present Re3, a real-time deep object tracker capable of
incorporating temporal information into its model. Rather than focusing on a
limited set of objects or training a model at test-time to track a specific
instance, we pretrain our generic tracker on a large variety of objects and
efficiently update on the fly; Re3 simultaneously tracks and updates the
appearance model with a single forward pass. This lightweight model is capable
of tracking objects at 150 FPS, while attaining competitive results on
challenging benchmarks. We also show that our method handles temporary
occlusion better than other comparable trackers using experiments that directly
measure performance on sequences with occlusion.
Comment: Presented at ICRA 201
Hierarchical Attentive Recurrent Tracking
Class-agnostic object tracking is particularly difficult in cluttered
environments as target specific discriminative models cannot be learned a
priori. Inspired by how the human visual cortex employs spatial attention and
separate "where" and "what" processing pathways to actively suppress irrelevant
visual features, this work develops a hierarchical attentive recurrent model
for single object tracking in videos. The first layer of attention discards the
majority of background by selecting a region containing the object of interest,
while the subsequent layers tune in on visual features particular to the
tracked object. This framework is fully differentiable and can be trained in a
purely data driven fashion by gradient methods. To improve training
convergence, we augment the loss function with terms for a number of auxiliary
tasks relevant for tracking. Evaluation of the proposed model is performed on
two datasets: pedestrian tracking on the KTH activity recognition dataset and
the more difficult KITTI object tracking dataset.
Comment: Published as a conference paper at NIPS 2017. Code is available at
https://github.com/akosiorek/hart and qualitative results are available at
https://youtu.be/Vvkjm0FRGS
Recurrent Filter Learning for Visual Tracking
Recently, convolutional neural networks (CNNs) have gained popularity in
visual tracking due to their robust feature representation of images. Recent
methods perform online tracking by fine-tuning a pre-trained CNN model to the
specific target object using stochastic gradient descent (SGD)
back-propagation, which is usually time-consuming. In this paper, we propose a
recurrent filter generation method for visual tracking. We directly feed the
target's image patch to a recurrent neural network (RNN) to estimate an
object-specific filter for tracking. As a video sequence is spatiotemporal
data, we extend the matrix multiplications of the fully-connected layers of the
RNN to a convolution operation on feature maps, which preserves the target's
spatial structure and also is memory-efficient. The tracked object in the
subsequent frames will be fed into the RNN to adapt the generated filters to
appearance variations of the target. Note that once the off-line training
process of our network is finished, there is no need to fine-tune the network
for specific objects, which makes our approach more efficient than methods that
use iterative fine-tuning to learn the target online. Extensive experiments
conducted on widely used benchmarks, OTB and VOT, demonstrate encouraging
results compared to other recent methods.
Comment: ICCV2017 Workshop on VO
Differentiating Objects by Motion: Joint Detection and Tracking of Small Flying Objects
While generic object detection has achieved large improvements with rich
feature hierarchies from deep nets, detecting small objects with poor visual
cues remains challenging. Motion cues from multiple frames may be more
informative for detecting such hard-to-distinguish objects in each frame.
However, how to encode discriminative motion patterns, such as deformations and
pose changes that characterize objects, has remained an open question. To learn
them and thereby realize small object detection, we present a neural model
called the Recurrent Correlational Network, where detection and tracking are
jointly performed over a multi-frame representation learned through a single,
trainable, and end-to-end network. A convolutional long short-term memory
network is utilized for learning informative appearance change for detection,
while learned representation is shared in tracking for enhancing its
performance. In experiments with datasets containing images of scenes with
small flying objects, such as birds and unmanned aerial vehicles, the proposed
method yielded consistent improvements in detection performance over deep
single-frame detectors and existing motion-based detectors. Furthermore, our
network performs as well as state-of-the-art generic object trackers when
evaluated as a tracker on the bird dataset.
Comment: 10 pages, 8 figure
Context-Aware Deep Spatio-Temporal Network for Hand Pose Estimation from Depth Images
As a fundamental and challenging problem in computer vision, hand pose
estimation aims to estimate the hand joint locations from depth images.
Typically, the problem is modeled as learning a mapping function from images to
hand joint coordinates in a data-driven manner. In this paper, we propose
Context-Aware Deep Spatio-Temporal Network (CADSTN), a novel method to jointly
model the spatio-temporal properties for hand pose estimation. Our proposed
network is able to learn the representations of the spatial information and the
temporal structure from the image sequences. Moreover, by adopting an adaptive
fusion method, the model is capable of dynamically weighting different
predictions to lay emphasis on sufficient context. Our method is examined on
two common benchmarks; the experimental results demonstrate that our proposed
approach achieves the best or second-best performance compared with
state-of-the-art methods and runs at 60 fps.
Comment: IEEE Transactions On Cybernetic
Video-based Sign Language Recognition without Temporal Segmentation
Millions of hearing-impaired people around the world routinely use some
variant of sign language to communicate, so the automatic translation of
sign language is meaningful and important. Currently, there are two
sub-problems in Sign Language Recognition (SLR), i.e., isolated SLR that
recognizes word by word and continuous SLR that translates entire sentences.
Existing continuous SLR methods typically utilize isolated SLRs as building
blocks, with an extra layer of preprocessing (temporal segmentation) and
another layer of post-processing (sentence synthesis). Unfortunately, temporal
segmentation itself is non-trivial and inevitably propagates errors into
subsequent steps. Worse still, isolated SLR methods typically require strenuous
labeling of each word separately in a sentence, severely limiting the amount of
attainable training data. To address these challenges, we propose a novel
continuous sign recognition framework, the Hierarchical Attention Network with
Latent Space (LS-HAN), which eliminates the preprocessing of temporal
segmentation. The proposed LS-HAN consists of three components: a two-stream
Convolutional Neural Network (CNN) for video feature representation generation,
a Latent Space (LS) for semantic gap bridging, and a Hierarchical Attention
Network (HAN) for latent space based recognition. Experiments are carried out
on two large scale datasets. Experimental results demonstrate the effectiveness
of the proposed framework.
Comment: 32nd AAAI Conference on Artificial Intelligence (AAAI-18), Feb. 2-7,
2018, New Orleans, Louisiana, US
Endo-VMFuseNet: Deep Visual-Magnetic Sensor Fusion Approach for Uncalibrated, Unsynchronized and Asymmetric Endoscopic Capsule Robot Localization Data
In the last decade, researchers and medical device companies have made major
advances towards transforming passive capsule endoscopes into active medical
robots. One of the major challenges is to endow capsule robots with accurate
perception of the environment inside the human body, which will provide
necessary information and enable improved medical procedures. We extend the
success of deep learning approaches from various research fields to the problem
of uncalibrated, asynchronous, and asymmetric sensor fusion for endoscopic
capsule robots. Experiments performed on real pig stomach datasets show that
our method achieves sub-millimeter precision for both translational and
rotational movements and offers various advantages over traditional sensor
fusion techniques.
Comment: Submitted to ICRA 201
Cascade LSTM Based Visual-Inertial Navigation for Magnetic Levitation Haptic Interaction
Haptic feedback is essential for an immersive experience when interacting
in virtual or augmented reality. Although the existing promising magnetic
levitation (maglev) haptic system has the advantage of no mechanical friction,
its performance is limited by its navigation method, mainly because it is
difficult to obtain high precision, a high frame rate, and good stability
with a lightweight design at the same time. In this study, we
propose to perform the visual-inertial fusion navigation based on
sequence-to-sequence learning for the maglev haptic interaction. A cascade LSTM
based increment learning method is first presented to progressively learn the
increments of the target variables. Then, two cascade LSTM networks are
separately trained for accomplishing the visual-inertial fusion navigation in a
loosely-coupled mode. Additionally, we set up a maglev haptic platform as the
system testbed. Experimental results show that the proposed cascade LSTM
based-increment learning method can achieve high-precision prediction, and our
cascade LSTM based visual-inertial fusion navigation method can reach 200Hz
while maintaining high-precision navigation (mean absolute position and
orientation errors below 1 mm and 0.02°, respectively) for the maglev haptic
interaction application.
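To make the idea of learning increments concrete, the sketch below shows how absolute values could be recovered from a sequence of predicted increments and how a mean absolute error is computed. It is an assumption-laden illustration, not the paper's method: the helper names `integrate_increments` and `mean_absolute_error` are hypothetical, the increments would in practice come from the trained networks, and simple summation is assumed as the way increments are combined.

```python
from itertools import accumulate
from typing import List, Sequence


def integrate_increments(start: float, increments: Sequence[float]) -> List[float]:
    """Return the running values obtained by adding each increment to `start`."""
    # accumulate(..., initial=...) requires Python 3.8 or newer.
    return list(accumulate(increments, initial=start))


def mean_absolute_error(predicted: Sequence[float], reference: Sequence[float]) -> float:
    """Average of |predicted - reference| over two equally long sequences."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(abs(p - r) for p, r in zip(predicted, reference)) / len(predicted)
```

Predicting small increments rather than absolute poses keeps the regression targets in a narrow range; the trade-off is that errors in individual increments accumulate when they are summed.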