Spatially Supervised Recurrent Convolutional Neural Networks for Visual Object Tracking
In this paper, we develop a new approach to visual object tracking based on
spatially supervised recurrent convolutional neural networks. Our recurrent
convolutional network exploits the history of locations as well as the
distinctive visual features learned by the deep neural networks. Inspired by
recent bounding box regression methods for object detection, we study the
regression capability of Long Short-Term Memory (LSTM) in the temporal domain,
and propose to concatenate high-level visual features produced by convolutional
networks with region information. In contrast to existing deep learning based
trackers that use binary classification for region candidates, we use
regression for direct prediction of the tracking locations both at the
convolutional layer and at the recurrent unit. Our extensive experimental
results and performance comparison with state-of-the-art tracking methods on
challenging benchmark video tracking datasets show that our tracker is more
accurate and robust while maintaining low computational cost. For most test
video sequences, our method achieves the best tracking performance, often
outperforming the second best by a large margin.
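As a rough illustration of the idea (not the authors' implementation), the PyTorch sketch below shows an LSTM that consumes pooled convolutional features concatenated with the previous box and directly regresses the next box; all names and dimensions are hypothetical.

import torch
import torch.nn as nn

class RecurrentBoxRegressor(nn.Module):
    def __init__(self, feat_dim=512, hidden_dim=256):
        super().__init__()
        # input per step = pooled conv features + previous box (x, y, w, h)
        self.lstm = nn.LSTM(feat_dim + 4, hidden_dim, batch_first=True)
        self.box_head = nn.Linear(hidden_dim, 4)  # direct regression of the box

    def forward(self, feats, prev_boxes):
        # feats: (B, T, feat_dim) visual features; prev_boxes: (B, T, 4)
        x = torch.cat([feats, prev_boxes], dim=-1)
        h, _ = self.lstm(x)
        return self.box_head(h)  # (B, T, 4) predicted boxes per frame

boxes = RecurrentBoxRegressor()(torch.randn(2, 8, 512), torch.rand(2, 8, 4))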
Self-Attention Recurrent Network for Saliency Detection
Feature maps in deep neural networks generally carry different semantics.
Existing methods often ignore these characteristics, which may lead to sub-optimal
results. In this paper, we propose a novel end-to-end deep saliency network
which could effectively utilize multi-scale feature maps according to their
characteristics. Shallow layers often contain more local information, and deep
layers have advantages in global semantics. Therefore, the network generates
elaborate saliency maps by enhancing local and global information of feature
maps in different layers. On one hand, local information of shallow layers is
enhanced by a recurrent structure that shares convolution kernels across different
time steps. On the other hand, global information of deep layers is utilized by
a self-attention module, which generates different attention weights for
salient objects and backgrounds, thus achieving better performance. Experimental
results on four widely used datasets demonstrate that our method has advantages
in performance over existing algorithms.
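One plausible form of the self-attention module described above is a standard non-local-style spatial attention over a deep feature map; the PyTorch sketch below is an assumption-laden illustration, not the paper's code.

import torch
import torch.nn as nn

class SpatialSelfAttention(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, 1)
        self.key = nn.Conv2d(channels, channels // 8, 1)
        self.value = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (B, HW, C//8)
        k = self.key(x).flatten(2)                      # (B, C//8, HW)
        attn = torch.softmax(q @ k, dim=-1)             # (B, HW, HW) attention weights
        v = self.value(x).flatten(2).transpose(1, 2)    # (B, HW, C)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return out + x  # residual: re-weighted features added back to the input

out = SpatialSelfAttention(64)(torch.randn(1, 64, 28, 28))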
Underwater Multi-Robot Convoying using Visual Tracking by Detection
We present a robust multi-robot convoying approach that relies on visual
detection of the leading agent, thus enabling target following in unstructured
3-D environments. Our method is based on the idea of tracking-by-detection,
which interleaves efficient model-based object detection with temporal
filtering of image-based bounding box estimation. This approach has the
important advantage of mitigating tracking drift (i.e. drifting away from the
target object), which is a common symptom of model-free trackers and is
detrimental to sustained convoying in practice. To illustrate our solution, we
collected extensive footage of an underwater robot in ocean settings, and
hand-annotated its location in each frame. Based on this dataset, we present an
empirical comparison of multiple tracker variants, including the use of several
convolutional neural networks, both with and without recurrent connections, as
well as frequency-based model-free trackers. We also demonstrate the
practicality of this tracking-by-detection strategy in real-world scenarios by
successfully controlling a legged underwater robot in five degrees of freedom
to follow another robot's independent motion.
Accepted to the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
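The abstract does not specify the temporal filter, so the sketch below illustrates the general tracking-by-detection recipe with a simple exponential moving average over per-frame detection boxes; it is an assumption, not the authors' filter.

import numpy as np

def smooth_boxes(detections, alpha=0.6):
    """detections: per-frame (x, y, w, h) tuples, or None when the detector misses."""
    smoothed, state = [], None
    for det in detections:
        if det is not None:
            det = np.asarray(det, dtype=float)
            # blend the new detection with the filtered state to damp jitter and drift
            state = det if state is None else alpha * det + (1 - alpha) * state
        smoothed.append(None if state is None else state.copy())
    return smoothed

print(smooth_boxes([(10, 10, 40, 40), None, (14, 12, 42, 40)]))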
Pixel-wise object tracking
In this paper, we propose a novel pixel-wise visual object tracking framework
that can track any anonymous object in a noisy background. The framework
consists of two submodels, a global attention model and a local segmentation
model. The global model generates a region of interest (ROI) in which the object
may lie in the new frame, based on past object segmentation maps, while the
local model segments the new image in the ROI. Each model uses an LSTM structure
to model the temporal dynamics of the motion and appearance, respectively. To
circumvent the dependency of the training data between the two models, we use
an iterative update strategy. Once the models are trained, there is no need to
refine them to track specific objects, making our method efficient compared to
online learning approaches. We demonstrate our real-time pixel-wise object
tracking framework on a challenging VOT dataset.
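As a minimal sketch of the global-then-local decomposition (with the global attention model reduced to deriving an ROI from the previous mask, and hypothetical module names), one might write:

import torch
import torch.nn as nn

def roi_from_mask(mask, margin=16):
    """mask: (H, W) binary tensor from the previous frame -> (x0, y0, x1, y1)."""
    ys, xs = torch.nonzero(mask, as_tuple=True)
    return (max(int(xs.min()) - margin, 0), max(int(ys.min()) - margin, 0),
            int(xs.max()) + margin, int(ys.max()) + margin)

class LocalSegmenter(nn.Module):
    """Tiny stand-in for the local segmentation model; segments only the ROI crop."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(16, 1, 1))

    def forward(self, crop):
        return torch.sigmoid(self.net(crop))  # per-pixel object probability

frame = torch.rand(1, 3, 240, 320)
prev_mask = torch.zeros(240, 320)
prev_mask[100:140, 150:200] = 1
x0, y0, x1, y1 = roi_from_mask(prev_mask)
new_mask = LocalSegmenter()(frame[:, :, y0:y1, x0:x1])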
Differentiating Objects by Motion: Joint Detection and Tracking of Small Flying Objects
While generic object detection has achieved large improvements with rich
feature hierarchies from deep nets, detecting small objects with poor visual
cues remains challenging. Motion cues from multiple frames may be more
informative for detecting such hard-to-distinguish objects in each frame.
However, how to encode discriminative motion patterns, such as deformations and
pose changes that characterize objects, has remained an open question. To learn
them and thereby realize small object detection, we present a neural model
called the Recurrent Correlational Network, where detection and tracking are
jointly performed over a multi-frame representation learned through a single,
trainable, and end-to-end network. A convolutional long short-term memory
network is utilized for learning informative appearance change for detection,
while the learned representation is shared with the tracker to enhance its
performance. In experiments with datasets containing images of scenes with
small flying objects, such as birds and unmanned aerial vehicles, the proposed
method yielded consistent improvements in detection performance over deep
single-frame detectors and existing motion-based detectors. Furthermore, our
network performs as well as state-of-the-art generic object trackers when
evaluated as a tracker on the bird dataset.
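The convolutional LSTM at the core of such a model keeps its hidden state spatial by computing the gates with convolutions; the cell below is a generic PyTorch sketch of this building block, not the Recurrent Correlational Network itself.

import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        # one convolution computes all four gates from [input, previous hidden]
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state=None):
        if state is None:
            b, _, h, w = x.shape
            state = (x.new_zeros(b, self.hid_ch, h, w),
                     x.new_zeros(b, self.hid_ch, h, w))
        h_prev, c_prev = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h_prev], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c_prev + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, (h, c)

cell = ConvLSTMCell(32, 64)
h, state = cell(torch.randn(1, 32, 56, 56))  # hidden map feeds detection/tracking heads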
Where to Focus: Deep Attention-based Spatially Recurrent Bilinear Networks for Fine-Grained Visual Recognition
Fine-grained visual recognition typically depends on modeling subtle
differences between object parts. However, these parts often exhibit dramatic
visual variations such as occlusions, viewpoints, and spatial transformations,
making them hard to detect. In this paper, we present a novel attention-based
model to automatically, selectively and accurately focus on critical object
regions with higher importance against appearance variations. Given an image,
two different Convolutional Neural Networks (CNNs) are constructed, where the
outputs of two CNNs are correlated through bilinear pooling to simultaneously
focus on discriminative regions and extract relevant features. To capture
spatial distributions among the local regions with visual attention, soft
attention based spatial Long-Short Term Memory units (LSTMs) are incorporated
to realize spatially recurrent yet visually selective processing over local input
patterns. These intuitions together yield the following model: two-stream CNN
layers, a bilinear pooling layer, and a spatially recurrent layer with location
attention are jointly trained in an end-to-end fashion to serve
as the part detector and feature extractor, whereby relevant features are
localized and extracted attentively. We show the significance of our network
on two well-known visual recognition tasks: fine-grained image
classification and person re-identification.
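The bilinear pooling that correlates the two CNN streams can be sketched as an outer product of the per-location features, averaged over space; the following is a generic illustration with assumed shapes, not the paper's code.

import torch
import torch.nn.functional as F

def bilinear_pool(feat_a, feat_b):
    """feat_a: (B, Ca, H, W), feat_b: (B, Cb, H, W) -> (B, Ca*Cb) image descriptor."""
    b, ca, h, w = feat_a.shape
    a = feat_a.flatten(2)                        # (B, Ca, HW)
    bb = feat_b.flatten(2).transpose(1, 2)       # (B, HW, Cb)
    pooled = (a @ bb).flatten(1) / (h * w)       # outer product, averaged over locations
    # signed square-root and L2 normalisation, common post-processing for bilinear features
    pooled = torch.sign(pooled) * torch.sqrt(pooled.abs() + 1e-8)
    return F.normalize(pooled, dim=1)

desc = bilinear_pool(torch.randn(2, 64, 14, 14), torch.randn(2, 32, 14, 14))  # (2, 2048)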
An unsupervised long short-term memory neural network for event detection in cell videos
We propose an automatic unsupervised cell event detection and classification
method, which extends convolutional Long Short-Term Memory (LSTM) neural
networks, for cellular events in cell video sequences. Cells in images that are
captured from various biomedical applications usually have different shapes and
motility, which pose difficulties for the automated event detection in cell
videos. Current methods to detect cellular events are based on supervised
machine learning and rely on tedious manual annotation from investigators with
specific expertise. To allow our LSTM network to be trained in an
unsupervised manner, we designed it with a branched structure where one branch
learns the frequent, regular appearance and movements of objects and the second
learns the stochastic events, which occur rarely and without warning in a cell
video sequence. We tested our network on a publicly available dataset of
densely packed stem cell phase-contrast microscopy images undergoing cell
division. This dataset is considered to be more challenging than a dataset with
sparse cells. We compared our method to several published supervised methods
evaluated on the same dataset and to a supervised LSTM method with a similar
design and configuration to our unsupervised method. We used the F1-score, which
is a balanced measure of both precision and recall. Our results show that our
unsupervised method has a higher or similar F1-score when compared to two fully
supervised methods that are based on Hidden Conditional Random Fields (HCRF),
and has comparable accuracy with the current best supervised HCRF-based method.
Our method generalized well: after being trained on one video, it could be
applied to videos where the cells were in different conditions. The accuracy of
our unsupervised method approached that of its supervised counterpart.
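The paper's branched architecture is not detailed in this abstract; as a loose illustration of unsupervised event detection in the same spirit, the sketch below trains a sequence model to predict the next frame's features and treats large prediction error as a candidate rare event.

import torch
import torch.nn as nn

class FramePredictor(nn.Module):
    def __init__(self, dim=128, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, dim)

    def forward(self, feats):            # feats: (B, T, dim) per-frame features
        h, _ = self.lstm(feats)
        return self.out(h)               # predicted features for the next frame

def event_scores(model, feats):
    pred = model(feats[:, :-1])          # predict frame t+1 from frames up to t
    err = (pred - feats[:, 1:]).pow(2).mean(dim=-1)
    return err                           # high prediction error ~ candidate rare event

scores = event_scores(FramePredictor(), torch.randn(1, 20, 128))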
WW-Nets: Dual Neural Networks for Object Detection
We propose a new deep convolutional neural network framework that uses object
location knowledge implicit in network connection weights to guide selective
attention in object detection tasks. Our approach is called What-Where Nets
(WW-Nets), and it is inspired by the structure of human visual pathways. In the
brain, vision incorporates two separate streams, one in the temporal lobe and
the other in the parietal lobe, called the ventral stream and the dorsal
stream, respectively. The ventral pathway from primary visual cortex is
dominated by "what" information, while the dorsal pathway is dominated by
"where" information. Inspired by this structure, we have proposed an object
detection framework involving the integration of a "What Network" and a "Where
Network". The aim of the What Network is to provide selective attention to the
relevant parts of the input image. The Where Network uses this information to
locate and classify objects of interest. In this paper, we compare this
approach to state-of-the-art algorithms on the PASCAL VOC 2007 and 2012 and
COCO object detection challenge datasets. Also, we compare our approach to
human "ground-truth" attention. We report the results of an eye-tracking
experiment on human subjects using images from PASCAL VOC 2007, and we
demonstrate interesting relationships between human overt attention and
information processing in our WW-Nets. Finally, we provide evidence that our
proposed method performs favorably in comparison to other object detection
approaches, often by a large margin. The code and the eye-tracking ground-truth
dataset can be found at: https://github.com/mkebrahimpour.
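One simple way a "what" stream can guide a "where" stream is to have the first produce a spatial attention map that gates the second's features; the block below is a hypothetical PyTorch sketch of that coupling, not the released WW-Nets code.

import torch
import torch.nn as nn

class WhatWhereBlock(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        # "what" stream: produces a spatial attention map in [0, 1]
        self.what = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(ch, 1, 1), nn.Sigmoid())
        # "where" stream: features later used by localisation/classification heads
        self.where = nn.Sequential(nn.Conv2d(3, ch, 3, padding=1), nn.ReLU())

    def forward(self, img):
        attn = self.what(img)            # (B, 1, H, W)
        feat = self.where(img)           # (B, ch, H, W)
        return feat * attn, attn         # attended features plus the attention map itself

feat, attn = WhatWhereBlock()(torch.rand(1, 3, 128, 128))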
Fully-Convolutional Siamese Networks for Object Tracking
The problem of arbitrary object tracking has traditionally been tackled by
learning a model of the object's appearance exclusively online, using as sole
training data the video itself. Despite the success of these methods, their
online-only approach inherently limits the richness of the model they can
learn. Recently, several attempts have been made to exploit the expressive
power of deep convolutional networks. However, when the object to track is not
known beforehand, it is necessary to perform Stochastic Gradient Descent online
to adapt the weights of the network, severely compromising the speed of the
system. In this paper we equip a basic tracking algorithm with a novel
fully-convolutional Siamese network trained end-to-end on the ILSVRC15 dataset
for object detection in video. Our tracker operates at frame-rates beyond
real-time and, despite its extreme simplicity, achieves state-of-the-art
performance in multiple benchmarks. Code available at
http://www.robots.ox.ac.uk/~luca/siamese-fc.htm
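The core of a fully-convolutional Siamese tracker is a cross-correlation in which the exemplar embedding acts as a convolution kernel over the search-region embedding; the sketch below shows that step in PyTorch with assumed feature shapes, not the authors' released code.

import torch
import torch.nn.functional as F

def xcorr(exemplar_feat, search_feat):
    """exemplar_feat: (B, C, h, w), search_feat: (B, C, H, W) -> (B, 1, H-h+1, W-w+1)."""
    b, c, h, w = exemplar_feat.shape
    # grouped convolution: each batch item's exemplar acts as its own kernel
    search = search_feat.reshape(1, b * c, *search_feat.shape[2:])
    kernel = exemplar_feat.reshape(b * c, 1, h, w)
    score = F.conv2d(search, kernel, groups=b * c)
    return score.reshape(b, c, *score.shape[2:]).sum(dim=1, keepdim=True)

score_map = xcorr(torch.randn(2, 256, 6, 6), torch.randn(2, 256, 22, 22))  # (2, 1, 17, 17)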
Fast Recurrent Fully Convolutional Networks for Direct Perception in Autonomous Driving
Deep convolutional neural networks (CNNs) have been shown to perform
extremely well at a variety of tasks including subtasks of autonomous driving
such as image segmentation and object classification. However, networks
designed for these tasks typically require vast quantities of training data and
long training periods to converge. We investigate the design rationale behind
end-to-end driving network designs by proposing and comparing three small and
computationally inexpensive deep end-to-end neural network models that generate
driving control signals directly from input images. In contrast to prior work
that segments the autonomous driving task, our models take a novel approach
to the autonomous driving problem by utilizing deep and thin Fully
Convolutional Nets (FCNs) with recurrent neural nets and low parameter counts
to tackle a complex end-to-end regression task predicting both steering and
acceleration commands. In addition, we include layers optimized for
classification to allow the networks to implicitly learn image semantics. We
show that the resulting networks use 3x fewer parameters than the most recent
comparable end-to-end driving network and 500x fewer parameters than the
AlexNet variations and converge both faster and to lower losses while
maintaining robustness against overfitting.
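A minimal sketch of such a network (assumed shapes and layer sizes, not the paper's architecture) is a small convolutional encoder followed by an LSTM that regresses steering and acceleration per frame:

import torch
import torch.nn as nn

class DrivingNet(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(16, 32, 5, stride=2, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))                 # keeps the encoder input-size agnostic
        self.lstm = nn.LSTM(32, hidden, batch_first=True)
        self.control = nn.Linear(hidden, 2)          # [steering, acceleration]

    def forward(self, clips):                        # clips: (B, T, 3, H, W)
        b, t = clips.shape[:2]
        feats = self.encoder(clips.flatten(0, 1)).flatten(1).view(b, t, -1)
        h, _ = self.lstm(feats)
        return self.control(h)                       # (B, T, 2) control signals per frame

controls = DrivingNet()(torch.rand(2, 5, 3, 66, 200))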