RATM: Recurrent Attentive Tracking Model
We present an attention-based modular neural framework for computer vision.
The framework uses a soft attention mechanism allowing models to be trained
with gradient descent. It consists of three modules: a recurrent attention
module controlling where to look in an image or video frame, a
feature-extraction module providing a representation of what is seen, and an
objective module formalizing why the model learns its attentive behavior. The
attention module allows the model to focus computation on task-related
information in the input. We apply the framework to several object tracking
tasks and explore various design choices. We experiment with three data sets:
bouncing balls, moving digits, and the real-world KTH data set. The proposed
Recurrent Attentive Tracking Model performs well on all three tasks and can
generalize to related but previously unseen sequences from a challenging
tracking data set.
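For intuition about how such a differentiable attention read can work, below is a minimal NumPy sketch of a DRAW-style grid-of-Gaussians glimpse; the abstract does not give the exact mechanism, so the function, its parameters (center, stride, sigma), and the glimpse size are illustrative assumptions. A recurrent controller would predict the attention parameters at each step and receive gradients through the read.

```python
import numpy as np

def gaussian_glimpse(image, center, stride, sigma, n=8):
    # Extract an n x n glimpse with a grid of 1-D Gaussian filters per
    # axis, so the read is differentiable w.r.t. (center, stride, sigma),
    # the parameters a recurrent attention module could predict each step.
    H, W = image.shape
    offsets = (np.arange(n) - n / 2.0 + 0.5) * stride
    mu_y = center[0] + offsets                     # filter centers, y axis
    mu_x = center[1] + offsets                     # filter centers, x axis
    ys = np.arange(H)[None, :]
    xs = np.arange(W)[None, :]
    Fy = np.exp(-((ys - mu_y[:, None]) ** 2) / (2 * sigma ** 2))
    Fx = np.exp(-((xs - mu_x[:, None]) ** 2) / (2 * sigma ** 2))
    Fy /= Fy.sum(axis=1, keepdims=True) + 1e-8     # normalize each filter
    Fx /= Fx.sum(axis=1, keepdims=True) + 1e-8
    return Fy @ image @ Fx.T                       # separable soft read

glimpse = gaussian_glimpse(np.random.rand(64, 64), center=(32.0, 32.0),
                           stride=2.0, sigma=1.5)
print(glimpse.shape)  # (8, 8)
```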
Pixel-Level Matching for Video Object Segmentation using Convolutional Neural Networks
We propose a novel video object segmentation algorithm based on pixel-level
matching using Convolutional Neural Networks (CNN). Our network aims to
distinguish the target area from the background on the basis of the pixel-level
similarity between two object units. The proposed network represents a target
object using features from different depth layers in order to take advantage of
both the spatial details and the category-level semantic information.
Furthermore, we propose a feature compression technique that drastically
reduces the memory requirements while maintaining the capability of feature
representation. Two-stage training (pre-training and fine-tuning) allows our
network to handle any target object regardless of its category (even if the
object's type does not belong to the pre-training data) or of variations in its
appearance through a video sequence. Experiments on large datasets demonstrate
the effectiveness of our model - against related methods - in terms of
accuracy, speed, and stability. Finally, we demonstrate the transferability of
our network to different domains, such as the infrared data domain.
Comment: To appear at ICCV 201
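To make the pixel-level matching idea concrete, here is a minimal NumPy sketch of per-pixel cosine similarity between a target unit's feature map and a search frame's feature map; the shapes and the absence of any learned aggregation are illustrative assumptions, not the paper's network.

```python
import numpy as np

def pixelwise_similarity(target_feat, search_feat):
    # target_feat: (C, h, w) features of the target unit;
    # search_feat: (C, H, W) features of the current frame.
    # Returns (h*w, H, W): one cosine-similarity map per target pixel,
    # which a matching network could aggregate into a segmentation mask.
    C, h, w = target_feat.shape
    t = target_feat.reshape(C, -1)                          # (C, h*w)
    s = search_feat.reshape(C, -1)                          # (C, H*W)
    t = t / (np.linalg.norm(t, axis=0, keepdims=True) + 1e-8)
    s = s / (np.linalg.norm(s, axis=0, keepdims=True) + 1e-8)
    sim = t.T @ s                                           # (h*w, H*W)
    return sim.reshape(h * w, *search_feat.shape[1:])

maps = pixelwise_similarity(np.random.rand(64, 8, 8),
                            np.random.rand(64, 32, 32))
print(maps.shape)  # (64, 32, 32)
```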
CloudAR: A Cloud-based Framework for Mobile Augmented Reality
The computational capabilities of recent mobile devices enable natural feature
processing for Augmented Reality (AR). However, mobile AR applications are
still faced with scalability and performance challenges. In this paper, we
propose CloudAR, a mobile AR framework utilizing the advantages of cloud and
edge computing through recognition task offloading. We explore the design space
of cloud-based AR exhaustively and optimize the offloading pipeline to minimize
time and energy consumption. We design an innovative tracking system for
mobile devices that provides lightweight tracking in six degrees of freedom
(6DoF) and hides the offloading latency from users' perception. We also design
a multi-object image retrieval pipeline that executes fast and accurate image
recognition tasks on servers. In our evaluations, the mobile AR application
built with the CloudAR framework runs at 30 frames per second (FPS) on average,
with tracking errors of only 1-2 pixels and image recognition accuracy of at
least 97%. Our results also show that CloudAR outperforms one of the leading
commercial AR frameworks on several performance metrics.
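The latency-hiding idea can be sketched as follows: the device keeps running its lightweight 6DoF tracker while a recognition request is in flight, then composes the server result with the relative motion accumulated since the request. Every function name and timing below is an illustrative stand-in, not the CloudAR API.

```python
from concurrent.futures import ThreadPoolExecutor
import itertools, time

def recognize_on_server(frame):
    # Stand-in for the offloaded image-recognition request.
    time.sleep(0.2)                        # simulated network round trip
    return {"object": "poster", "pose_at_request": "P0"}

def track_relative_motion(prev_frame, frame):
    # Stand-in for the lightweight on-device 6DoF tracker.
    return f"delta({prev_frame}->{frame})"

executor = ThreadPoolExecutor(max_workers=1)
frames = (f"frame{i}" for i in itertools.count())

anchor = next(frames)
pending = executor.submit(recognize_on_server, anchor)   # offload
deltas, prev = [], anchor
while not pending.done():                  # keep tracking locally meanwhile
    cur = next(frames)
    deltas.append(track_relative_motion(prev, cur))
    prev = cur
    time.sleep(1 / 30)                     # ~30 FPS frame pacing
result = pending.result()
# Compose the server pose with the motion accumulated since the request,
# so the annotation lands on the *current* frame, hiding the latency.
print(result["object"], "re-anchored through", len(deltas), "pose deltas")
```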
Generic Multiview Visual Tracking
Recent progress in visual tracking has greatly improved tracking
performance. However, challenges such as occlusion and view change remain
obstacles in real world deployment. A natural solution to these challenges is
to use multiple cameras with multiview inputs, though existing systems are
mostly limited to specific targets (e.g. human), static cameras, and/or camera
calibration. To break through these limitations, we propose a generic multiview
tracking (GMT) framework that allows camera movement while requiring neither
a specific object model nor camera calibration. A key innovation in our framework
is a cross-camera trajectory prediction network (TPN), which implicitly and
dynamically encodes camera geometric relations, and hence addresses missing
target issues such as occlusion. Moreover, during tracking, we assemble
information across different cameras to dynamically update a novel
collaborative correlation filter (CCF), which is shared among cameras to
achieve robustness against view change. The two components are integrated into
a correlation filter tracking framework, where the features are trained offline
using existing single view tracking datasets. For evaluation, we first
contribute a new generic multiview tracking dataset (GMTD) with careful
annotations, and then run experiments on GMTD and the PETS2009 datasets. On
both datasets, the proposed GMT algorithm shows clear advantages over
state-of-the-art ones.
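As a rough illustration of a correlation filter shared across views, the sketch below trains a closed-form single-channel MOSSE-style filter per camera in the Fourier domain and averages them; the actual collaborative correlation filter (CCF) is more elaborate, so treat this only as a sketch of the shared-filter idea.

```python
import numpy as np

def train_filter(patch, target_response, lam=1e-2):
    # Closed-form MOSSE-style solution: returns the conjugate filter H*
    # that maps the patch to the desired Gaussian response.
    F = np.fft.fft2(patch)
    G = np.fft.fft2(target_response)
    return (G * np.conj(F)) / (F * np.conj(F) + lam)

def respond(filter_conj, patch):
    # Correlate the patch with the filter to get a response map.
    return np.real(np.fft.ifft2(filter_conj * np.fft.fft2(patch)))

rng = np.random.default_rng(0)
patches = [rng.standard_normal((32, 32)) for _ in range(3)]   # 3 views
gauss = np.exp(-((np.arange(32) - 16) ** 2) / 8.0)
g = np.outer(gauss, gauss)                  # desired response peak
# A shared filter aggregates the per-view filters (here a plain average).
shared = np.mean([train_filter(p, g) for p in patches], axis=0)
print(respond(shared, patches[0]).shape)    # (32, 32)
```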
Self-taught learning of a deep invariant representation for visual tracking via temporal slowness principle
Visual representation is crucial for a visual tracking method's performance.
Conventionally, visual representations adopted in visual tracking rely on
hand-crafted computer vision descriptors. These descriptors were developed
generically without considering tracking-specific information. In this paper,
we propose to learn complex-valued invariant representations from tracked
sequential image patches, via strong temporal slowness constraint and stacked
convolutional autoencoders. The deep slow local representations are learned
offline on unlabeled data and transferred to the observational model of our
proposed tracker. The proposed observational model retains old training samples
to alleviate drift and collects negative samples that are coherent with the
target's motion pattern for better discriminative tracking. With the learned
representation and online training samples, a logistic regression classifier is
adopted to distinguish target from background, and retrained online to adapt to
appearance changes. Subsequently, the observational model is integrated into a
particle filter framework to perform visual tracking. Experimental results on
various challenging benchmark sequences demonstrate that the proposed tracker
performs favourably against several state-of-the-art trackers.
Comment: Pattern Recognition (Elsevier), 201
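The temporal slowness principle amounts to penalizing rapid changes in the codes of consecutive tracked patches. Below is a minimal PyTorch sketch of such a loss on a plain real-valued autoencoder; the paper itself uses complex-valued codes and stacked convolutional autoencoders, so this is an assumption-laden simplification.

```python
import torch

def slowness_loss(z, recon, x, lam=1.0):
    # z: (T, D) codes of T consecutive tracked patches; recon/x: (T, F).
    # Reconstruction term plus a penalty on fast code changes over time.
    rec = torch.mean((recon - x) ** 2)
    slow = torch.mean((z[1:] - z[:-1]) ** 2)
    return rec + lam * slow

T, D = 10, 32
x = torch.randn(T, 64)                 # toy flattened patch sequence
enc = torch.nn.Linear(64, D)           # stand-in encoder
dec = torch.nn.Linear(D, 64)           # stand-in decoder
z = enc(x)
loss = slowness_loss(z, dec(z), x)
loss.backward()                        # trainable end to end
print(float(loss))
```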
Track Everything: Limiting Prior Knowledge in Online Multi-Object Recognition
This paper addresses the problem of online tracking and classification of
multiple objects in an image sequence. Our proposed solution is to first track
all objects in the scene without relying on object-specific prior knowledge,
which in other systems can take the form of hand-crafted features or user-based
track initialization. We then classify the tracked objects with a fast-learning
image classifier that is based on a shallow convolutional neural network
architecture and demonstrate that object recognition improves when this is
combined with object state information from the tracking algorithm. We argue
that by transferring the use of prior knowledge from the detection and tracking
stages to the classification stage we can design a robust, general purpose
object recognition system with the ability to detect and track a variety of
object types. We describe our biologically inspired implementation, which
adaptively learns the shape and motion of tracked objects, and apply it to the
Neovision2 Tower benchmark data set, which contains multiple object types. An
experimental evaluation demonstrates that our approach is competitive with
state-of-the-art video object recognition systems that do make use of
object-specific prior knowledge in detection and tracking, while providing
additional practical advantages by virtue of its generality.
Comment: 15 pages
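One simple way to let track state inform recognition is to accumulate the classifier's per-frame class probabilities along a track, as in the illustrative NumPy fusion rule below; the paper fuses richer state (adaptively learned shape and motion), so this is not its actual mechanism.

```python
import numpy as np

def fuse_track_scores(frame_probs, decay=0.8):
    # Exponentially smooth per-frame class probabilities along a track,
    # so track continuity sharpens the recognition decision.
    fused = frame_probs[0]
    for p in frame_probs[1:]:
        fused = decay * fused + (1 - decay) * p
        fused = fused / fused.sum()        # keep a valid distribution
    return fused

probs = [np.array([0.4, 0.6]), np.array([0.7, 0.3]), np.array([0.8, 0.2])]
print(fuse_track_scores(probs))            # fused class distribution
```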
An unsupervised long short-term memory neural network for event detection in cell videos
We propose an automatic unsupervised cell event detection and classification
method, which extends convolutional Long Short-Term Memory (LSTM) neural
networks, for cellular events in cell video sequences. Cells in images that are
captured from various biomedical applications usually have different shapes and
motility, which pose difficulties for the automated event detection in cell
videos. Current methods to detect cellular events are based on supervised
machine learning and rely on tedious manual annotation from investigators with
specific expertise. To allow our LSTM network to be trained in an
unsupervised manner, we designed it with a branched structure where one branch
learns the frequent, regular appearance and movements of objects and the second
learns the stochastic events, which occur rarely and without warning in a cell
video sequence. We tested our network on a publicly available dataset of
densely packed stem cell phase-contrast microscopy images undergoing cell
division. This dataset is considered to be more challenging than a dataset with
sparse cells. We compared our method to several published supervised methods
evaluated on the same dataset and to a supervised LSTM method with a similar
design and configuration to our unsupervised method. We used the F1-score,
a balanced measure of both precision and recall. Our results show that our
unsupervised method has a higher or similar F1-score when compared to two fully
supervised methods that are based on Hidden Conditional Random Fields (HCRF),
and has comparable accuracy with the current best supervised HCRF-based method.
Our method generalizes well: after being trained on one video, it could be
applied to videos in which the cells were in different conditions. The accuracy
of our unsupervised method approached that of its supervised counterpart.
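The branched design can be caricatured as a two-branch sequence autoencoder whose per-frame reconstruction error flags candidate events. The PyTorch sketch below substitutes plain LSTMs on flattened features for the paper's convolutional LSTM branches; sizes and the error-based flagging rule are illustrative.

```python
import torch
import torch.nn as nn

class BranchedAE(nn.Module):
    # One branch models regular appearance/motion, the other absorbs
    # rare, stochastic events; a decoder reconstructs from both.
    def __init__(self, d_in=128, d_h=64):
        super().__init__()
        self.regular = nn.LSTM(d_in, d_h, batch_first=True)
        self.event = nn.LSTM(d_in, d_h, batch_first=True)
        self.dec = nn.Linear(2 * d_h, d_in)

    def forward(self, x):                       # x: (B, T, d_in)
        hr, _ = self.regular(x)
        he, _ = self.event(x)
        return self.dec(torch.cat([hr, he], dim=-1))

model = BranchedAE()
x = torch.randn(2, 16, 128)                     # 2 toy cell-patch sequences
err = ((model(x) - x) ** 2).mean(dim=-1)        # per-frame reconstruction error
# Frames the learned dynamics cannot explain score high and can be
# flagged as candidate cell events (e.g., divisions).
print(err.shape)  # torch.Size([2, 16])
```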
Vision-based Traffic Flow Prediction using Dynamic Texture Model and Gaussian Process
In this paper, we describe work in progress towards a real-time vision-based
traffic flow prediction (TFP) system. The proposed method consists of three
elemental operators: dynamic-texture-model-based motion segmentation,
feature extraction, and Gaussian process (GP) regression. The objective of
motion segmentation is to identify the target regions covering the moving
vehicles in the visual sequence. The feature extraction operator
aims to extract useful features from the target regions. The extracted features
are then mapped to the number of vehicles through the operator of GP
regression. A training stage using historical visual data is required for
determining the parameter values of the GP. Using a low-resolution visual data
set, we performed preliminary evaluations on the performance of the proposed
method. The results show that our method outperforms a benchmark solution based
on a Gaussian mixture model and has the potential to be developed into a
practical solution for real-time TFP.
Comment: 8 pages, 4 figures, conference
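The final operator, GP regression from region features to vehicle counts, looks roughly like the scikit-learn sketch below on synthetic data; the two features, the kernel choice, and the synthetic counts are assumptions for illustration.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Toy stand-ins for features extracted from segmented motion regions
# (e.g., foreground area, texture statistics) -> vehicle count.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(50, 2))                        # 2 toy features
y = 10 * X[:, 0] + 3 * X[:, 1] + rng.normal(0, 0.3, 50)    # synthetic counts

gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
gp.fit(X, y)                                   # "training stage" on history
mean, std = gp.predict(rng.uniform(0, 1, (5, 2)), return_std=True)
print(np.round(mean, 1), np.round(std, 2))     # counts with uncertainty
```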
MAVOT: Memory-Augmented Video Object Tracking
We introduce a one-shot learning approach for video object tracking. The
proposed algorithm requires seeing the object to be tracked only once, and
employs an external memory to store and remember the evolving features of the
foreground object as well as backgrounds over time during tracking. With the
relevant memory retrieved and updated at each tracking step, our model is
capable of maintaining long-term memory of the object, and thus can naturally
deal with hard tracking scenarios including partial and total occlusion, motion
changes and large scale and shape variations. In our experiments we use the
ImageNet ILSVRC2015 video detection dataset to train and use the VOT-2016
benchmark to test and compare our Memory-Augmented Video Object Tracking
(MAVOT) model. From the results, we conclude that, given its one-shot property
and simplicity of design, MAVOT is an attractive approach to visual tracking:
it shows good performance on the VOT-2016 benchmark and is among the top 5
performers in accuracy and robustness under occlusion, motion changes, and
empty target.
Comment: Submitted to CVPR201
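The core memory operations, a soft attention read over stored features and a write that nudges the most relevant slot toward the newest appearance, can be sketched as below; the slot layout, similarity measure, and update rule are illustrative guesses rather than MAVOT's exact scheme.

```python
import numpy as np

def memory_read(memory, query):
    # Soft read: attention weights from cosine similarity over slots.
    m = memory / (np.linalg.norm(memory, axis=1, keepdims=True) + 1e-8)
    q = query / (np.linalg.norm(query) + 1e-8)
    w = np.exp(m @ q)
    w /= w.sum()                          # attention weights over slots
    return w @ memory, w

def memory_write(memory, w, new_feat, lr=0.5):
    # Move the most-attended slot toward the newest object feature so
    # the memory tracks appearance changes over time (illustrative rule).
    i = int(np.argmax(w))
    memory[i] = (1 - lr) * memory[i] + lr * new_feat
    return memory

mem = np.random.rand(8, 16)               # 8 slots of 16-D features
read, w = memory_read(mem, np.random.rand(16))
mem = memory_write(mem, w, np.random.rand(16))
print(read.shape, int(w.argmax()))
```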
Siamese Attentional Keypoint Network for High Performance Visual Tracking
In this paper, we investigate the impacts of three main aspects of visual
tracking, i.e., the backbone network, the attentional mechanism, and the
detection component, and propose a Siamese Attentional Keypoint Network, dubbed
SATIN, for efficient tracking and accurate localization. Firstly, a new Siamese
lightweight hourglass network is specially designed for visual tracking. It
takes advantage of the benefits of the repeated bottom-up and top-down
inference to capture more global and local contextual information at multiple
scales. Secondly, a novel cross-attentional module is utilized to leverage both
channel-wise and spatial intermediate attentional information, which can
enhance both discriminative and localization capabilities of feature maps.
Thirdly, a keypoint detection approach is introduced to trace any target object
by detecting the top-left corner point, the centroid point, and the
bottom-right corner point of its bounding box. Therefore, our SATIN tracker not
only has a strong capability to learn more effective object representations,
but is also computationally and memory efficient during both the
training and testing stages. To the best of our knowledge, we are the first to
propose this approach. Without bells and whistles, experimental results
demonstrate that our approach achieves state-of-the-art performance on several
recent benchmark datasets, at a speed far exceeding 27 frames per second.
Comment: Accepted by Knowledge-Based Systems
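Decoding a box from the three predicted keypoints is straightforward; in the small sketch below the centroid additionally serves as a consistency check against the box center, which is our illustrative addition rather than a detail stated in the abstract.

```python
import numpy as np

def bbox_from_keypoints(top_left, centroid, bottom_right):
    # Assemble (x, y, w, h) from the top-left and bottom-right corner
    # points; the centroid should fall near the box center, so its
    # distance to the center gives a cheap consistency score.
    (x1, y1), (x2, y2) = top_left, bottom_right
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    drift = np.hypot(centroid[0] - cx, centroid[1] - cy)
    return (x1, y1, x2 - x1, y2 - y1), drift

box, drift = bbox_from_keypoints((10, 12), (29, 33), (50, 56))
print(box, round(drift, 2))   # box and centroid-center disagreement
```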