Spatial-Temporal Relation Networks for Multi-Object Tracking
Recent progress in multiple object tracking (MOT) has shown that a robust
similarity score is key to the success of trackers. A good similarity score is
expected to reflect multiple cues, e.g. appearance, location, and topology,
over a long period of time. However, these cues are heterogeneous, making them
hard to combine in a unified network. As a result, existing methods usually
encode them in separate networks or require a complex training approach. In
this paper, we present a unified framework for similarity measurement that
can simultaneously encode various cues and perform reasoning across both the
spatial and temporal domains. We also study the feature representation of a
tracklet-object pair in depth, showing that a proper design of the pair features can
well empower the trackers. The resulting approach is named spatial-temporal
relation networks (STRN). It runs in a feed-forward way and can be trained in
an end-to-end manner. It achieves state-of-the-art accuracy on all of the
MOT15-17 benchmarks under the public detection and online settings.
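The abstract above is about fusing heterogeneous cues (appearance, location) into one similarity score for tracklet-object pairs. A minimal hand-weighted sketch of such a fusion is below; STRN learns this fusion end-to-end with a relation network, whereas the fixed weights `w_app` and `w_loc` here are illustrative assumptions only.

```python
import math

def iou(box_a, box_b):
    # boxes as (x1, y1, x2, y2); intersection-over-union as a location cue
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def cosine(u, v):
    # cosine similarity between two appearance feature vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def similarity(track_feat, track_box, det_feat, det_box,
               w_app=0.7, w_loc=0.3):
    # weighted fusion of appearance and location cues (weights are
    # hypothetical; STRN learns the combination instead)
    return w_app * cosine(track_feat, det_feat) + w_loc * iou(track_box, det_box)
```

A learned fusion replaces the fixed weights with a network conditioned on all cues at once, which is what makes the heterogeneous combination tractable.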
Intelligent Intersection: Two-Stream Convolutional Networks for Real-time Near Accident Detection in Traffic Video
In Intelligent Transportation Systems, real-time systems that monitor and
analyze road users become increasingly critical as we march toward the smart
city era. Vision-based frameworks for Object Detection, Multiple Object
Tracking, and Traffic Near Accident Detection are important applications of
Intelligent Transportation Systems, particularly in video surveillance.
Although deep neural networks have recently achieved great success in many
computer vision tasks, a unified framework for all three tasks remains
challenging, as the challenges multiply: the demand for real-time
performance, complex urban settings, highly dynamic traffic events, and many
traffic movements. In this paper, we propose a two-stream Convolutional Network
architecture that performs real-time detection, tracking, and near accident
detection of road users in traffic video data. The two-stream model consists of
a spatial stream network for Object Detection and a temporal stream network to
leverage motion features for Multiple Object Tracking. We detect near accidents
by incorporating appearance features and motion features from two-stream
networks. Using aerial videos, we propose a Traffic Near Accident Dataset
(TNAD) covering various types of traffic interactions that is suitable for
vision-based traffic analysis tasks. Our experiments demonstrate the advantage
of our framework with an overall competitive qualitative and quantitative
performance at high frame rates on the TNAD dataset.
Comment: Submitted to ACM Transactions on Spatial Algorithms and Systems
(TSAS); Special issue on Urban Mobility: Algorithms and Systems. arXiv admin
note: text overlap with arXiv:1703.07402 by other authors.
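The near-accident detection described above combines appearance features from the spatial stream with motion features from the temporal stream. A minimal late-fusion sketch is below; the fusion weight `alpha` and the `threshold` are illustrative assumptions, not values from the paper.

```python
def fuse_scores(spatial_score, temporal_score, alpha=0.5):
    # late fusion of the two streams' per-frame confidences
    # (alpha is a hypothetical mixing weight)
    return alpha * spatial_score + (1 - alpha) * temporal_score

def detect_near_accident(spatial_scores, temporal_scores, threshold=0.8):
    # flag the frame indices where the fused score crosses the threshold
    return [i for i, (s, t) in enumerate(zip(spatial_scores, temporal_scores))
            if fuse_scores(s, t) >= threshold]
```

Late fusion keeps the two streams independent at inference time, which helps meet the real-time constraint the abstract emphasizes.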
Deep Affinity Network for Multiple Object Tracking
Multiple Object Tracking (MOT) plays an important role in solving many
fundamental problems in video analysis in computer vision. Most MOT methods
employ two steps: Object Detection and Data Association. The first step detects
objects of interest in every frame of a video, and the second establishes
correspondence between the detected objects in different frames to obtain their
tracks. Object detection has made tremendous progress in the last few years due
to deep learning. However, data association for tracking still relies on
hand-crafted constraints, such as appearance, motion, spatial proximity, and
grouping, to compute affinities between the objects in different frames. In this
paper, we harness the power of deep learning for data association in tracking
by jointly modelling object appearances and their affinities between different
frames in an end-to-end fashion. The proposed Deep Affinity Network (DAN)
learns compact yet comprehensive features of pre-detected objects at several
levels of abstraction, and performs exhaustive pairing permutations of those
features in any two frames to infer object affinities. DAN also accounts for
multiple objects appearing and disappearing between video frames. We exploit
the resulting efficient affinity computations to associate objects in the
current frame deep into the previous frames for reliable on-line tracking. Our
technique is evaluated on popular multiple object tracking challenges MOT15,
MOT17 and UA-DETRAC. Comprehensive benchmarking under twelve evaluation metrics
demonstrates that our approach is among the best performing techniques on the
leader board for these challenges. The open source implementation of our work
is available at https://github.com/shijieS/SST.git.
Comment: To appear in IEEE TPAMI.
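The exhaustive pairing the abstract describes amounts to building an affinity matrix between the detections of two frames, with an extra row and column so that objects can appear or disappear. A minimal sketch is below; DAN predicts these entries with a learned network, while here a plain dot product and a hypothetical `unmatched_cost` stand in.

```python
def affinity_matrix(feats_a, feats_b, unmatched_cost=0.1):
    # exhaustive pairwise similarities between detections of two frames;
    # the dot product is a stand-in for DAN's learned affinity predictor
    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))
    m = [[dot(fa, fb) for fb in feats_b] for fa in feats_a]
    # extra column for disappearing objects and extra row for appearing
    # ones, mirroring how DAN accounts for entries and exits
    for row in m:
        row.append(unmatched_cost)
    m.append([unmatched_cost] * (len(feats_b) + 1))
    return m
```

Solving an assignment over this augmented matrix yields the frame-to-frame associations used for on-line tracking.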
T-CNN: Tubelets with Convolutional Neural Networks for Object Detection from Videos
The state-of-the-art performance for object detection has been significantly
improved over the past two years. Besides the introduction of powerful deep
neural networks such as GoogleNet and VGG, novel object detection frameworks
such as R-CNN and its successors, Fast R-CNN and Faster R-CNN, play an
essential role in improving the state-of-the-art. Despite their effectiveness
on still images, those frameworks are not specifically designed for object
detection from videos. Temporal and contextual information of videos are not
fully investigated and utilized. In this work, we propose a deep learning
framework that incorporates temporal and contextual information from tubelets
obtained in videos, which dramatically improves the baseline performance of
existing still-image detection frameworks when they are applied to videos. It
is called T-CNN, i.e. tubelets with convolutional neural networks. The
proposed framework won the recently introduced object-detection-from-video
(VID) task with provided data in the ImageNet Large-Scale Visual Recognition
Challenge 2015 (ILSVRC2015).
Comment: ImageNet 2015 VID challenge tech report. The first two authors share
co-first authorship. Accepted as a Transaction paper by T-CSVT Special Issue
on Large Scale and Nonlinear Similarity Learning for Intelligent Video
Analysis.
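One simple way to exploit temporal information along a tubelet, in the spirit of the abstract above, is to smooth per-frame detection confidences across neighbouring frames. The moving average below is a crude, hypothetical stand-in for T-CNN's temporal re-scoring, shown only to make the idea concrete.

```python
def smooth_tubelet_scores(scores, window=3):
    # temporal smoothing of per-frame confidences along one tubelet;
    # each score is averaged with its neighbours inside the window
    half = window // 2
    out = []
    for i in range(len(scores)):
        lo, hi = max(0, i - half), min(len(scores), i + half + 1)
        out.append(sum(scores[lo:hi]) / (hi - lo))
    return out
```

Smoothing suppresses one-frame false positives and recovers detections that flicker off for a single frame, which is exactly the kind of temporal consistency still-image detectors lack.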
A Hybrid Data Association Framework for Robust Online Multi-Object Tracking
Global optimization algorithms have shown impressive performance in
data-association based multi-object tracking, but handling online data remains
a difficult hurdle to overcome. In this paper, we present a hybrid data
association framework with a min-cost multi-commodity network flow for robust
online multi-object tracking. We build local target-specific models interleaved
with global optimization of the optimal data association over multiple video
frames. More specifically, in the min-cost multi-commodity network flow, the
target-specific similarities are learned online to enforce local
consistency, reducing the complexity of the global data association.
Meanwhile, the global data association taking multiple video frames into
account alleviates irrecoverable errors caused by the local data association
between adjacent frames. To ensure the efficiency of online tracking, we give
an efficient near-optimal solution to the proposed min-cost multi-commodity
flow problem, and provide the empirical proof of its sub-optimality. The
comprehensive experiments on real data demonstrate the superior tracking
performance of our approach in various challenging situations.
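The paper above trades optimality for speed with a near-optimal solver. A greedy assignment over a cost matrix, sketched below, is a much simpler stand-in for that idea (it is not the paper's multi-commodity flow solver); `max_cost` is a hypothetical gating threshold.

```python
def greedy_associate(cost, max_cost=1.0):
    # greedy near-optimal assignment: repeatedly take the cheapest
    # remaining (track, detection) pair, skipping pairs whose cost
    # exceeds the gate; a simplification of global flow optimization
    pairs = sorted((cost[i][j], i, j)
                   for i in range(len(cost))
                   for j in range(len(cost[0])))
    used_i, used_j, matches = set(), set(), []
    for c, i, j in pairs:
        if c > max_cost:
            break
        if i not in used_i and j not in used_j:
            matches.append((i, j))
            used_i.add(i)
            used_j.add(j)
    return matches
```

Greedy matching is O(nm log nm) and often close to the optimum when the cost matrix is well separated, which is why near-optimal solvers are attractive for online tracking.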
PointIT: A Fast Tracking Framework Based on 3D Instance Segmentation
Most recent tracking frameworks focus on 2D image sequences and seldom
track 3D objects in point clouds. In this paper, we propose PointIT,
a fast, simple tracking method based on 3D on-road instance segmentation.
Firstly, we transform 3D LiDAR data into a spherical image of size
64 x 512 x 4 and feed it into an instance segmentation model to obtain the
predicted instance mask for each class. Then we use MobileNet as our primary encoder
instead of the original ResNet to reduce the computational complexity. Finally,
we extend the SORT algorithm with this instance framework to realize tracking
in the 3D LiDAR point cloud data. The model is trained on the spherical images
dataset with the corresponding instance label masks provided by the KITTI
3D Object Tracking dataset. According to the experimental results, our network
achieves an Average Precision (AP) of 0.617, and the performance of the
multi-object tracking task is also improved.
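The spherical projection the abstract mentions maps each LiDAR point to a cell of the 64 x 512 range image by its azimuth and elevation. A minimal sketch is below; the vertical field of view (`fov_up`, `fov_down`) is a guess based on common 64-beam sensors, not a value from the paper.

```python
import math

def spherical_project(x, y, z, rows=64, cols=512,
                      fov_up=3.0, fov_down=-25.0):
    # map one LiDAR point to a (row, col) cell of the spherical image;
    # fov_up/fov_down (degrees) are assumed sensor parameters
    yaw = math.atan2(y, x)                      # azimuth in [-pi, pi]
    r = math.sqrt(x * x + y * y + z * z)
    pitch = math.asin(z / r)                    # elevation
    col = int((0.5 * (1.0 - yaw / math.pi)) * cols) % cols
    fov = math.radians(fov_up - fov_down)
    row = int((1.0 - (pitch - math.radians(fov_down)) / fov) * rows)
    return min(max(row, 0), rows - 1), min(max(col, 0), cols - 1)
```

Stacking per-cell channels (e.g. range, intensity, x, y) then yields the 64 x 512 x 4 input tensor described in the abstract.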
Learning a Robust Society of Tracking Parts using Co-occurrence Constraints
Object tracking is an essential problem in computer vision that has been
researched for several decades. One of the main challenges in tracking is to
adapt to object appearance changes over time while avoiding drifting to
background clutter. We address this challenge by proposing a deep neural
network composed of different parts, which functions as a society of tracking
parts. They work in conjunction according to a certain policy and learn from
each other in a robust manner, using co-occurrence constraints that ensure
robust inference and learning. From a structural point of view, our network is
composed of two main pathways. One pathway is more conservative. It carefully
monitors a large set of simple tracker parts learned as linear filters over
deep feature activation maps. It assigns the parts different roles. It promotes
the reliable ones and removes the inconsistent ones. We learn these filters
simultaneously in an efficient way, with a single closed-form formulation, for
which we propose novel theoretical properties. The second pathway is more
progressive. It is learned completely online and thus it is able to better
model object appearance changes. In order to adapt in a robust manner, it is
learned only on highly confident frames, which are decided using co-occurrences
with the first pathway. Thus, our system has the full benefit of two main
approaches in tracking. The larger set of simpler filter parts offers
robustness, while the full deep network learned online provides adaptability to
change. As shown in the experimental section, our approach achieves state of
the art performance on the challenging VOT17 benchmark, outperforming the
published methods both on the general EAO metric and in the number of fails, by
a significant margin.
Comment: 17+3 pages, 5 figures, European Conference on Computer Vision (ECCV),
Visual Object Tracking workshop.
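The conservative pathway above promotes reliable parts and removes inconsistent ones based on co-occurrence with the ensemble. The selection step can be sketched as ranking parts by how often they agree with the ensemble prediction; the `keep` ratio below is an illustrative assumption, not the paper's policy.

```python
def promote_reliable_parts(agreement, keep=0.5):
    # agreement[i]: fraction of frames where part i's response co-occurs
    # with the ensemble prediction; keep the top fraction of parts
    ranked = sorted(range(len(agreement)),
                    key=lambda i: agreement[i], reverse=True)
    n_keep = max(1, int(len(ranked) * keep))
    return sorted(ranked[:n_keep])
```

Pruning parts that disagree with the consensus is what lets the online pathway adapt aggressively without dragging the whole tracker into background clutter.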
Frame-wise Motion and Appearance for Real-time Multiple Object Tracking
The main challenge of Multiple Object Tracking (MOT) is the efficiency in
associating an indefinite number of objects between video frames. Standard
motion estimators used in tracking, e.g., Long Short-Term Memory (LSTM), only
deal with a single object, while Re-IDentification (Re-ID) based approaches
exhaustively compare object appearances. Both approaches are computationally
costly when they are scaled to a large number of objects, making it very
difficult for real-time MOT. To address these problems, we propose a highly
efficient Deep Neural Network (DNN) that simultaneously models association
among an indefinite number of objects. The inference computation of the DNN does
not increase with the number of objects. Our approach, Frame-wise Motion and
Appearance (FMA), computes the Frame-wise Motion Fields (FMF) between two
frames, which leads to very fast and reliable matching among a large number of
object bounding boxes. Frame-wise Appearance Features (FAF), learned in
parallel with the FMFs, serve as auxiliary information to resolve uncertain
matches. Extensive experiments on the MOT17 benchmark show that our method
achieves real-time MOT with results competitive with state-of-the-art
approaches.
Comment: 13 pages, 4 figures, 4 tables.
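The motion-field matching the abstract describes can be sketched by shifting each previous box with its predicted motion vector and matching to the current box with the highest overlap. This is a minimal stand-in for FMF-based matching; the `iou_thresh` gate is an illustrative assumption.

```python
def match_by_motion(prev_boxes, motions, cur_boxes, iou_thresh=0.5):
    # shift each previous box (x1, y1, x2, y2) by its per-object motion
    # vector (dx, dy), then greedily match to the current box with the
    # highest IoU overlap
    def iou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        ua = ((a[2] - a[0]) * (a[3] - a[1])
              + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / ua if ua else 0.0
    matches = []
    for i, (box, (dx, dy)) in enumerate(zip(prev_boxes, motions)):
        shifted = (box[0] + dx, box[1] + dy, box[2] + dx, box[3] + dy)
        best = max(range(len(cur_boxes)),
                   key=lambda j: iou(shifted, cur_boxes[j]), default=None)
        if best is not None and iou(shifted, cur_boxes[best]) >= iou_thresh:
            matches.append((i, best))
    return matches
```

Because the motion field is computed once per frame pair rather than per object pair, this matching stays cheap as the number of boxes grows.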
Kernalised Multi-resolution Convnet for Visual Tracking
Visual tracking is intrinsically a temporal problem. Discriminative
Correlation Filters (DCF) have demonstrated excellent performance for
high-speed generic visual object tracking. Built upon their seminal work, there
has been a plethora of recent improvements relying on convolutional neural
network (CNN) pretrained on ImageNet as a feature extractor for visual
tracking. However, most of these works rely on ad hoc analysis to design the
weights for different layers, using either boosting or hedging techniques as an
ensemble tracker. In this paper, we go beyond the conventional DCF framework
and propose a Kernalised Multi-resolution Convnet (KMC) formulation that
utilises hierarchical response maps to directly output the target movement.
When the learnt network is directly deployed on the unseen and challenging UAV
tracking dataset without any weight adjustment, the proposed model consistently
achieves excellent tracking performance. Moreover, the transferred
multi-resolution CNN renders it possible to be integrated into the RNN temporal
learning framework, therefore opening the door to end-to-end temporal deep
learning (TDL) for visual tracking.
Comment: CVPRW 2017.
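For context, conventional DCF trackers estimate the target shift from the peak of a correlation response map, which is the baseline the abstract goes beyond (KMC regresses the movement directly from multi-resolution response maps instead). The peak-picking baseline can be sketched as:

```python
def displacement_from_response(resp):
    # take the peak of a 2D correlation response map as the target shift,
    # measured relative to the map centre
    rows, cols = len(resp), len(resp[0])
    _, r, c = max((resp[r][c], r, c)
                  for r in range(rows) for c in range(cols))
    return r - rows // 2, c - cols // 2
```

Regressing the displacement from the whole response map, rather than reading off a single peak, is what lets KMC combine evidence across resolutions.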
Track Everything: Limiting Prior Knowledge in Online Multi-Object Recognition
This paper addresses the problem of online tracking and classification of
multiple objects in an image sequence. Our proposed solution is to first track
all objects in the scene without relying on object-specific prior knowledge,
which in other systems can take the form of hand-crafted features or user-based
track initialization. We then classify the tracked objects with a fast-learning
image classifier that is based on a shallow convolutional neural network
architecture and demonstrate that object recognition improves when this is
combined with object state information from the tracking algorithm. We argue
that by transferring the use of prior knowledge from the detection and tracking
stages to the classification stage we can design a robust, general purpose
object recognition system with the ability to detect and track a variety of
object types. We describe our biologically inspired implementation, which
adaptively learns the shape and motion of tracked objects, and apply it to the
Neovision2 Tower benchmark data set, which contains multiple object types. An
experimental evaluation demonstrates that our approach is competitive with
state-of-the-art video object recognition systems that do make use of
object-specific prior knowledge in detection and tracking, while providing
additional practical advantages by virtue of its generality.
Comment: 15 pages.