Spatial-Temporal Relation Networks for Multi-Object Tracking
Recent progress in multiple object tracking (MOT) has shown that a robust
similarity score is key to the success of trackers. A good similarity score is
expected to reflect multiple cues, e.g. appearance, location, and topology,
over a long period of time. However, these cues are heterogeneous, making them
hard to combine in a unified network. As a result, existing methods usually
encode them in separate networks or require a complex training approach. In
this paper, we present a unified framework for similarity measurement that
can simultaneously encode various cues and perform reasoning across both the
spatial and temporal domains. We also study the feature representation of a
tracklet-object pair in depth, showing that a proper design of the pair features can
well empower the trackers. The resulting approach is named spatial-temporal
relation networks (STRN). It runs in a feed-forward way and can be trained in
an end-to-end manner. It achieves state-of-the-art accuracy on all of the
MOT15-17 benchmarks under the public detection and online settings.
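The abstract above is about fusing heterogeneous cues (appearance, location) into one similarity score for tracklet-object pairs. A minimal hand-weighted sketch of such a fusion is below; STRN learns this fusion end-to-end with a relation network, whereas the fixed weights `w_app` and `w_loc` here are illustrative assumptions only.

```python
import math

def iou(box_a, box_b):
    # boxes as (x1, y1, x2, y2); intersection-over-union as a location cue
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def cosine(u, v):
    # cosine similarity between two appearance feature vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def similarity(track_feat, track_box, det_feat, det_box,
               w_app=0.7, w_loc=0.3):
    # weighted fusion of appearance and location cues (weights are
    # hypothetical; STRN learns the combination instead)
    return w_app * cosine(track_feat, det_feat) + w_loc * iou(track_box, det_box)
```

A learned fusion replaces the fixed weights with a network conditioned on all cues at once, which is what makes the heterogeneous combination tractable.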
Intelligent Intersection: Two-Stream Convolutional Networks for Real-time Near Accident Detection in Traffic Video
In Intelligent Transportation Systems, real-time systems that monitor and
analyze road users become increasingly critical as we march toward the smart
city era. Vision-based frameworks for Object Detection, Multiple Object
Tracking, and Traffic Near Accident Detection are important applications of
Intelligent Transportation Systems, particularly in video surveillance.
Although deep neural networks have recently achieved great success in many
computer vision tasks, a unified framework for all three tasks remains
challenging, as the challenges multiply: the demand for real-time
performance, complex urban settings, highly dynamic traffic events, and many
traffic movements. In this paper, we propose a two-stream Convolutional Network
architecture that performs real-time detection, tracking, and near accident
detection of road users in traffic video data. The two-stream model consists of
a spatial stream network for Object Detection and a temporal stream network to
leverage motion features for Multiple Object Tracking. We detect near accidents
by incorporating appearance features and motion features from two-stream
networks. Using aerial videos, we propose a Traffic Near Accident Dataset
(TNAD) covering various types of traffic interactions that is suitable for
vision-based traffic analysis tasks. Our experiments demonstrate the advantage
of our framework with an overall competitive qualitative and quantitative
performance at high frame rates on the TNAD dataset.
Comment: Submitted to ACM Transactions on Spatial Algorithms and Systems
(TSAS); Special issue on Urban Mobility: Algorithms and Systems. arXiv admin
note: text overlap with arXiv:1703.07402 by other authors.
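The near-accident detection described above combines appearance features from the spatial stream with motion features from the temporal stream. A minimal late-fusion sketch is below; the fusion weight `alpha` and the `threshold` are illustrative assumptions, not values from the paper.

```python
def fuse_scores(spatial_score, temporal_score, alpha=0.5):
    # late fusion of the two streams' per-frame confidences
    # (alpha is a hypothetical mixing weight)
    return alpha * spatial_score + (1 - alpha) * temporal_score

def detect_near_accident(spatial_scores, temporal_scores, threshold=0.8):
    # flag the frame indices where the fused score crosses the threshold
    return [i for i, (s, t) in enumerate(zip(spatial_scores, temporal_scores))
            if fuse_scores(s, t) >= threshold]
```

Late fusion keeps the two streams independent at inference time, which helps meet the real-time constraint the abstract emphasizes.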
Deep Affinity Network for Multiple Object Tracking
Multiple Object Tracking (MOT) plays an important role in solving many
fundamental problems in video analysis in computer vision. Most MOT methods
employ two steps: Object Detection and Data Association. The first step detects
objects of interest in every frame of a video, and the second establishes
correspondence between the detected objects in different frames to obtain their
tracks. Object detection has made tremendous progress in the last few years due
to deep learning. However, data association for tracking still relies on
hand-crafted constraints, such as appearance, motion, spatial proximity, and
grouping, to compute affinities between the objects in different frames. In this
paper, we harness the power of deep learning for data association in tracking
by jointly modelling object appearances and their affinities between different
frames in an end-to-end fashion. The proposed Deep Affinity Network (DAN)
learns compact yet comprehensive features of pre-detected objects at several
levels of abstraction, and performs exhaustive pairing permutations of those
features in any two frames to infer object affinities. DAN also accounts for
multiple objects appearing and disappearing between video frames. We exploit
the resulting efficient affinity computations to associate objects in the
current frame deep into the previous frames for reliable on-line tracking. Our
technique is evaluated on popular multiple object tracking challenges MOT15,
MOT17 and UA-DETRAC. Comprehensive benchmarking under twelve evaluation metrics
demonstrates that our approach is among the best performing techniques on the
leader board for these challenges. The open source implementation of our work
is available at https://github.com/shijieS/SST.git.
Comment: To appear in IEEE TPAMI.
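The exhaustive pairing the abstract describes amounts to building an affinity matrix between the detections of two frames, with an extra row and column so that objects can appear or disappear. A minimal sketch is below; DAN predicts these entries with a learned network, while here a plain dot product and a hypothetical `unmatched_cost` stand in.

```python
def affinity_matrix(feats_a, feats_b, unmatched_cost=0.1):
    # exhaustive pairwise similarities between detections of two frames;
    # the dot product is a stand-in for DAN's learned affinity predictor
    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))
    m = [[dot(fa, fb) for fb in feats_b] for fa in feats_a]
    # extra column for disappearing objects and extra row for appearing
    # ones, mirroring how DAN accounts for entries and exits
    for row in m:
        row.append(unmatched_cost)
    m.append([unmatched_cost] * (len(feats_b) + 1))
    return m
```

Solving an assignment over this augmented matrix yields the frame-to-frame associations used for on-line tracking.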
T-CNN: Tubelets with Convolutional Neural Networks for Object Detection from Videos
The state-of-the-art performance for object detection has been significantly
improved over the past two years. Besides the introduction of powerful deep
neural networks such as GoogleNet and VGG, novel object detection frameworks
such as R-CNN and its successors, Fast R-CNN and Faster R-CNN, play an
essential role in improving the state-of-the-art. Despite their effectiveness
on still images, those frameworks are not specifically designed for object
detection from videos. Temporal and contextual information of videos are not
fully investigated and utilized. In this work, we propose a deep learning
framework that incorporates temporal and contextual information from tubelets
obtained in videos, which dramatically improves the baseline performance of
existing still-image detection frameworks when they are applied to videos. It
is called T-CNN, i.e. tubelets with convolutional neural networks. The
proposed framework won the recently introduced object-detection-from-video
(VID) task with provided data in the ImageNet Large-Scale Visual Recognition
Challenge 2015 (ILSVRC2015).
Comment: ImageNet 2015 VID challenge tech report. The first two authors share
co-first authorship. Accepted as a Transaction paper by T-CSVT Special Issue
on Large Scale and Nonlinear Similarity Learning for Intelligent Video
Analysis.
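One simple way to exploit temporal information along a tubelet, in the spirit of the abstract above, is to smooth per-frame detection confidences across neighbouring frames. The moving average below is a crude, hypothetical stand-in for T-CNN's temporal re-scoring, shown only to make the idea concrete.

```python
def smooth_tubelet_scores(scores, window=3):
    # temporal smoothing of per-frame confidences along one tubelet;
    # each score is averaged with its neighbours inside the window
    half = window // 2
    out = []
    for i in range(len(scores)):
        lo, hi = max(0, i - half), min(len(scores), i + half + 1)
        out.append(sum(scores[lo:hi]) / (hi - lo))
    return out
```

Smoothing suppresses one-frame false positives and recovers detections that flicker off for a single frame, which is exactly the kind of temporal consistency still-image detectors lack.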
A Hybrid Data Association Framework for Robust Online Multi-Object Tracking
Global optimization algorithms have shown impressive performance in
data-association based multi-object tracking, but handling online data remains
a difficult hurdle to overcome. In this paper, we present a hybrid data
association framework with a min-cost multi-commodity network flow for robust
online multi-object tracking. We build local target-specific models interleaved
with global optimization of the optimal data association over multiple video
frames. More specifically, in the min-cost multi-commodity network flow, the
target-specific similarities are learned online to enforce local
consistency, reducing the complexity of the global data association.
Meanwhile, the global data association taking multiple video frames into
account alleviates irrecoverable errors caused by the local data association
between adjacent frames. To ensure the efficiency of online tracking, we give
an efficient near-optimal solution to the proposed min-cost multi-commodity
flow problem, and provide the empirical proof of its sub-optimality. The
comprehensive experiments on real data demonstrate the superior tracking
performance of our approach in various challenging situations.
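The paper above trades optimality for speed with a near-optimal solver. A greedy assignment over a cost matrix, sketched below, is a much simpler stand-in for that idea (it is not the paper's multi-commodity flow solver); `max_cost` is a hypothetical gating threshold.

```python
def greedy_associate(cost, max_cost=1.0):
    # greedy near-optimal assignment: repeatedly take the cheapest
    # remaining (track, detection) pair, skipping pairs whose cost
    # exceeds the gate; a simplification of global flow optimization
    pairs = sorted((cost[i][j], i, j)
                   for i in range(len(cost))
                   for j in range(len(cost[0])))
    used_i, used_j, matches = set(), set(), []
    for c, i, j in pairs:
        if c > max_cost:
            break
        if i not in used_i and j not in used_j:
            matches.append((i, j))
            used_i.add(i)
            used_j.add(j)
    return matches
```

Greedy matching is O(nm log nm) and often close to the optimum when the cost matrix is well separated, which is why near-optimal solvers are attractive for online tracking.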
PointIT: A Fast Tracking Framework Based on 3D Instance Segmentation
Most recent tracking frameworks focus on 2D image sequences and seldom
track 3D objects in point clouds. In this paper, we propose PointIT,
a fast, simple tracking method based on 3D on-road instance segmentation.
Firstly, we transform 3D LiDAR data into a spherical image of size
64 x 512 x 4 and feed it into an instance segmentation model to obtain the
predicted instance mask for each class. Then we use MobileNet as our primary encoder
instead of the original ResNet to reduce the computational complexity. Finally,
we extend the SORT algorithm with this instance framework to realize tracking
in the 3D LiDAR point cloud data. The model is trained on the spherical images
dataset with the corresponding instance label masks provided by the KITTI
3D Object Tracking dataset. According to the experimental results, our network
achieves an Average Precision (AP) of 0.617, and the performance of the
multi-object tracking task is also improved.
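The spherical projection the abstract mentions maps each LiDAR point to a cell of the 64 x 512 range image by its azimuth and elevation. A minimal sketch is below; the vertical field of view (`fov_up`, `fov_down`) is a guess based on common 64-beam sensors, not a value from the paper.

```python
import math

def spherical_project(x, y, z, rows=64, cols=512,
                      fov_up=3.0, fov_down=-25.0):
    # map one LiDAR point to a (row, col) cell of the spherical image;
    # fov_up/fov_down (degrees) are assumed sensor parameters
    yaw = math.atan2(y, x)                      # azimuth in [-pi, pi]
    r = math.sqrt(x * x + y * y + z * z)
    pitch = math.asin(z / r)                    # elevation
    col = int((0.5 * (1.0 - yaw / math.pi)) * cols) % cols
    fov = math.radians(fov_up - fov_down)
    row = int((1.0 - (pitch - math.radians(fov_down)) / fov) * rows)
    return min(max(row, 0), rows - 1), min(max(col, 0), cols - 1)
```

Stacking per-cell channels (e.g. range, intensity, x, y) then yields the 64 x 512 x 4 input tensor described in the abstract.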
Learning a Robust Society of Tracking Parts using Co-occurrence Constraints
Object tracking is an essential problem in computer vision that has been
researched for several decades. One of the main challenges in tracking is to
adapt to object appearance changes over time while avoiding drifting to
background clutter. We address this challenge by proposing a deep neural
network composed of different parts, which functions as a society of tracking
parts. They work in conjunction according to a certain policy and learn from
each other in a robust manner, using co-occurrence constraints that ensure
robust inference and learning. From a structural point of view, our network is
composed of two main pathways. One pathway is more conservative. It carefully
monitors a large set of simple tracker parts learned as linear filters over
deep feature activation maps. It assigns the parts different roles. It promotes
the reliable ones and removes the inconsistent ones. We learn these filters
simultaneously in an efficient way, with a single closed-form formulation, for
which we propose novel theoretical properties. The second pathway is more
progressive. It is learned completely online and thus it is able to better
model object appearance changes. In order to adapt in a robust manner, it is
learned only on highly confident frames, which are decided using co-occurrences
with the first pathway. Thus, our system has the full benefit of two main
approaches in tracking. The larger set of simpler filter parts offers
robustness, while the full deep network learned online provides adaptability to
change. As shown in the experimental section, our approach achieves state of
the art performance on the challenging VOT17 benchmark, outperforming the
published methods both on the general EAO metric and in the number of fails, by
a significant margin.
Comment: 17+3 pages, 5 figures, European Conference on Computer Vision (ECCV),
Visual Object Tracking workshop.
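The conservative pathway above promotes reliable parts and removes inconsistent ones based on co-occurrence with the ensemble. The selection step can be sketched as ranking parts by how often they agree with the ensemble prediction; the `keep` ratio below is an illustrative assumption, not the paper's policy.

```python
def promote_reliable_parts(agreement, keep=0.5):
    # agreement[i]: fraction of frames where part i's response co-occurs
    # with the ensemble prediction; keep the top fraction of parts
    ranked = sorted(range(len(agreement)),
                    key=lambda i: agreement[i], reverse=True)
    n_keep = max(1, int(len(ranked) * keep))
    return sorted(ranked[:n_keep])
```

Pruning parts that disagree with the consensus is what lets the online pathway adapt aggressively without dragging the whole tracker into background clutter.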
Frame-wise Motion and Appearance for Real-time Multiple Object Tracking
The main challenge of Multiple Object Tracking (MOT) is the efficiency in
associating an indefinite number of objects between video frames. Standard
motion estimators used in tracking, e.g., Long Short-Term Memory (LSTM), only
deal with a single object, while Re-IDentification (Re-ID) based approaches
exhaustively compare object appearances. Both approaches are computationally
costly when they are scaled to a large number of objects, making it very
difficult for real-time MOT. To address these problems, we propose a highly
efficient Deep Neural Network (DNN) that simultaneously models association
among an indefinite number of objects. The inference computation of the DNN does
not increase with the number of objects. Our approach, Frame-wise Motion and
Appearance (FMA), computes the Frame-wise Motion Fields (FMF) between two
frames, which leads to very fast and reliable matching among a large number of
object bounding boxes. Frame-wise Appearance Features (FAF), learned in
parallel with the FMFs, serve as auxiliary information to resolve uncertain
matches. Extensive experiments on the MOT17 benchmark show that our method
achieves real-time MOT with results competitive with state-of-the-art
approaches.
Comment: 13 pages, 4 figures, 4 tables.
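The motion-field matching the abstract describes can be sketched by shifting each previous box with its predicted motion vector and matching to the current box with the highest overlap. This is a minimal stand-in for FMF-based matching; the `iou_thresh` gate is an illustrative assumption.

```python
def match_by_motion(prev_boxes, motions, cur_boxes, iou_thresh=0.5):
    # shift each previous box (x1, y1, x2, y2) by its per-object motion
    # vector (dx, dy), then greedily match to the current box with the
    # highest IoU overlap
    def iou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        ua = ((a[2] - a[0]) * (a[3] - a[1])
              + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / ua if ua else 0.0
    matches = []
    for i, (box, (dx, dy)) in enumerate(zip(prev_boxes, motions)):
        shifted = (box[0] + dx, box[1] + dy, box[2] + dx, box[3] + dy)
        best = max(range(len(cur_boxes)),
                   key=lambda j: iou(shifted, cur_boxes[j]), default=None)
        if best is not None and iou(shifted, cur_boxes[best]) >= iou_thresh:
            matches.append((i, best))
    return matches
```

Because the motion field is computed once per frame pair rather than per object pair, this matching stays cheap as the number of boxes grows.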
Kernalised Multi-resolution Convnet for Visual Tracking
Visual tracking is intrinsically a temporal problem. Discriminative
Correlation Filters (DCF) have demonstrated excellent performance for
high-speed generic visual object tracking. Built upon their seminal work, there
has been a plethora of recent improvements relying on convolutional neural
network (CNN) pretrained on ImageNet as a feature extractor for visual
tracking. However, most of these works rely on ad hoc analysis to design the
weights for different layers, using either boosting or hedging techniques as an
ensemble tracker. In this paper, we go beyond the conventional DCF framework
and propose a Kernalised Multi-resolution Convnet (KMC) formulation that
utilises hierarchical response maps to directly output the target movement.
When the learnt network is directly deployed on the unseen and challenging UAV
tracking dataset without any weight adjustment, the proposed model consistently
achieves excellent tracking performance. Moreover, the transferred
multi-resolution CNN renders it possible to be integrated into the RNN temporal
learning framework, therefore opening the door to end-to-end temporal deep
learning (TDL) for visual tracking.
Comment: CVPRW 2017.
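For context, conventional DCF trackers estimate the target shift from the peak of a correlation response map, which is the baseline the abstract goes beyond (KMC regresses the movement directly from multi-resolution response maps instead). The peak-picking baseline can be sketched as:

```python
def displacement_from_response(resp):
    # take the peak of a 2D correlation response map as the target shift,
    # measured relative to the map centre
    rows, cols = len(resp), len(resp[0])
    _, r, c = max((resp[r][c], r, c)
                  for r in range(rows) for c in range(cols))
    return r - rows // 2, c - cols // 2
```

Regressing the displacement from the whole response map, rather than reading off a single peak, is what lets KMC combine evidence across resolutions.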
Track Everything: Limiting Prior Knowledge in Online Multi-Object Recognition
This paper addresses the problem of online tracking and classification of
multiple objects in an image sequence. Our proposed solution is to first track
all objects in the scene without relying on object-specific prior knowledge,
which in other systems can take the form of hand-crafted features or user-based
track initialization. We then classify the tracked objects with a fast-learning
image classifier that is based on a shallow convolutional neural network
architecture and demonstrate that object recognition improves when this is
combined with object state information from the tracking algorithm. We argue
that by transferring the use of prior knowledge from the detection and tracking
stages to the classification stage we can design a robust, general purpose
object recognition system with the ability to detect and track a variety of
object types. We describe our biologically inspired implementation, which
adaptively learns the shape and motion of tracked objects, and apply it to the
Neovision2 Tower benchmark data set, which contains multiple object types. An
experimental evaluation demonstrates that our approach is competitive with
state-of-the-art video object recognition systems that do make use of
object-specific prior knowledge in detection and tracking, while providing
additional practical advantages by virtue of its generality.
Comment: 15 pages.