ReXCam: Resource-Efficient, Cross-Camera Video Analytics at Scale
Enterprises are increasingly deploying large camera networks for video
analytics. Many target applications entail a common problem template: searching
for and tracking an object or activity of interest (e.g. a speeding vehicle, a
break-in) through a large camera network in live video. Such cross-camera
analytics is compute and data intensive, with cost growing with the number of
cameras and time. To address this cost challenge, we present ReXCam, a new
system for efficient cross-camera video analytics. ReXCam exploits spatial and
temporal locality in the dynamics of real camera networks to guide its
inference-time search for a query identity. In an offline profiling phase,
ReXCam builds a cross-camera correlation model that encodes the locality
observed in historical traffic patterns. At inference time, ReXCam applies this
model to filter frames that are not spatially and temporally correlated with
the query identity's current position. In cases of occasional missed
detections, ReXCam performs a fast-replay search on recently filtered video
frames, enabling graceful recovery. Together, these techniques allow ReXCam
to reduce compute workload by 8.3x on an 8-camera dataset, and by 23x - 38x on
a simulated 130-camera dataset. ReXCam has been implemented and deployed on a
testbed of 5 AWS DeepLens cameras.
Comment: 15 page
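The spatial and temporal filtering described in the ReXCam abstract above can be illustrated with a minimal Python sketch. The abstract does not specify the form of the correlation model, so everything here is an assumption made for illustration: the `TravelWindow` record, the `should_search` function, the `min_probability` threshold, and the sample numbers are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TravelWindow:
    """Hypothetical profile entry for one (source, destination) camera pair."""
    probability: float   # fraction of historical traffic going source -> destination
    min_seconds: float   # shortest travel time seen in the profiling data
    max_seconds: float   # longest travel time seen in the profiling data


def should_search(model, src_cam, dst_cam, elapsed_seconds, min_probability=0.05):
    """Return True if a frame from dst_cam is worth running inference on.

    src_cam is the camera where the query identity was last seen and
    elapsed_seconds is the time since that sighting.
    """
    window = model.get((src_cam, dst_cam))
    # Spatial filter: skip camera pairs that rarely (or never) exchange traffic.
    if window is None or window.probability < min_probability:
        return False
    # Temporal filter: skip frames outside the historical travel-time range.
    return window.min_seconds <= elapsed_seconds <= window.max_seconds


# Made-up profile for a three-camera network.
model = {
    ("cam1", "cam2"): TravelWindow(0.60, 5.0, 20.0),
    ("cam1", "cam3"): TravelWindow(0.02, 30.0, 90.0),
}
```

With this made-up profile, a frame from `cam2` ten seconds after a sighting on `cam1` is kept, while frames from `cam3`, or from `cam2` after forty seconds, are filtered out.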
Robust Visual Tracking using Multi-Frame Multi-Feature Joint Modeling
It remains a huge challenge to design effective and efficient trackers under
complex scenarios, including occlusions, illumination changes and pose
variations. To cope with this problem, a promising solution is to integrate the
temporal consistency across consecutive frames and multiple feature cues in a
unified model. Motivated by this idea, we propose a novel correlation
filter-based tracker in this work, in which the temporal relatedness is
reconciled under a multi-task learning framework and the multiple feature cues
are modeled using a multi-view learning approach. We demonstrate that the resulting
regression model can be efficiently learned by exploiting the structure of
a blockwise diagonal matrix. A fast blockwise diagonal matrix inversion algorithm
is developed thereafter for efficient online tracking. Meanwhile, we
incorporate an adaptive scale estimation mechanism to strengthen the stability
of scale variation tracking. We implement our tracker using two types of
features and test it on two benchmark datasets. Experimental results
demonstrate the superiority of our proposed approach when compared with other
state-of-the-art trackers.
Project homepage: http://bmal.hust.edu.cn/project/KMF2JMTtracking.html
Comment: This paper has been accepted by IEEE Transactions on Circuits and
Systems for Video Technology. The MATLAB code of our method is available from
our project homepage http://bmal.hust.edu.cn/project/KMF2JMTtracking.htm
CityFlow: A City-Scale Benchmark for Multi-Target Multi-Camera Vehicle Tracking and Re-Identification
Urban traffic optimization using traffic cameras as sensors is driving the
need to advance state-of-the-art multi-target multi-camera (MTMC) tracking.
This work introduces CityFlow, a city-scale traffic camera dataset consisting
of more than 3 hours of synchronized HD videos from 40 cameras across 10
intersections, with the longest distance between two simultaneous cameras being
2.5 km. To the best of our knowledge, CityFlow is the largest-scale dataset in
terms of spatial coverage and the number of cameras/videos in an urban
environment. The dataset contains more than 200K annotated bounding boxes
covering a wide range of scenes, viewing angles, vehicle models, and urban
traffic flow conditions. Camera geometry and calibration information are
provided to aid spatio-temporal analysis. In addition, a subset of the
benchmark is made available for the task of image-based vehicle
re-identification (ReID). We conducted an extensive experimental evaluation of
baselines/state-of-the-art approaches in MTMC tracking, multi-target
single-camera (MTSC) tracking, object detection, and image-based ReID on this
dataset, analyzing the impact of different network architectures, loss
functions, spatio-temporal models and their combinations on task effectiveness.
An evaluation server is launched with the release of our benchmark at the 2019
AI City Challenge (https://www.aicitychallenge.org/) that allows researchers to
compare the performance of their newest techniques. We expect this dataset to
catalyze research in this field, propel the state-of-the-art forward, and lead
to deployed traffic optimization(s) in the real world.
Comment: Accepted for oral presentation at CVPR 2019 with review ratings of 2
strong accepts and 1 accept (work done during an internship at NVIDIA
Intelligent Intersection: Two-Stream Convolutional Networks for Real-time Near Accident Detection in Traffic Video
In Intelligent Transportation Systems, real-time systems that monitor and
analyze road users are becoming increasingly critical as we march toward the
smart city era. Vision-based frameworks for Object Detection, Multiple Object
Tracking, and Traffic Near Accident Detection are important applications of
Intelligent Transportation Systems, particularly in video surveillance.
Although deep neural networks have recently achieved great success in many
computer vision tasks, a unified framework for all three tasks remains
challenging, and the difficulty is compounded by the demand for real-time
performance, complex urban settings, highly dynamic traffic events, and many
traffic movements. In this paper, we propose a two-stream Convolutional Network
architecture that performs real-time detection, tracking, and near accident
detection of road users in traffic video data. The two-stream model consists of
a spatial stream network for Object Detection and a temporal stream network to
leverage motion features for Multiple Object Tracking. We detect near accidents
by incorporating appearance features and motion features from two-stream
networks. Using aerial videos, we propose a Traffic Near Accident Dataset
(TNAD) covering various types of traffic interactions that is suitable for
vision-based traffic analysis tasks. Our experiments demonstrate the advantage
of our framework with an overall competitive qualitative and quantitative
performance at high frame rates on the TNAD dataset.
Comment: Submitted to ACM Transactions on Spatial Algorithms and Systems
(TSAS); Special issue on Urban Mobility: Algorithms and Systems. arXiv admin
note: text overlap with arXiv:1703.07402 by other author
Near-Online Multi-target Tracking with Aggregated Local Flow Descriptor
In this paper, we focus on the two key aspects of multiple target tracking
problem: 1) designing an accurate affinity measure to associate detections and
2) implementing an efficient and accurate (near) online multiple target
tracking algorithm. As the first contribution, we introduce a novel Aggregated
Local Flow Descriptor (ALFD) that encodes the relative motion pattern between a
pair of temporally distant detections using long term interest point
trajectories (IPTs). Leveraging the IPTs, the ALFD provides a robust
affinity measure for estimating the likelihood of matching detections
regardless of the application scenarios. As another contribution, we present a
Near-Online Multi-target Tracking (NOMT) algorithm. The tracking problem is
formulated as a data association between targets and detections in a temporal
window, which is performed repeatedly at every frame. While being efficient,
NOMT achieves robustness by integrating multiple cues, including the ALFD metric,
target dynamics, appearance similarity, and long term trajectory regularization
into the model. Our ablative analysis verifies the superiority of the ALFD
metric over the other conventional affinity metrics. We run a comprehensive
experimental evaluation on two challenging tracking datasets, KITTI and MOT.
The NOMT method combined with the ALFD metric achieves the best accuracy
on both datasets by significant margins (about 10% higher MOTA) over the
state of the art.
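The MOTA figure quoted in the NOMT abstract above is not defined there. As a reminder, the sketch below computes it under the standard CLEAR MOT definition, MOTA = 1 - (FN + FP + IDSW) / GT, with all counts summed over every frame. The function name and the example counts are hypothetical and are not results from the paper.

```python
def mota(false_negatives, false_positives, id_switches, num_ground_truth):
    """Multiple Object Tracking Accuracy under the CLEAR MOT definition.

    All arguments are totals accumulated over every frame of a sequence.
    The value is at most 1.0 and can be negative when errors exceed the
    number of ground-truth objects.
    """
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_ground_truth


# Made-up counts: 20 errors against 100 ground-truth objects.
score = mota(false_negatives=10, false_positives=5, id_switches=5,
             num_ground_truth=100)
```

Because the three error types are simply summed, a tracker can raise its MOTA by reducing any one of them; the metric does not say which kind of error dominated.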
Multiple Hypothesis Tracking Algorithm for Multi-Target Multi-Camera Tracking with Disjoint Views
In this study, a multiple hypothesis tracking (MHT) algorithm for
multi-target multi-camera tracking (MCT) with disjoint views is proposed. Our
method forms track-hypothesis trees, in which each branch represents a
multi-camera track of a target that may move within a camera as well as
across cameras. Furthermore, multi-target tracking within a camera is performed
simultaneously with the tree formation by manipulating the status of each track
hypothesis. The status takes one of three values, each corresponding to a stage
of a multi-camera track: tracking, searching, and end-of-track. The tracking
status means the target is tracked by a single-camera tracker. In the searching
status, disappeared targets are examined to determine whether they reappear in
other cameras. The end-of-track status means the target has exited the camera
network, as judged by its lengthy invisibility. These three statuses assist MHT
in forming the track-hypothesis trees for multi-camera tracking. Furthermore, we
present a gating technique for eliminating unlikely observation-to-track
associations. In the experiments, we evaluate the proposed method using two
datasets, DukeMTMC and NLPR-MCT, and the results demonstrate that the proposed
method outperforms the state-of-the-art method in terms of accuracy. In
addition, we show that the proposed method can operate in real-time and
online.
Comment: published in IET image processing, 201
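The three track-hypothesis statuses named in the abstract above can be read as a small state machine. The sketch below is one possible interpretation, not the authors' implementation: the `next_status` function, the `max_invisible_seconds` threshold and its default value, and the choice to treat end-of-track as terminal are all assumptions made for illustration.

```python
from enum import Enum


class TrackStatus(Enum):
    TRACKING = "tracking"          # target is followed by a single-camera tracker
    SEARCHING = "searching"        # target disappeared; look for it in other cameras
    END_OF_TRACK = "end-of-track"  # target is assumed to have left the network


def next_status(status, visible, seconds_invisible, max_invisible_seconds=60.0):
    """Advance one track hypothesis by one step.

    visible: whether any camera currently observes the target.
    seconds_invisible: time since the target was last observed.
    """
    if status is TrackStatus.END_OF_TRACK:
        return status  # assumed terminal: a finished track is never revived
    if visible:
        return TrackStatus.TRACKING
    if seconds_invisible > max_invisible_seconds:
        return TrackStatus.END_OF_TRACK
    return TrackStatus.SEARCHING
```

A hypothesis in the searching status returns to tracking as soon as the target is observed again, which is how a track can continue across disjoint camera views.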
Generic Multiview Visual Tracking
Recent progress in visual tracking has greatly improved tracking
performance. However, challenges such as occlusion and view change remain
obstacles in real world deployment. A natural solution to these challenges is
to use multiple cameras with multiview inputs, though existing systems are
mostly limited to specific targets (e.g. human), static cameras, and/or camera
calibration. To break through these limitations, we propose a generic multiview
tracking (GMT) framework that allows camera movement, while requiring neither
a specific object model nor camera calibration. A key innovation in our framework
is a cross-camera trajectory prediction network (TPN), which implicitly and
dynamically encodes camera geometric relations, and hence addresses missing
target issues such as occlusion. Moreover, during tracking, we assemble
information across different cameras to dynamically update a novel
collaborative correlation filter (CCF), which is shared among cameras to
achieve robustness against view change. The two components are integrated into
a correlation filter tracking framework, where the features are trained offline
using existing single view tracking datasets. For evaluation, we first
contribute a new generic multiview tracking dataset (GMTD) with careful
annotations, and then run experiments on GMTD and the PETS2009 datasets. On
both datasets, the proposed GMT algorithm shows clear advantages over
state-of-the-art ones
Deformable Distributed Multiple Detector Fusion for Multi-Person Tracking
This paper addresses fully automated multi-person tracking in complex
environments with challenging occlusion and extensive pose variations. Our
solution combines multiple detectors for a set of different regions of interest
(e.g., full-body and head) for multi-person tracking. The use of multiple
detectors leads to fewer missed detections, as it can exploit the
complementary strengths of the individual detectors. While the number of false
positives may increase with the increased number of bounding boxes detected
from multiple detectors, we propose to group the detection outputs by bounding
box location and depth information. For robustness to significant pose
variations, deformable spatial relationships between detectors are learnt in our
multi-person tracking system. On RGBD data from a live Intensive Care Unit
(ICU), we show that the proposed method significantly improves multi-person
tracking performance over state-of-the-art methods
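Grouping detector outputs by bounding box location and depth, as the abstract above proposes, could look roughly like the following Python sketch. The abstract gives no grouping rule, so the rule used here (one box contains the other's center and the depths differ by at most a threshold), the greedy assignment, the `Detection` record, and the `max_depth_gap_m` value are all assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    detector: str    # e.g. "full-body" or "head"
    x: float         # left edge of the bounding box, in pixels
    y: float         # top edge of the bounding box, in pixels
    w: float         # box width, in pixels
    h: float         # box height, in pixels
    depth_m: float   # depth of the detection, in metres


def contains_center(outer, inner):
    """True if the center of inner lies inside the box of outer."""
    cx = inner.x + inner.w / 2
    cy = inner.y + inner.h / 2
    return (outer.x <= cx <= outer.x + outer.w
            and outer.y <= cy <= outer.y + outer.h)


def same_person(a, b, max_depth_gap_m=0.3):
    """Assumed rule: boxes overlap by center containment and depths agree."""
    overlap = contains_center(a, b) or contains_center(b, a)
    return overlap and abs(a.depth_m - b.depth_m) <= max_depth_gap_m


def group_detections(detections, max_depth_gap_m=0.3):
    """Greedily put each detection into the first group it matches."""
    groups = []
    for det in detections:
        for group in groups:
            if any(same_person(det, member, max_depth_gap_m) for member in group):
                group.append(det)
                break
        else:
            groups.append([det])  # no match: start a new group
    return groups


body = Detection("full-body", 100, 50, 60, 180, 2.0)
head_near = Detection("head", 115, 55, 30, 30, 2.1)
head_far = Detection("head", 120, 60, 30, 30, 3.5)
groups = group_detections([body, head_near, head_far])
```

In this made-up example the two head boxes overlap in the image, but the depth check keeps `head_far` out of the first group, which is the kind of false positive that location alone would not separate.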
Spatiotemporal KSVD Dictionary Learning for Online Multi-target Tracking
In this paper, we present a new spatial discriminative KSVD dictionary
algorithm (STKSVD) for learning target appearance in online multi-target
tracking. Unlike other classification/recognition tasks (e.g. face or
image recognition), learning a target's appearance in online multi-target
tracking is affected by factors such as posture/articulation changes, partial
occlusion by the background scene or other targets, and background changes (a
human detection bounding box covers human parts and part of the scene). However,
we observe that these variations occur gradually relative to spatial and
temporal dynamics. We characterize the spatial and temporal information between
a target's samples through a new STKSVD appearance learning algorithm to better
discriminate sparse code, linear classifier parameters and minimize
reconstruction error in a single optimization system. Our appearance learning
algorithm and tracking framework employ two different methods of calculating
appearance similarity score in each stage of a two-stage association: a linear
classifier in the first stage, and minimum residual errors in the second stage.
Results on the 2DMOT2015 dataset, using its public Aggregated Channel
Features (ACF) human detections for all comparisons, show that our method
outperforms the existing related learning methods.
Comment: To appear in Proceedings of 15th Conference on Computer and Robot
Vision 2018 (Oral
A Survey Of Activity Recognition And Understanding The Behavior In Video Survelliance
This paper presents a review of human activity recognition and behaviour
understanding in video sequences. The key objective of this paper is to provide
a general review of the overall process of a surveillance system as currently
used. Visual surveillance systems are directed at the automatic
identification of events of interest, especially the tracking and classification
of moving objects. The processing pipeline of a video surveillance system
includes the following stages: surrounding model, object representation, object
tracking, activity recognition and behaviour understanding. The paper describes
techniques used to define a general set of activities that are applicable
to a wide range of scenes and environments in video sequences.
Comment: 14 pages, 5 figures, 5 table