FCN-rLSTM: Deep Spatio-Temporal Neural Networks for Vehicle Counting in City Cameras
In this paper, we develop deep spatio-temporal neural networks to
sequentially count vehicles from low quality videos captured by city cameras
(citycams). Citycam videos have low resolution, low frame rate, high occlusion
and large perspective, making most existing methods lose their efficacy. To
overcome limitations of existing methods and incorporate the temporal
information of traffic video, we design a novel FCN-rLSTM network to jointly
estimate vehicle density and vehicle count by connecting fully convolutional
neural networks (FCN) with long short-term memory (LSTM) networks in a residual
learning fashion. Such design leverages the strengths of FCN for pixel-level
prediction and the strengths of LSTM for learning complex temporal dynamics.
The residual learning connection reformulates the vehicle count regression as
learning residual functions with reference to the sum of densities in each
frame, which significantly accelerates the training of networks. To preserve
feature map resolution, we propose a Hyper-Atrous combination to integrate
atrous convolution in FCN and combine feature maps of different convolution
layers. FCN-rLSTM enables refined feature representation and a novel end-to-end
trainable mapping from pixels to vehicle count. We extensively evaluated the
proposed method on different counting tasks with three datasets, with
experimental results demonstrating its effectiveness and robustness. In
particular, FCN-rLSTM reduces the mean absolute error (MAE) from 5.31 to 4.21
on TRANCOS, and reduces the MAE from 2.74 to 1.53 on WebCamT. The training
process is accelerated by a factor of 5 on average.
Comment: Accepted by the International Conference on Computer Vision (ICCV), 2017
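The abstract outlines the architecture only at a high level. A minimal PyTorch sketch of the general idea, assuming illustrative layer sizes, dilation rates, and a simple global-pooled feature feeding the LSTM (not the authors' exact FCN-rLSTM configuration), could look like this:

```python
# Hedged sketch of the FCN-rLSTM idea: an FCN predicts a per-frame density map,
# an LSTM models temporal dynamics of pooled frame features, and the count is
# the density-map sum plus a learned residual. All sizes are assumptions.
import torch
import torch.nn as nn


class FCNrLSTMSketch(nn.Module):
    def __init__(self, hidden: int = 64):
        super().__init__()
        # Tiny fully convolutional backbone; dilated (atrous) convolutions stand
        # in for the paper's Hyper-Atrous combination.
        self.fcn = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, padding=2, dilation=2), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=4, dilation=4), nn.ReLU(),
        )
        self.density_head = nn.Conv2d(32, 1, 1)            # pixel-level density map
        self.lstm = nn.LSTM(input_size=32, hidden_size=hidden, batch_first=True)
        self.residual_head = nn.Linear(hidden, 1)           # residual count correction

    def forward(self, frames: torch.Tensor):
        # frames: (batch, time, 3, H, W)
        b, t, c, h, w = frames.shape
        feats = self.fcn(frames.view(b * t, c, h, w))
        density = self.density_head(feats).view(b, t, h, w)
        base_count = density.sum(dim=(2, 3))                # sum of densities per frame
        pooled = feats.mean(dim=(2, 3)).view(b, t, -1)      # global features over time
        temporal, _ = self.lstm(pooled)
        residual = self.residual_head(temporal).squeeze(-1)
        # Residual learning: count = density sum + learned temporal residual.
        return density, base_count + residual


if __name__ == "__main__":
    model = FCNrLSTMSketch()
    clip = torch.randn(2, 4, 3, 64, 64)                     # 2 clips of 4 frames
    density, count = model(clip)
    print(density.shape, count.shape)                        # (2, 4, 64, 64), (2, 4)
```

The point of the residual connection is that the LSTM only has to learn a correction on top of the density-map sum, which is what the abstract credits for the faster training.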
Understanding Traffic Density from Large-Scale Web Camera Data
Understanding traffic density from large-scale web camera (webcam) videos is
a challenging problem because such videos have low spatial and temporal
resolution, high occlusion and large perspective. To deeply understand traffic
density, we explore both deep learning based and optimization based methods. To
avoid individual vehicle detection and tracking, both methods map the image
into a vehicle density map, one based on rank-constrained regression and the
other on fully convolutional networks (FCN). The regression-based method learns
different weights for different blocks of the image to increase the degrees of
freedom of the weights and to embed perspective information. The FCN-based
method jointly estimates the vehicle density map and vehicle count with a residual
learning framework to perform end-to-end dense prediction, allowing arbitrary
image resolution, and adapting to different vehicle scales and perspectives. We
analyze and compare both methods, and draw insights from the optimization-based
method to improve the deep model. Since existing datasets do not cover all the
challenges in our work, we collected and labelled a large-scale traffic video
dataset, containing 60 million frames from 212 webcams. Both methods are
extensively evaluated and compared on different counting tasks and datasets.
The FCN-based method significantly reduces the mean absolute error from 10.99 to
5.31 on the public dataset TRANCOS compared with the state-of-the-art baseline.
Comment: Accepted by CVPR 2017. Preprint version was uploaded on
http://welcome.isr.tecnico.ulisboa.pt/publications/understanding-traffic-density-from-large-scale-web-camera-data
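For the regression branch, the abstract only says that block-wise weights with a rank constraint embed perspective information. A hedged illustration of that idea, where the block grid, feature dimension, and the low-rank factorization used as the rank constraint are all assumptions rather than the paper's formulation, might be:

```python
# Illustrative block-wise count regression: each block of the feature map gets
# its own linear weight, and factorizing the block-weight matrix as U @ V caps
# its rank. Not the paper's exact model; sizes are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BlockwiseCountRegressor(nn.Module):
    def __init__(self, feat_dim: int = 32, grid: int = 8, rank: int = 3):
        super().__init__()
        n_blocks = grid * grid
        # Rank-constrained block weights: W = U @ V has shape (n_blocks, feat_dim)
        # but rank at most `rank`.
        self.U = nn.Parameter(torch.randn(n_blocks, rank) * 0.01)
        self.V = nn.Parameter(torch.randn(rank, feat_dim) * 0.01)
        self.grid = grid

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, feat_dim, H, W); pool each block to one feature vector.
        pooled = F.adaptive_avg_pool2d(feats, self.grid)        # (b, c, grid, grid)
        pooled = pooled.flatten(2).transpose(1, 2)              # (b, n_blocks, c)
        weights = self.U @ self.V                               # (n_blocks, c)
        # Per-block contribution, summed to the frame-level count.
        block_counts = (pooled * weights.unsqueeze(0)).sum(-1)  # (b, n_blocks)
        return block_counts.sum(-1)                             # (b,)


if __name__ == "__main__":
    reg = BlockwiseCountRegressor()
    print(reg(torch.randn(4, 32, 64, 64)).shape)                # torch.Size([4])
```

Giving each block its own weight is what lets the model absorb perspective (near blocks count fewer pixels per vehicle than far blocks), while the rank constraint keeps the weights from overfitting.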
ParGANDA: Making Synthetic Pedestrians A Reality For Object Detection
Object detection is a key technique in a number of Computer Vision
applications, but it often requires large amounts of annotated data to achieve
decent results. Moreover, for pedestrian detection specifically, the collected
data might contain some personally identifiable information (PII), which is
highly restricted in many countries. This label-intensive and privacy-sensitive
task has recently led to increasing interest in training detection models on
synthetically generated pedestrian datasets collected with a photo-realistic
video game engine. The engine is able to generate unlimited amounts of data
with precise and consistent annotations, which gives potential for significant
gains in real-world applications. However, the use of synthetic data for
training introduces a synthetic-to-real domain shift that degrades the final
performance. To close the gap between the real and
synthetic data, we propose to use a Generative Adversarial Network (GAN), which
performs parameterized unpaired image-to-image translation to generate more
realistic images. The key benefit of using the GAN is its intrinsic preference
for low-level changes over geometric ones, which means the annotations of a
given synthetic image remain accurate even after domain translation, thus
eliminating the need for labeling real data. We extensively experimented with
the proposed method, using the MOTSynth dataset for training and the MOT17 and
MOT20 detection datasets for testing, with experimental results demonstrating
its effectiveness. Our approach not only produces visually plausible samples
but also does not require any labels from the real domain, making it applicable
to a variety of downstream tasks.
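As a rough picture of the translation setup described above, the sketch below trains a small generator/discriminator pair on unpaired synthetic and real batches. The parameterization that gives ParGANDA its name is not reproduced here, and the L1 term that keeps the output close to the input is an assumption standing in for whatever the paper uses to bias the GAN toward low-level appearance changes:

```python
# Hedged sketch of unpaired synthetic-to-real translation with a GAN.
# The reconstruction (L1) term discourages geometric changes so that box
# annotations stay valid after translation. Networks and weights are placeholders.
import torch
import torch.nn as nn

generator = nn.Sequential(                 # image-to-image: keeps resolution
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 3, 3, padding=1), nn.Tanh(),
)
discriminator = nn.Sequential(             # patch-level real/fake scores
    nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
    nn.Conv2d(32, 1, 4, stride=2, padding=1),
)
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()


def train_step(synthetic: torch.Tensor, real: torch.Tensor) -> None:
    # Discriminator: real images -> 1, translated synthetic images -> 0.
    fake = generator(synthetic)
    d_real, d_fake = discriminator(real), discriminator(fake.detach())
    loss_d = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad()
    loss_d.backward()
    opt_d.step()

    # Generator: fool the discriminator while staying close to the input,
    # which biases the translation toward low-level (appearance) changes.
    d_fake = discriminator(fake)
    loss_g = bce(d_fake, torch.ones_like(d_fake)) + 10.0 * (fake - synthetic).abs().mean()
    opt_g.zero_grad()
    loss_g.backward()
    opt_g.step()


if __name__ == "__main__":
    train_step(torch.rand(2, 3, 64, 64) * 2 - 1, torch.rand(2, 3, 64, 64) * 2 - 1)
```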
TDT: Teaching Detectors to Track without Fully Annotated Videos
Recently, one-stage trackers that use a joint model to predict both
detections and appearance embeddings in one forward pass have received much
attention and achieved state-of-the-art results on the Multi-Object Tracking
(MOT) benchmarks. However, their success depends on the availability of videos
that are fully annotated with tracking data, which is expensive and hard to
obtain. This can limit model generalization. In comparison, the two-stage
approach, which performs detection and embedding separately, is slower but
easier to train, as its data are easier to annotate. We propose to combine the
best of the two worlds through a data distillation approach. Specifically, we
use a teacher embedder, trained on Re-ID datasets, to generate pseudo
appearance embedding labels for the detection datasets. Then, we use the
augmented dataset to train a detector that is also capable of regressing these
pseudo-embeddings in a fully-convolutional fashion. Our proposed one-stage
solution matches the two-stage counterpart in quality but is 3 times faster.
Even though the teacher embedder has not seen any tracking data during
training, our proposed tracker achieves competitive performance with some
popular trackers (e.g., JDE) trained with fully labeled tracking data.
Comment: Workshop on Learning with Limited Labelled Data for Image and Video
Understanding (L3D-IVU), CVPR 2022 Workshop
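A hedged sketch of the distillation recipe the abstract describes, with placeholder networks: a frozen teacher embedder labels detection-dataset boxes with pseudo appearance embeddings, and a fully convolutional student head is trained to regress them at the box centers (the center sampling and cosine loss are assumptions, not the paper's exact choices):

```python
# Data distillation sketch: teacher Re-ID embedder -> pseudo embedding labels;
# student embedding head regresses them in a fully convolutional way.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB_DIM = 128
teacher = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, EMB_DIM))
teacher.eval()                                    # stands in for a frozen Re-ID model

backbone = nn.Conv2d(3, 64, 3, padding=1)         # stand-in detector backbone
embed_head = nn.Conv2d(64, EMB_DIM, 1)            # per-pixel embedding map
opt = torch.optim.Adam(list(backbone.parameters()) + list(embed_head.parameters()))


def distill_step(image: torch.Tensor, boxes: torch.Tensor) -> float:
    # image: (1, 3, H, W); boxes: (N, 4) as (x1, y1, x2, y2) in pixels.
    emb_map = embed_head(backbone(image))                         # (1, EMB_DIM, H, W)
    losses = []
    for x1, y1, x2, y2 in boxes.tolist():
        crop = image[:, :, int(y1):int(y2), int(x1):int(x2)]
        with torch.no_grad():
            target = F.normalize(teacher(crop), dim=1)            # pseudo-label
        cx, cy = int((x1 + x2) / 2), int((y1 + y2) / 2)
        student = F.normalize(emb_map[:, :, cy, cx], dim=1)       # embedding at box center
        losses.append(1 - (student * target).sum())               # cosine distance
    loss = torch.stack(losses).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


if __name__ == "__main__":
    img = torch.rand(1, 3, 128, 128)
    print(distill_step(img, torch.tensor([[10., 20., 60., 90.], [40., 30., 100., 110.]])))
```

Because the pseudo-labels come from the teacher rather than from tracking annotations, any detection dataset can be "augmented" this way, which is the sense in which the detector is taught to track without fully annotated videos.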
Context R-CNN: Long Term Temporal Context for Per-Camera Object Detection
In static monitoring cameras, useful contextual information can stretch far beyond the few seconds typical video understanding models might see: subjects may exhibit similar behavior over multiple days, and background objects remain static. Due to power and storage constraints, sampling frequencies are low, often no faster than one frame per second, and sometimes are irregular due to the use of a motion trigger. In order to perform well in this setting, models must be robust to irregular sampling rates. In this paper, we propose a method that leverages temporal context from the unlabeled frames of a novel camera to improve performance at that camera. Specifically, we propose an attention-based approach that allows our model, Context R-CNN, to index into a long-term memory bank constructed on a per-camera basis and aggregate contextual features from other frames to boost object detection performance on the current frame. We apply Context R-CNN to two settings: (1) species detection using camera traps, and (2) vehicle detection in traffic cameras, showing in both settings that Context R-CNN leads to performance gains over strong baselines. Moreover, we show that increasing the contextual time horizon leads to improved results. When applied to camera trap data from the Snapshot Serengeti dataset, Context R-CNN with context from up to a month of images outperforms a single-frame baseline by 17.9% mAP, and outperforms S3D (a 3D convolution-based baseline) by 11.2% mAP.
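The core mechanism the abstract describes is attention from the current frame's proposal features into a per-camera long-term memory bank. A minimal sketch of such an attention block, with illustrative dimensions rather than the paper's actual Context R-CNN heads, might be:

```python
# Hedged sketch: proposal features attend over a long-term, per-camera memory
# bank of features from other frames; the aggregated context is added back
# before the detection head. Dimensions are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ContextAttention(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, proposals: torch.Tensor, memory_bank: torch.Tensor) -> torch.Tensor:
        # proposals: (num_boxes, dim) features from the current frame.
        # memory_bank: (num_memory, dim) features gathered from other frames
        # of the same camera, possibly spanning weeks of irregular samples.
        attn = (self.query(proposals) @ self.key(memory_bank).t()) * self.scale
        context = F.softmax(attn, dim=-1) @ self.value(memory_bank)
        return proposals + context      # context-augmented features for the detection head


if __name__ == "__main__":
    block = ContextAttention()
    boxes = torch.randn(12, 256)        # 12 proposals in the current frame
    bank = torch.randn(500, 256)        # memory built over a long horizon at this camera
    print(block(boxes, bank).shape)     # torch.Size([12, 256])
```

Because attention weights are computed per memory entry, the mechanism does not assume a fixed or regular frame rate, which matches the irregular sampling the abstract emphasizes.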