252 research outputs found
Abnormal Event Detection in Videos using Spatiotemporal Autoencoder
We present an efficient method for detecting anomalies in videos. Recent applications of convolutional neural networks have shown the promise of convolutional layers for object detection and recognition, especially in images. However, convolutional neural networks are supervised models and require labels as learning signals. We propose a spatiotemporal architecture for anomaly detection in videos, including crowded scenes. Our architecture has two main components: one for spatial feature representation, and one for learning the temporal evolution of the spatial features. Experimental results on the Avenue, Subway, and UCSD benchmarks confirm that the detection accuracy of our method is comparable to state-of-the-art methods, at a considerable speed of up to 140 fps.
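Reconstruction-based anomaly detectors of this kind are commonly evaluated by converting per-frame reconstruction error into a regularity score, where low regularity flags abnormal frames. A minimal sketch of that standard scoring step (an illustration of the common convention, not necessarily this paper's exact procedure):

```python
import numpy as np

def regularity_score(errors):
    """Map per-frame reconstruction errors e(t) to a regularity score
    s(t) = 1 - (e(t) - min e) / (max e - min e).
    Frames with large reconstruction error get low regularity (likely abnormal)."""
    errors = np.asarray(errors, dtype=float)
    e_min, e_max = errors.min(), errors.max()
    return 1.0 - (errors - e_min) / (e_max - e_min)

# Hypothetical per-frame errors from an autoencoder; frame 3 reconstructs poorly.
scores = regularity_score([0.1, 0.1, 0.9, 0.2])
```

Thresholding the regularity score (or detecting local minima in it) then yields the abnormal-event detections.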
Simple vs complex temporal recurrences for video saliency prediction
This paper investigates modifying an existing neural network architecture for static saliency prediction using two types of recurrences that integrate information from the temporal domain. The first modification is the addition of a ConvLSTM within the architecture, while the second is a conceptually simple exponential moving average of an internal convolutional state. We use weights pre-trained on the SALICON dataset and fine-tune our model on DHF1K. Both modifications achieve state-of-the-art performance and produce similar saliency maps. Source code is available at https://git.io/fjPiB
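The second recurrence described above is simple enough to sketch directly: the internal convolutional state is smoothed over time with an exponential moving average. A minimal numpy illustration of that update rule (the smoothing factor `alpha` and the initialization are assumptions for illustration):

```python
import numpy as np

def ema_recurrence(feature_maps, alpha=0.1):
    """Exponential moving average over a sequence of convolutional feature maps:
    s_t = alpha * x_t + (1 - alpha) * s_{t-1}, initialized with the first map."""
    smoothed = []
    s = feature_maps[0]
    for x in feature_maps:
        s = alpha * x + (1.0 - alpha) * s
        smoothed.append(s)
    return smoothed
```

Unlike a ConvLSTM, this recurrence adds no learnable parameters beyond (optionally) `alpha`, which is what makes it the "conceptually simple" alternative in the comparison.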
Looking Ahead: Anticipating Pedestrians Crossing with Future Frames Prediction
In this paper, we present an end-to-end future-prediction model that focuses on pedestrian safety. Specifically, our model uses previous video frames, recorded from the perspective of the vehicle, to predict whether a pedestrian will cross in front of the vehicle. The long-term goal of this work is to design a fully autonomous system that acts and reacts as a defensive human driver would: predicting future events and reacting to mitigate risk. We focus on pedestrian-vehicle interactions because of the high risk of harm to the pedestrian if their actions are mispredicted. Our end-to-end model consists of two stages: the first stage is an encoder/decoder network that learns to predict future video frames; the second stage is a deep spatio-temporal network that uses the predicted frames from the first stage to predict the pedestrian's future action. Our system achieves state-of-the-art accuracy on pedestrian behavior prediction and future frame prediction on the Joint Attention for Autonomous Driving (JAAD) dataset.
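The two-stage data flow described above can be sketched as a pipeline: stage one maps observed frames to predicted future frames, and stage two maps those predicted frames to a crossing decision. Both functions below are placeholders labeled as such (the real stages are deep networks; the frame shapes, `horizon`, and `threshold` are assumptions for illustration):

```python
import numpy as np

def predict_future_frames(past_frames, horizon=3):
    """Stage 1 placeholder: the paper uses an encoder/decoder network here.
    This stand-in naively repeats the last observed frame `horizon` times."""
    return np.repeat(past_frames[-1:], horizon, axis=0)

def predict_crossing(frames, threshold=0.5):
    """Stage 2 placeholder: the paper uses a deep spatio-temporal network here.
    This stand-in thresholds mean intensity as a dummy crossing score."""
    score = float(frames.mean())
    return score > threshold

past = np.random.rand(8, 32, 32)        # 8 observed grayscale frames (assumed shape)
future = predict_future_frames(past)    # stage 1: anticipate future frames
will_cross = predict_crossing(future)   # stage 2: classify the pedestrian's action
```

The point of the sketch is the interface: stage two consumes only the *predicted* frames from stage one, which is what lets the system anticipate a crossing before it happens.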
Video Prediction by Efficient Transformers
Video prediction is a challenging computer vision task with a wide range of applications. In this work, we present a new family of Transformer-based models for video prediction. First, an efficient local spatial-temporal separation attention mechanism is proposed to reduce the complexity of standard Transformers. Then, a fully autoregressive model, a partially autoregressive model, and a non-autoregressive model are developed based on the new efficient Transformer. The partially autoregressive model performs similarly to the fully autoregressive model but with faster inference. The non-autoregressive model not only achieves faster inference but also mitigates the quality degradation of its autoregressive counterparts, although it requires additional parameters and an additional loss function for learning. Using the same attention mechanism, we conduct a comprehensive study comparing the three proposed video prediction variants. Experiments show that the proposed models are competitive with more complex state-of-the-art convolutional-LSTM-based models. The source code is available at https://github.com/XiYe20/VPTR
Comment: Accepted by Image and Vision Computing. arXiv admin note: text overlap with arXiv:2203.1583
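The efficiency gain of separating spatial and temporal attention can be illustrated in a few lines: instead of jointly attending over all T*N video tokens at cost O((T*N)^2), attention runs once within each frame and once across time at each spatial location, at cost O(T*N^2 + N*T^2). The sketch below shows this general factorization idea only, not the paper's exact local attention mechanism:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product self-attention over the second-to-last axis."""
    d = q.shape[-1]
    w = softmax(q @ k.swapaxes(-1, -2) / np.sqrt(d))
    return w @ v

def factorized_st_attention(x):
    """Attend over space and time separately instead of jointly.
    x: (T, N, D) -- T frames, N spatial tokens per frame, D channels."""
    x = attention(x, x, x)        # spatial pass: each frame attends within itself
    xt = x.transpose(1, 0, 2)     # (N, T, D): fix a location, vary time
    xt = attention(xt, xt, xt)    # temporal pass: each location attends across frames
    return xt.transpose(1, 0, 2)  # back to (T, N, D)
```

Both passes preserve the token layout, so the factorized block is a drop-in replacement for joint attention in a Transformer stack, which is what makes the autoregressive/non-autoregressive comparison above possible with a shared mechanism.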
- …