Predictive Coding for Dynamic Visual Processing: Development of Functional Hierarchy in a Multiple Spatio-Temporal Scales RNN Model
The current paper proposes a novel predictive coding type neural network
model, the predictive multiple spatio-temporal scales recurrent neural network
(P-MSTRNN). The P-MSTRNN learns to predict visually perceived human whole-body
cyclic movement patterns by exploiting multiscale spatio-temporal constraints
imposed on network dynamics by using differently sized receptive fields as well
as different time constant values for each layer. After learning, the network
becomes able to proactively imitate target movement patterns by inferring or
recognizing corresponding intentions by means of the regression of prediction
error. Results show that the network can develop a functional hierarchy by
developing a different type of dynamic structure at each layer. The paper
examines how model performance during pattern generation as well as predictive
imitation varies depending on the stage of learning. The number of limit cycle
attractors corresponding to target movement patterns increases as learning
proceeds. Moreover, transient dynamics that develop early in the learning process already support successful pattern generation and predictive imitation. The paper concludes that exploiting transient dynamics facilitates successful task performance during early learning periods.

Comment: Accepted in Neural Computation (MIT Press)
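As background for the timescale mechanism described above, the multiple-timescale family of RNNs typically uses a leaky-integrator update in which each layer's time constant sets how quickly its internal state changes. The following Python sketch illustrates that standard update for one fast and one slow layer; the connectivity, layer names, and time-constant values are illustrative assumptions, not taken from the paper.

import numpy as np

def mstrnn_step(u_fast, u_slow, x, W, tau_fast=2.0, tau_slow=70.0):
    # Activations of each layer from the current internal states.
    h_fast, h_slow = np.tanh(u_fast), np.tanh(u_slow)
    # Pre-activations: the fast layer sees the input, its own recurrence,
    # and top-down feedback; the slow layer sees bottom-up input.
    z_fast = W['in'] @ x + W['ff'] @ h_fast + W['td'] @ h_slow
    z_slow = W['ss'] @ h_slow + W['bu'] @ h_fast
    # Leaky integration: a large time constant yields slow state change.
    u_fast = (1.0 - 1.0 / tau_fast) * u_fast + z_fast / tau_fast
    u_slow = (1.0 - 1.0 / tau_slow) * u_slow + z_slow / tau_slow
    return u_fast, u_slow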
Distributed video coding for wireless video sensor networks: a review of the state-of-the-art architectures
Distributed video coding (DVC) is a relatively new video coding architecture that originates from two fundamental theorems, namely Slepian–Wolf and Wyner–Ziv. Recent research developments have made DVC attractive for applications in the emerging domain of wireless video sensor networks (WVSNs). This paper reviews the state-of-the-art DVC architectures with a focus on understanding their opportunities and gaps in addressing the operational requirements and application needs of WVSNs.
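For reference, the two theorems underpinning DVC can be stated briefly (standard information-theoretic notation, with H denoting entropy). Slepian–Wolf: two correlated sources X and Y can be encoded separately and decoded jointly without loss, provided the rates satisfy R_X ≥ H(X|Y), R_Y ≥ H(Y|X), and R_X + R_Y ≥ H(X, Y). Wyner–Ziv extends this to lossy coding of X when the side information Y is available only at the decoder. This is what makes DVC attractive for WVSNs: the burden of exploiting inter-source correlation shifts from the resource-constrained encoder to the decoder.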
Unsupervised Video Analysis Based on a Spatiotemporal Saliency Detector
Visual saliency, which predicts regions in the field of view that draw the
most visual attention, has attracted a lot of interest from researchers. It has
already been used in several vision tasks, e.g., image classification, object
detection, and foreground segmentation. Recently, spectrum-analysis-based visual saliency approaches, in which the phase information of the image is used to construct the saliency map, have gained popularity due to their simplicity and good performance. In this paper, we propose a new approach for
detecting spatiotemporal visual saliency based on the phase spectrum of the
videos, which is easy to implement and computationally efficient. With the
proposed algorithm, we also study how the spatiotemporal saliency can be used in two important vision tasks, abnormality detection and spatiotemporal interest point detection. The proposed algorithm is evaluated on several commonly used datasets in comparison with state-of-the-art methods from the literature. The experiments demonstrate the effectiveness of the proposed approach to spatiotemporal visual saliency detection and its application to the above vision tasks.

Comment: 21 pages
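To make the phase-spectrum idea concrete, a minimal single-frame sketch in Python follows; the paper's method operates spatiotemporally on video volumes, and the smoothing parameter below is an illustrative choice.

import numpy as np
from scipy.ndimage import gaussian_filter

def phase_saliency(frame):
    # Keep only the phase of the Fourier transform (unit magnitude),
    # transform back, and smooth the squared response into a saliency map.
    f = np.fft.fft2(frame.astype(np.float64))
    phase_only = np.exp(1j * np.angle(f))
    recon = np.real(np.fft.ifft2(phase_only))
    saliency = gaussian_filter(recon ** 2, sigma=3)
    return saliency / saliency.max()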
Deep Reference Generation with Multi-Domain Hierarchical Constraints for Inter Prediction
Inter prediction is an important module in video coding for temporal
redundancy removal, where similar reference blocks are searched from previously
coded frames and employed to predict the block to be coded. Although
traditional video codecs can estimate and compensate for block-level motions,
their inter prediction performance is still heavily affected by the remaining
inconsistent pixel-wise displacement caused by irregular rotation and
deformation. In this paper, we address the problem by proposing a deep frame
interpolation network to generate additional reference frames in coding
scenarios. First, we summarize the previous adaptive convolutions used for
frame interpolation and propose a factorized kernel convolutional network to
improve the modeling capacity and simultaneously keep its compact form. Second,
to better train this network, multi-domain hierarchical constraints are
introduced to regularize the training of our factorized kernel convolutional
network. For the spatial domain, we use a gradually down-sampled and up-sampled
auto-encoder to generate the factorized kernels for frame interpolation at
different scales. For the quality domain, considering the inconsistent quality of
the input frames, the factorized kernel convolution is modulated with
quality-related features to learn to exploit more information from high quality
frames. For the frequency domain, a sum of absolute transformed difference loss
that performs frequency transformation is utilized to facilitate network
optimization from the viewpoint of coding performance. With the well-designed frame interpolation network regularized by multi-domain hierarchical constraints, our method surpasses HEVC with an average BD-rate saving of 6.1%, and up to 11.0%, for the luma component under the random access configuration.
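As a concrete reading of the frequency-domain constraint, the sum of absolute transformed differences (SATD) used in video coding applies a Hadamard transform to the residual block and sums the absolute coefficients. A minimal 4x4 Python sketch follows; the paper's exact transform size and loss weighting are assumptions here.

import numpy as np

# 4x4 Hadamard matrix (unnormalized), as used in SATD cost computation.
H4 = np.array([[1,  1,  1,  1],
               [1, -1,  1, -1],
               [1,  1, -1, -1],
               [1, -1, -1,  1]], dtype=np.float64)

def satd4x4(block_a, block_b):
    # Transform the residual with H4 on both sides, then sum magnitudes.
    d = block_a.astype(np.float64) - block_b.astype(np.float64)
    return np.abs(H4 @ d @ H4.T).sum()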
Mining for meaning: from vision to language through multiple networks consensus
Describing visual data in natural language is a very challenging task, at
the intersection of computer vision, natural language processing and machine
learning. Language goes well beyond the description of physical objects and
their interactions and can convey the same abstract idea in many ways. It is
both about content at the highest semantic level as well as about fluent form.
Here we propose an approach to describe videos in natural language by reaching
a consensus among multiple encoder-decoder networks. Finding such a consensual
linguistic description, which shares common properties with a larger group, has
a better chance to convey the correct meaning. We propose and train several
network architectures and use different types of image, audio and video
features. Each model produces its own description of the input video and the
best one is chosen through an efficient, two-phase consensus process. We
demonstrate the strength of our approach by obtaining state-of-the-art results on the challenging MSR-VTT dataset.

Comment: Accepted at BMVC 2018
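The abstract does not spell out the two-phase consensus process; the following hypothetical Python sketch shows only the core consensus idea, ranking each candidate description by its mean similarity to the other models' outputs. The similarity function (for example a CIDEr-style sentence similarity) is an assumed input.

def consensus_pick(candidates, similarity):
    # Score each candidate by how well it agrees with the others,
    # then return the most consensual description.
    scores = []
    for i, cand in enumerate(candidates):
        others = [similarity(cand, o) for j, o in enumerate(candidates) if j != i]
        scores.append(sum(others) / len(others))
    return candidates[scores.index(max(scores))]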
Anomaly Detection and Localization in Crowded Scenes by Motion-field Shape Description and Similarity-based Statistical Learning
In crowded scenes, detection and localization of abnormal behaviors is
challenging in that high-density people make object segmentation and tracking
extremely difficult. We associate the optical flows of multiple frames to
capture short-term trajectories and introduce the histogram-based shape
descriptor referred to as shape contexts to describe such short-term
trajectories. Furthermore, we propose a K-NN similarity-based statistical model
to detect anomalies over time and space, which is an unsupervised one-class learning algorithm requiring neither clustering nor any prior assumptions. First, we retrieve the K-NN samples from the training set with respect to the testing
sample, and then use the similarities between every pair of the K-NN samples to
construct a Gaussian model. Finally, the probabilities of the similarities from
the testing sample to the K-NN samples under the Gaussian model are calculated
in the form of a joint probability. Abnormal events can be detected by judging
whether the joint probability is below predefined thresholds in terms of time
and space, separately. Such a scheme can adapt to the whole scene, since the
probability computed as such is not affected by motion distortions arising from
perspective distortion. We conduct experiments on real-world surveillance
videos, and the results demonstrate that the proposed method can reliably
detect and locate the abnormal events in the video sequences, outperforming the
state-of-the-art approaches.
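A minimal Python sketch of the scoring scheme as described, with a generic similarity function standing in for the paper's shape-context comparison of short-term trajectories; in the paper, the resulting score is thresholded separately in time and space.

import numpy as np
from scipy.stats import norm

def knn_gaussian_log_prob(x, train, k, similarity):
    # Retrieve the K most similar training samples to the test sample.
    sims = np.array([similarity(x, t) for t in train])
    nn_idx = np.argsort(-sims)[:k]
    nn = [train[i] for i in nn_idx]
    # Fit a Gaussian to the pairwise similarities among the K neighbours.
    pairs = [similarity(a, b) for i, a in enumerate(nn) for b in nn[i + 1:]]
    mu, sigma = np.mean(pairs), np.std(pairs) + 1e-8
    # Joint probability of the test-to-neighbour similarities (log domain).
    return norm.logpdf(sims[nn_idx], mu, sigma).sum()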
Detection of Unknown Anomalies in Streaming Videos with Generative Energy-based Boltzmann Models
Abnormal event detection is one of the important objectives in research and
practical applications of video surveillance. However, there are still three
challenging problems for most anomaly detection systems in practical settings:
limited labeled data, ambiguous definition of "abnormal" and expensive feature
engineering steps. This paper introduces a unified detection framework to
handle these challenges using energy-based models, which are powerful tools for
unsupervised representation learning. Our proposed models are first trained
on unlabeled raw pixels of image frames from an input video rather than
hand-crafted visual features; and then identify the locations of abnormal
objects based on the errors between the input video and its reconstruction
produced by the models. To handle video stream, we develop an online version of
our framework, wherein the model parameters are updated incrementally with the
image frames arriving on the fly. Our experiments show that our detectors,
using Restricted Boltzmann Machines (RBMs) and Deep Boltzmann Machines (DBMs)
as core modules, achieve superior anomaly detection performance to unsupervised
baselines and obtain accuracy comparable with state-of-the-art approaches when evaluated at the pixel level. More importantly, we discover that our
system trained with DBMs is able to simultaneously perform scene clustering and
scene reconstruction. This capacity not only distinguishes our method from
other existing detectors but also offers a unique tool to investigate and
understand how the model works.

Comment: This manuscript is under consideration at Pattern Recognition Letters
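To make the reconstruction-error criterion concrete, a simplified Python sketch using scikit-learn's BernoulliRBM on binarized pixels follows. The paper trains RBMs and DBMs on raw frames with incremental online updates, which this offline version does not reproduce; the model size and threshold are illustrative.

import numpy as np
from sklearn.neural_network import BernoulliRBM

def anomaly_masks(frames, threshold=0.15):
    # frames: array of shape (n, h, w); scale intensities to [0, 1].
    X = frames.reshape(len(frames), -1).astype(np.float64)
    X = (X - X.min()) / (np.ptp(X) + 1e-8)
    rbm = BernoulliRBM(n_components=64, n_iter=20).fit(X)
    # One Gibbs step (v -> h -> v) gives the reconstruction.
    recon = rbm.gibbs(X > 0.5).astype(np.float64)
    errors = np.abs(X - recon).reshape(frames.shape)
    return errors > threshold  # per-pixel anomaly mask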
Unsupervised Learning from Video with Deep Neural Embeddings
Because of the rich dynamical structure of videos and their ubiquity in
everyday life, it is a natural idea that video data could serve as a powerful
unsupervised learning signal for training visual representations in deep neural
networks. However, instantiating this idea, especially at large scale, has
remained a significant artificial intelligence challenge. Here we present the
Video Instance Embedding (VIE) framework, which extends powerful recent
unsupervised loss functions for learning deep nonlinear embeddings to
multi-stream temporal processing architectures on large-scale video datasets.
We show that VIE-trained networks substantially advance the state of the art in
unsupervised learning from video datastreams, both for action recognition on the Kinetics dataset and for object recognition on the ImageNet dataset. We show
that a hybrid model with both static and dynamic processing pathways is optimal
for both transfer tasks, and provide analyses indicating how the pathways
differ. Taken in context, our results suggest that deep neural embeddings are a
promising approach to unsupervised visual learning across a wide variety of
domains.

Comment: To appear in CVPR 2020
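As background, a minimal Python sketch of the instance-discrimination objective that this family of unsupervised embedding methods builds on; this is not VIE's exact loss, and the memory-bank setup and temperature are illustrative.

import numpy as np

def instance_nce_loss(emb, bank, idx, tau=0.07):
    # Treat each video as its own class: maximize the softmax probability
    # of matching a clip's embedding to its own memory-bank slot.
    emb = emb / np.linalg.norm(emb)
    bank = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    logits = bank @ emb / tau
    logits -= logits.max()  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[idx]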
Video Summarization with Attention-Based Encoder-Decoder Networks
This paper addresses the problem of supervised video summarization by
formulating it as a sequence-to-sequence learning problem, where the input is a sequence of original video frames and the output is a keyshot sequence. Our key idea is to learn a deep summarization network with an attention mechanism that mimics the way humans select keyshots. To this end, we propose a novel
video summarization framework named Attentive encoder-decoder networks for
Video Summarization (AVS), in which the encoder uses a Bidirectional Long
Short-Term Memory (BiLSTM) to encode the contextual information among the input
video frames. As for the decoder, two attention-based LSTM networks are
explored by using additive and multiplicative objective functions,
respectively. Extensive experiments are conducted on two video summarization benchmark datasets, i.e., SumMe and TVSum. The results demonstrate the superiority of the proposed AVS-based approaches over the state-of-the-art approaches, with remarkable improvements from 0.8% to 3% on the two datasets, respectively.

Comment: 9 pages, 7 figures
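For reference, the additive and multiplicative attention forms contrasted above are conventionally defined as follows; this generic Python sketch uses illustrative shapes rather than the paper's exact parameterization.

import numpy as np

def multiplicative_scores(s_t, enc_states, W):
    # Bilinear scoring: score_i = h_i^T W s_t for each encoder state h_i.
    return enc_states @ (W @ s_t)

def additive_scores(s_t, enc_states, W1, W2, v):
    # Additive scoring: score_i = v^T tanh(W1 h_i + W2 s_t).
    return np.tanh(enc_states @ W1.T + W2 @ s_t) @ v

def attention_weights(scores):
    # Softmax over encoder positions.
    e = np.exp(scores - scores.max())
    return e / e.sum()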
Deep Predictive Video Compression with Bi-directional Prediction
Recently, deep image compression has shown significant progress in terms of coding efficiency and image quality improvement. However, relatively little attention has been paid to video compression using deep learning networks. In this paper, we propose a deep learning based bi-predictive coding network, called BP-DVC Net, for video compression. Learning from the lessons of conventional video coding, we incorporate a B-frame coding structure in our BP-DVC Net. While bi-predictive coding in conventional video codecs requires transmitting the motion vectors for block motion and the prediction residues to the decoder side, our BP-DVC Net incorporates optical flow estimation networks on both the encoder and decoder sides so that no motion information needs to be transmitted to the decoder, improving coding efficiency. Also, a bi-prediction network in the BP-DVC Net is proposed and used to precisely predict the current frame and to yield residues that are as small as possible. Furthermore, our BP-DVC Net allows the compressed feature maps to be entropy-coded
using the temporal context among the feature maps of adjacent frames. The
BP-DVC Net has an end-to-end video compression architecture with newly designed
flow and prediction losses. Experimental results show that the compression
performance of our proposed method is comparable to that of H.264 and HEVC in terms of PSNR and MS-SSIM.
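The motion-compensation step that bi-prediction rests on can be illustrated by warping the two decoded references toward the current frame with the estimated flows and blending them. A minimal Python sketch follows; the paper replaces the plain average with a learned bi-prediction network.

import numpy as np
from scipy.ndimage import map_coordinates

def warp(frame, flow):
    # Backward-warp a grayscale frame (H, W) with a dense flow field (H, W, 2).
    h, w = frame.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    return map_coordinates(frame, [ys + flow[..., 1], xs + flow[..., 0]],
                           order=1, mode='nearest')

def bi_predict(prev_ref, next_ref, flow_p, flow_n):
    # Blend the two motion-compensated references into one prediction.
    return 0.5 * (warp(prev_ref, flow_p) + warp(next_ref, flow_n))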