Early Action Prediction with Generative Adversarial Networks
Action prediction aims to determine what action is occurring in a video
as early as possible, which is crucial to many online applications, such as
predicting a traffic accident before it happens and detecting malicious actions
in a monitoring system. In this work, we address this problem by developing
an end-to-end architecture that improves the discriminability of features of
partially observed videos by assimilating them to features from complete
videos. For this purpose, a generative adversarial network is introduced to
tackle the action prediction problem; it improves the recognition accuracy of
partially observed videos by narrowing the feature difference between partially
observed videos and complete ones. Specifically, its generator comprises
two networks: a CNN for feature extraction and an LSTM for estimating residual
error between features of the partially observed videos and complete ones. The
CNN features are then added to the residual error from the LSTM, and the sum is
used as the enhanced feature to fool a competing discriminator. Meanwhile, the
generator is trained with an additional perceptual objective, which forces the
enhanced features of partially observed videos to be discriminative enough for
action prediction. Extensive experimental results on UCF101, BIT and
UT-Interaction datasets demonstrate that our approach outperforms the
state-of-the-art methods, especially for videos in which less than 50% of the
frames are observed.
Comment: IEEE Access
Detecting Adversarial Attacks on Neural Network Policies with Visual Foresight
Deep reinforcement learning has shown promising results in learning control
policies for complex sequential decision-making tasks. However, these neural
network-based policies are known to be vulnerable to adversarial examples. This
vulnerability poses a potentially serious threat to safety-critical systems
such as autonomous vehicles. In this paper, we propose a defense mechanism to
defend reinforcement learning agents from adversarial attacks by leveraging an
action-conditioned frame prediction module. Our core idea is that the
adversarial examples targeting a neural network-based policy are not
effective for the frame prediction model. By comparing the action distribution
produced by a policy from processing the current observed frame to the action
distribution produced by the same policy from processing the predicted frame
from the action-conditioned frame prediction module, we can detect the presence
of adversarial examples. Beyond detecting the presence of adversarial examples,
our method allows the agent to continue performing the task using the predicted
frame when the agent is under attack. We evaluate the performance of our
algorithm using five games in Atari 2600. Our results demonstrate that the
proposed defense mechanism achieves favorable performance against baseline
algorithms in detecting adversarial examples and in earning rewards when the
agents are under attack.
Comment: Project page: http://yclin.me/RL_attack_detection/ Code:
https://github.com/yenchenlin/rl-attack-detectio
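The detection rule described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not say how the two action distributions are compared, so the L1 distance, the `threshold` value, and the function names `l1_distance`, `detect_attack`, and `choose_action` are all assumptions made for this example.

```python
def l1_distance(p, q):
    """Sum of absolute differences between two action distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must cover the same action set")
    return sum(abs(a - b) for a, b in zip(p, q))


def detect_attack(dist_observed, dist_predicted, threshold=0.5):
    """Flag the observed frame when the two distributions disagree.

    dist_observed:  policy output on the current observed frame.
    dist_predicted: policy output on the frame produced by the
                    action-conditioned frame prediction module.
    threshold:      hypothetical tuning parameter (assumption, not from the paper).
    """
    return l1_distance(dist_observed, dist_predicted) > threshold


def choose_action(dist_observed, dist_predicted, threshold=0.5):
    """Pick the greedy action, falling back to the predicted frame under attack."""
    attacked = detect_attack(dist_observed, dist_predicted, threshold)
    dist = dist_predicted if attacked else dist_observed
    action = max(range(len(dist)), key=lambda i: dist[i])
    return action, attacked
```

In this sketch, a large disagreement between the two distributions both raises the alarm and switches the agent to the distribution computed from the predicted frame, which mirrors the two uses the abstract mentions (detection and continued task performance).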
Session-based Sequential Skip Prediction via Recurrent Neural Networks
The focus of the WSDM Cup 2019 is session-based sequential skip prediction, i.e.,
predicting whether users will skip tracks, given their immediately preceding
interactions in their listening session. This paper presents the solution of
our team ekffar to this challenge. We focus on
recurrent-neural-network-based deep learning approaches which have previously
been shown to perform well on session-based recommendation problems. We show
that by choosing an appropriate recurrent architecture that properly accounts
for the given information such as user interaction features and song metadata,
a single neural network could achieve a Mean Average Accuracy (AA) score of
0.648 on the withheld test data. Meanwhile, by ensembling several variants of
the core model, the overall recommendation accuracy can be improved even
further. By using the proposed approach, our team was able to attain the 1st
place in the competition. We have open-sourced our implementation on GitHub.
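The abstract reports a Mean Average Accuracy score without defining it. The sketch below assumes the per-session form AA = (1/T) * sum_i A(i) * L(i), where T is the number of tracks to predict in a session, A(i) is the accuracy over the first i predictions, and L(i) is 1 when the i-th prediction is correct; the mean is then taken over sessions. This definition and the names `average_accuracy` and `mean_average_accuracy` are assumptions for illustration, not taken from the paper.

```python
def average_accuracy(predictions, truths):
    """Average accuracy of one session under the assumed definition."""
    if len(predictions) != len(truths) or not truths:
        raise ValueError("need two non-empty sequences of equal length")
    correct_so_far = 0
    total = 0.0
    for i, (pred, truth) in enumerate(zip(predictions, truths), start=1):
        hit = int(pred == truth)               # L(i)
        correct_so_far += hit
        total += (correct_so_far / i) * hit    # A(i) * L(i)
    return total / len(truths)


def mean_average_accuracy(sessions):
    """Mean of the per-session scores; sessions is a list of (predictions, truths)."""
    if not sessions:
        raise ValueError("need at least one session")
    scores = [average_accuracy(p, t) for p, t in sessions]
    return sum(scores) / len(scores)
```

Under this assumed definition, early mistakes cost more than late ones, because every later A(i) carries the earlier error.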
Action-conditional Sequence Modeling for Recommendation
In many online applications, interactions between a user and a web service are
organized sequentially, e.g., a user browsing an e-commerce website. In
this setting, a recommendation system acts throughout user navigation by showing
items. Previous works have addressed this recommendation setup through the task
of predicting the next item the user will interact with. In particular, Recurrent
Neural Networks (RNNs) have been shown to achieve substantial improvements over
collaborative filtering baselines. In this paper, we consider interactions
triggered by the recommendations of a deployed recommender system in addition to
browsing behavior. Indeed, it is reported that in online services, interactions
with recommendations represent up to 30% of total interactions. Moreover, in
practice, a recommender system can greatly influence user behavior by promoting
specific items. In this paper, we extend the RNN modeling framework by taking
into account user interaction with recommended items. We propose and evaluate
RNN architectures that consist of the recommendation action module and the
state-action fusion module. Using real-world large-scale datasets, we
demonstrate improved performance on the next item prediction task compared to
the baselines.
Human Action Recognition and Prediction: A Survey
Driven by rapid advances in computer vision and machine learning, video
analysis tasks have been moving from inferring the present state to predicting
the future state. Vision-based action recognition and prediction from videos
are such tasks, where action recognition is to infer human actions (present
state) based upon complete action executions, and action prediction is to predict
human actions (future state) based upon incomplete action executions. These two
tasks have become particularly prevalent topics recently because of their
explosively emerging real-world applications, such as visual surveillance,
autonomous driving, entertainment, and video retrieval. Many
efforts have been made over the last few decades to build a robust
and effective framework for action recognition and prediction. In this paper,
we comprehensively survey the state-of-the-art techniques in action recognition
and prediction. Existing models, popular algorithms, technical difficulties,
popular action databases, evaluation protocols, and promising future directions
are also discussed systematically.
Shuffle and Learn: Unsupervised Learning using Temporal Order Verification
In this paper, we present an approach for learning a visual representation
from the raw spatiotemporal signals in videos. Our representation is learned
without supervision from semantic labels. We formulate our method as an
unsupervised sequential verification task, i.e., we determine whether a
sequence of frames from a video is in the correct temporal order. With this
simple task and no semantic labels, we learn a powerful visual representation
using a Convolutional Neural Network (CNN). The representation contains
complementary information to that learned from supervised image datasets like
ImageNet. Qualitative results show that our method captures information that is
temporally varying, such as human pose. When used as pre-training for action
recognition, our method gives significant gains over learning without external
data on benchmark datasets like UCF101 and HMDB51. To demonstrate its
sensitivity to human pose, we show results for pose estimation on the FLIC and
MPII datasets that are competitive with, or better than, approaches using
significantly more supervision. Our method can be combined with supervised
representations to provide an additional boost in accuracy.
Comment: Accepted at ECCV 201
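The temporal order verification task in this abstract amounts to building labeled frame tuples from unlabeled video. A minimal sketch of that data construction follows, under stated assumptions: the tuple length of 3, the 50/50 split between ordered and shuffled examples, uniform frame sampling, and the name `sample_order_example` are illustrative choices, not details from the paper.

```python
import random


def sample_order_example(num_frames, tuple_len=3, rng=random):
    """Return (frame_indices, label) for one verification example.

    label is 1 when the indices are in the correct temporal order and
    0 when they have been shuffled out of order.
    """
    if tuple_len < 2 or tuple_len > num_frames:
        raise ValueError("need 2 <= tuple_len <= num_frames")
    ordered = sorted(rng.sample(range(num_frames), tuple_len))
    if rng.random() < 0.5:
        return ordered, 1
    shuffled = ordered[:]
    while shuffled == ordered:  # make sure the negative really is out of order
        rng.shuffle(shuffled)
    return shuffled, 0


# Usage: draw one example from a 30-frame clip with a seeded generator.
frames, label = sample_order_example(30, rng=random.Random(0))
```

A classifier (a CNN in the paper) would then be trained to predict `label` from the frames at those indices; no semantic labels are involved, since the supervision comes only from frame order.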
Attentive Crowd Flow Machines
Traffic flow prediction is crucial for urban traffic management and public
safety. Its key challenge lies in how to adaptively integrate the various
factors that affect the flow changes. In this paper, we propose a unified
neural network module to address this problem, called Attentive Crowd Flow
Machine (ACFM), which is able to infer the evolution of the crowd flow by
learning dynamic representations of temporally-varying data with an attention
mechanism. Specifically, the ACFM is composed of two progressive ConvLSTM units
connected with a convolutional layer for spatial weight prediction. The first
LSTM takes the sequential flow density representation as input and generates a
hidden state at each time-step for attention map inference, while the second
LSTM aims at learning the effective spatial-temporal feature expression from
attentionally weighted crowd flow features. Based on the ACFM, we further build
a deep architecture with the application to citywide crowd flow prediction,
which naturally incorporates the sequential and periodic data as well as other
external influences. Extensive experiments on two standard benchmarks (i.e.,
crowd flow in Beijing and New York City) show that the proposed method achieves
significant improvements over the state-of-the-art methods.
Comment: ACM MM, full paper
Prediction and Description of Near-Future Activities in Video
Most of the existing works on human activity analysis focus on recognition or
early recognition of the activity labels from complete or partial observations.
Similarly, existing video captioning approaches focus on the observed events in
videos. Predicting the labels and the captions of future activities where no
frames of the predicted activities have been observed is a challenging problem,
with important applications that require anticipatory response. In this work,
we propose a system that can infer the labels and the captions of a sequence of
future activities. Our proposed network for label prediction of a future
activity sequence is similar to a hybrid Siamese network with three branches
where the first branch takes visual features from the objects present in the
scene, the second branch takes observed activity features and the third branch
captures the last observed activity features. The predicted labels and the
observed scene context are then mapped to meaningful captions using a
sequence-to-sequence learning-based method. Experiments on three challenging
activity analysis datasets and a video description dataset demonstrate that
both our label prediction framework and captioning framework outperform the
state of the art.
Comment: 14 pages, 4 figures, 14 tables
Saliency-based Sequential Image Attention with Multiset Prediction
Humans process visual scenes selectively and sequentially using attention.
Central to models of human visual attention is the saliency map. We propose a
hierarchical visual architecture that operates on a saliency map and uses a
novel attention mechanism to sequentially focus on salient regions and take
additional glimpses within those regions. The architecture is motivated by
human visual attention, and is used for multi-label image classification on a
novel multiset task, demonstrating that it achieves high precision and recall
while localizing objects with its attention. Unlike conventional multi-label
image classification models, the model supports multiset prediction due to a
reinforcement-learning-based training process that allows for arbitrary label
permutation and multiple instances per label.
Comment: To appear in Advances in Neural Information Processing Systems 30
(NIPS 2017)
Towards Automatic Learning of Procedures from Web Instructional Videos
The potential for agents, whether embodied or software, to learn by observing
other agents performing procedures involving objects and actions is rich.
Current research on automatic procedure learning heavily relies on action
labels or video subtitles, even during the evaluation phase, which makes such
methods infeasible in real-world scenarios. This leads to our question: can the
human-consensus structure of a procedure be learned from a large set of long,
unconstrained videos (e.g., instructional videos from YouTube) with only visual
evidence? To answer this question, we introduce the problem of procedure
segmentation--to segment a video procedure into category-independent procedure
segments. Given that no large-scale dataset is available for this problem, we
collect a large-scale procedure segmentation dataset with procedure segments
temporally localized and described; we use cooking videos and name the dataset
YouCook2. We propose a segment-level recurrent network for generating procedure
segments by modeling the dependencies across segments. The generated segments
can be used as pre-processing for other tasks, such as dense video captioning
and event parsing. We show in our experiments that the proposed model
outperforms competitive baselines in procedure segmentation.
Comment: AAAI 2018 Camera-ready version. See http://youcook2.eecs.umich.edu
for YouCook2 dataset