2,044 research outputs found
Flow-Guided Feature Aggregation for Video Object Detection
Extending state-of-the-art object detectors from image to video is
challenging. The accuracy of detection suffers from degenerated object
appearances in videos, e.g., motion blur, video defocus, rare poses, etc.
Existing work attempts to exploit temporal information on box level, but such
methods are not trained end-to-end. We present flow-guided feature aggregation,
an accurate and end-to-end learning framework for video object detection. It
leverages temporal coherence on feature level instead. It improves the
per-frame features by aggregation of nearby features along the motion paths,
and thus improves the video recognition accuracy. Our method significantly
improves upon strong single-frame baselines in ImageNet VID, especially for
more challenging fast moving objects. Our framework is principled, and on par
with the best engineered systems winning the ImageNet VID challenges 2016,
without additional bells-and-whistles. The proposed method, together with Deep
Feature Flow, powered the winning entry of ImageNet VID challenges 2017. The
code is available at
https://github.com/msracver/Flow-Guided-Feature-Aggregation
Deformable Object Tracking with Gated Fusion
The tracking-by-detection framework receives growing attentions through the
integration with the Convolutional Neural Networks (CNNs). Existing
tracking-by-detection based methods, however, fail to track objects with severe
appearance variations. This is because the traditional convolutional operation
is performed on fixed grids, and thus may not be able to find the correct
response while the object is changing pose or under varying environmental
conditions. In this paper, we propose a deformable convolution layer to enrich
the target appearance representations in the tracking-by-detection framework.
We aim to capture the target appearance variations via deformable convolution,
which adaptively enhances its original features. In addition, we also propose a
gated fusion scheme to control how the variations captured by the deformable
convolution affect the original appearance. The enriched feature representation
through deformable convolution facilitates the discrimination of the CNN
classifier on the target object and background. Extensive experiments on the
standard benchmarks show that the proposed tracker performs favorably against
state-of-the-art methods
News Session-Based Recommendations using Deep Neural Networks
News recommender systems are aimed to personalize users experiences and help
them to discover relevant articles from a large and dynamic search space.
Therefore, news domain is a challenging scenario for recommendations, due to
its sparse user profiling, fast growing number of items, accelerated item's
value decay, and users preferences dynamic shift. Some promising results have
been recently achieved by the usage of Deep Learning techniques on Recommender
Systems, specially for item's feature extraction and for session-based
recommendations with Recurrent Neural Networks. In this paper, it is proposed
an instantiation of the CHAMELEON -- a Deep Learning Meta-Architecture for News
Recommender Systems. This architecture is composed of two modules, the first
responsible to learn news articles representations, based on their text and
metadata, and the second module aimed to provide session-based recommendations
using Recurrent Neural Networks. The recommendation task addressed in this work
is next-item prediction for users sessions: "what is the next most likely
article a user might read in a session?" Users sessions context is leveraged by
the architecture to provide additional information in such extreme cold-start
scenario of news recommendation. Users' behavior and item features are both
merged in an hybrid recommendation approach. A temporal offline evaluation
method is also proposed as a complementary contribution, for a more realistic
evaluation of such task, considering dynamic factors that affect global
readership interests like popularity, recency, and seasonality. Experiments
with an extensive number of session-based recommendation methods were performed
and the proposed instantiation of CHAMELEON meta-architecture obtained a
significant relative improvement in top-n accuracy and ranking metrics (10% on
Hit Rate and 13% on MRR) over the best benchmark methods.Comment: Accepted for the Third Workshop on Deep Learning for Recommender
Systems - DLRS 2018, October 02-07, 2018, Vancouver, Canada.
https://recsys.acm.org/recsys18/dlrs
Fast recursive filters for simulating nonlinear dynamic systems
A fast and accurate computational scheme for simulating nonlinear dynamic
systems is presented. The scheme assumes that the system can be represented by
a combination of components of only two different types: first-order low-pass
filters and static nonlinearities. The parameters of these filters and
nonlinearities may depend on system variables, and the topology of the system
may be complex, including feedback. Several examples taken from neuroscience
are given: phototransduction, photopigment bleaching, and spike generation
according to the Hodgkin-Huxley equations. The scheme uses two slightly
different forms of autoregressive filters, with an implicit delay of zero for
feedforward control and an implicit delay of half a sample distance for
feedback control. On a fairly complex model of the macaque retinal horizontal
cell it computes, for a given level of accuracy, 1-2 orders of magnitude faster
than 4th-order Runge-Kutta. The computational scheme has minimal memory
requirements, and is also suited for computation on a stream processor, such as
a GPU (Graphical Processing Unit).Comment: 20 pages, 8 figures, 1 table. A comparison with 4th-order Runge-Kutta
integration shows that the new algorithm is 1-2 orders of magnitude faster.
The paper is in press now at Neural Computatio
NERV++: An Enhanced Implicit Neural Video Representation
Neural fields, also known as implicit neural representations (INRs), have
shown a remarkable capability of representing, generating, and manipulating
various data types, allowing for continuous data reconstruction at a low memory
footprint. Though promising, INRs applied to video compression still need to
improve their rate-distortion performance by a large margin, and require a huge
number of parameters and long training iterations to capture high-frequency
details, limiting their wider applicability. Resolving this problem remains a
quite challenging task, which would make INRs more accessible in compression
tasks. We take a step towards resolving these shortcomings by introducing
neural representations for videos NeRV++, an enhanced implicit neural video
representation, as more straightforward yet effective enhancement over the
original NeRV decoder architecture, featuring separable conv2d residual blocks
(SCRBs) that sandwiches the upsampling block (UB), and a bilinear interpolation
skip layer for improved feature representation. NeRV++ allows videos to be
directly represented as a function approximated by a neural network, and
significantly enhance the representation capacity beyond current INR-based
video codecs. We evaluate our method on UVG, MCL JVC, and Bunny datasets,
achieving competitive results for video compression with INRs. This achievement
narrows the gap to autoencoder-based video coding, marking a significant stride
in INR-based video compression research
- …