1,070 research outputs found
Delving Deeper into Convolutional Networks for Learning Video Representations
We propose an approach to learn spatio-temporal features in videos from
intermediate visual representations we call "percepts" using
Gated-Recurrent-Unit Recurrent Networks (GRUs). Our method relies on percepts
that are extracted from all levels of a deep convolutional network trained on
the large ImageNet dataset. While high-level percepts contain highly
discriminative information, they tend to have a low spatial resolution.
Low-level percepts, on the other hand, preserve a higher spatial resolution
from which we can model finer motion patterns. Using low-level percepts can
lead to high-dimensional video representations. To mitigate this effect and
control the number of model parameters, we introduce a variant of the GRU model
that leverages convolution operations to enforce sparse connectivity of the
model units and share parameters across the input spatial locations.
We empirically validate our approach on both Human Action Recognition and
Video Captioning tasks. In particular, we achieve results equivalent to the
state of the art on the YouTube2Text dataset using a simpler text-decoder model and
without extra 3D CNN features. Comment: ICLR 201
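
The parameter saving that the abstract attributes to the convolutional GRU variant can be illustrated with a rough count. The sketch below is not taken from the paper: it assumes a standard three-gate GRU (update, reset, candidate) with one bias per hidden unit or channel, and the function names, feature-map size, channel counts, and kernel size are hypothetical values chosen for illustration.

```python
def fc_gru_params(height, width, in_channels, hidden_channels):
    """Parameter count of a fully connected GRU applied to a flattened
    feature map (assumed layout: 3 gates, each with input weights,
    recurrent weights, and one bias per hidden unit)."""
    n_in = height * width * in_channels
    n_hid = height * width * hidden_channels
    per_gate = n_in * n_hid + n_hid * n_hid + n_hid
    return 3 * per_gate


def conv_gru_params(kernel_size, in_channels, hidden_channels):
    """Parameter count of a convolutional GRU with square kernels
    (assumed layout: 3 gates, each with an input kernel, a recurrent
    kernel, and one bias per hidden channel). The count does not depend
    on the spatial size because kernels are shared across locations."""
    k2 = kernel_size * kernel_size
    per_gate = (k2 * in_channels * hidden_channels
                + k2 * hidden_channels * hidden_channels
                + hidden_channels)
    return 3 * per_gate


if __name__ == "__main__":
    # Hypothetical low-level percept: 28x28 map, 256 channels in and out.
    fc = fc_gru_params(28, 28, 256, 256)
    conv = conv_gru_params(3, 256, 256)
    print(f"fully connected GRU: {fc:,} parameters")
    print(f"convolutional GRU:   {conv:,} parameters")
    print(f"ratio: {fc / conv:,.0f}x")
```

Under these assumptions the fully connected count grows with the square of the spatial size, while the convolutional count depends only on the kernel size and channel counts, which is the effect of sparse connectivity and parameter sharing described above.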
Predictive spatio-temporal modelling with neural networks
Hongbin Liu studied predictive spatio-temporal modelling using neural networks. Predictive spatio-temporal modelling is a challenging task due to complex non-linear spatio-temporal dependencies, data sparsity and uncertainty.
Hongbin Liu investigated the modelling difficulties and proposed three novel models to tackle them for three common types of spatio-temporal datasets. He also conducted extensive experiments on several real-world datasets for various spatio-temporal prediction tasks, such as travel mode classification, next-location prediction, weather forecasting and meteorological imagery prediction. The results show that the proposed models consistently achieve exceptional improvements over state-of-the-art baselines.
Learning Video Object Segmentation with Visual Memory
This paper addresses the task of segmenting moving objects in unconstrained
videos. We introduce a novel two-stream neural network with an explicit memory
module to achieve this. The two streams of the network encode spatial and
temporal features in a video sequence respectively, while the memory module
captures the evolution of objects over time. The module to build a "visual
memory" in video, i.e., a joint representation of all the video frames, is
realized with a convolutional recurrent unit learned from a small number of
training video sequences. Given a video frame as input, our approach assigns
each pixel an object or background label based on the learned spatio-temporal
features as well as the "visual memory" specific to the video, acquired
automatically without any manually-annotated frames. The visual memory is
implemented with convolutional gated recurrent units, which allow spatial
information to be propagated over time. We evaluate our method extensively on two
benchmarks, the DAVIS and Freiburg-Berkeley motion segmentation datasets, and show
state-of-the-art results. For example, our approach outperforms the top method
on the DAVIS dataset by nearly 6%. We also provide an extensive ablative
analysis to investigate the influence of each component in the proposed
framework.
Estimator: An Effective and Scalable Framework for Transportation Mode Classification over Trajectories
Transportation mode classification, the process of predicting the class
labels of moving objects' transportation modes, has been widely applied to a
variety of real-world applications, such as traffic management, urban
computing, and behavior study. However, existing studies of transportation mode
classification typically extract the explicit features of trajectory data but
fail to capture the implicit features that affect the classification
performance. In addition, most of the existing studies also prefer to apply
RNN-based models to embed trajectories, which are only suitable for classifying
small-scale data. To tackle the above challenges, we propose an effective and
scalable framework for transportation mode classification over GPS
trajectories, abbreviated Estimator. Estimator is established on a developed
CNN-TCN architecture, which is capable of leveraging the spatial and temporal
hidden features of trajectories to achieve high effectiveness and efficiency.
Estimator partitions the entire traffic space into disjoint spatial regions
according to traffic conditions, which enhances the scalability significantly
and thus enables parallel transportation classification. Extensive experiments
using eight public real-life datasets offer evidence that Estimator i) achieves
superior model effectiveness (i.e., 99% accuracy and 0.98 F1-score),
substantially outperforming the state of the art; ii) exhibits prominent model
efficiency, obtaining 7-40x speedups over state-of-the-art learning-based
methods; and iii) shows high model scalability and robustness that enable
large-scale classification analytics. Comment: 12 pages, 8 figures
Spatio-Temporal Image Boundary Extrapolation
Boundary prediction in images as well as video has been a very active topic
of research, and organizing visual information into boundaries and segments is
believed to be a cornerstone of visual perception. While prior work has
focused on predicting boundaries for observed frames, our work aims at
predicting boundaries of future unobserved frames. This requires our model to
learn about the fate of boundaries and extrapolate motion patterns. We
experiment on an established real-world video segmentation dataset, which provides
a testbed for this new task. We show for the first time spatio-temporal
boundary extrapolation in this challenging scenario. Furthermore, we show
long-term prediction of boundaries in situations where the motion is governed
by the laws of physics. We successfully predict boundaries in a billiard
scenario without any assumptions of a strong parametric model or any object
notion. We argue that our model has, with minimal model assumptions, derived
a notion of 'intuitive physics' that can be applied to novel scenes.