725 research outputs found
Unsupervised Learning for Optical Flow Estimation Using Pyramid Convolution LSTM
Most of current Convolution Neural Network (CNN) based methods for optical
flow estimation focus on learning optical flow on synthetic datasets with
groundtruth, which is not practical. In this paper, we propose an unsupervised
optical flow estimation framework named PCLNet. It uses pyramid Convolution
LSTM (ConvLSTM) with the constraint of adjacent frame reconstruction, which
allows flexibly estimating multi-frame optical flows from any video clip.
Besides, by decoupling motion feature learning and optical flow representation,
our method avoids complex short-cut connections used in existing frameworks
while improving accuracy of optical flow estimation. Moreover, different from
those methods using specialized CNN architectures for capturing motion, our
framework directly learns optical flow from the features of generic CNNs and
thus can be easily embedded in any CNN based frameworks for other tasks.
Extensive experiments have verified that our method not only estimates optical
flow effectively and accurately, but also obtains comparable performance on
action recognition.Comment: IEEE International Conference on Multimedia and Expo(ICME). 201
Self-supervised Visual Feature Learning with Deep Neural Networks: A Survey
Large-scale labeled data are generally required to train deep neural networks
in order to obtain better performance in visual feature learning from images or
videos for computer vision applications. To avoid extensive cost of collecting
and annotating large-scale datasets, as a subset of unsupervised learning
methods, self-supervised learning methods are proposed to learn general image
and video features from large-scale unlabeled data without using any
human-annotated labels. This paper provides an extensive review of deep
learning-based self-supervised general visual feature learning methods from
images or videos. First, the motivation, general pipeline, and terminologies of
this field are described. Then the common deep neural network architectures
that used for self-supervised learning are summarized. Next, the main
components and evaluation metrics of self-supervised learning methods are
reviewed followed by the commonly used image and video datasets and the
existing self-supervised visual feature learning methods. Finally, quantitative
performance comparisons of the reviewed methods on benchmark datasets are
summarized and discussed for both image and video feature learning. At last,
this paper is concluded and lists a set of promising future directions for
self-supervised visual feature learning
Future Semantic Segmentation with Convolutional LSTM
We consider the problem of predicting semantic segmentation of future frames
in a video. Given several observed frames in a video, our goal is to predict
the semantic segmentation map of future frames that are not yet observed. A
reliable solution to this problem is useful in many applications that require
real-time decision making, such as autonomous driving. We propose a novel model
that uses convolutional LSTM (ConvLSTM) to encode the spatiotemporal
information of observed frames for future prediction. We also extend our model
to use bidirectional ConvLSTM to capture temporal information in both
directions. Our proposed approach outperforms other state-of-the-art methods on
the benchmark dataset.Comment: Accepted to BMVC 201
Autonomous Driving with Deep Learning: A Survey of State-of-Art Technologies
Since DARPA Grand Challenges (rural) in 2004/05 and Urban Challenges in 2007,
autonomous driving has been the most active field of AI applications. Almost at
the same time, deep learning has made breakthrough by several pioneers, three
of them (also called fathers of deep learning), Hinton, Bengio and LeCun, won
ACM Turin Award in 2019. This is a survey of autonomous driving technologies
with deep learning methods. We investigate the major fields of self-driving
systems, such as perception, mapping and localization, prediction, planning and
control, simulation, V2X and safety etc. Due to the limited space, we focus the
analysis on several key areas, i.e. 2D and 3D object detection in perception,
depth estimation from cameras, multiple sensor fusion on the data, feature and
task level respectively, behavior modelling and prediction of vehicle driving
and pedestrian trajectories
Direct detection of pixel-level myocardial infarction areas via a deep-learning algorithm
Accurate detection of the myocardial infarction (MI) area is crucial for
early diagnosis planning and follow-up management. In this study, we propose an
end-to-end deep-learning algorithm framework (OF-RNN ) to accurately detect the
MI area at the pixel level. Our OF-RNN consists of three different function
layers: the heart localization layers, which can accurately and automatically
crop the region-of-interest (ROI) sequences, including the left ventricle,
using the whole cardiac magnetic resonance image sequences; the motion
statistical layers, which are used to build a time-series architecture to
capture two types of motion features (at the pixel-level) by integrating the
local motion features generated by long short-term memory-recurrent neural
networks and the global motion features generated by deep optical flows from
the whole ROI sequence, which can effectively characterize myocardial
physiologic function; and the fully connected discriminate layers, which use
stacked auto-encoders to further learn these features, and they use a softmax
classifier to build the correspondences from the motion features to the tissue
identities (infarction or not) for each pixel. Through the seamless connection
of each layer, our OF-RNN can obtain the area, position, and shape of the MI
for each patient. Our proposed framework yielded an overall classification
accuracy of 94.35% at the pixel level, from 114 clinical subjects. These
results indicate the potential of our proposed method in aiding standardized MI
assessments
Dual Motion GAN for Future-Flow Embedded Video Prediction
Future frame prediction in videos is a promising avenue for unsupervised
video representation learning. Video frames are naturally generated by the
inherent pixel flows from preceding frames based on the appearance and motion
dynamics in the video. However, existing methods focus on directly
hallucinating pixel values, resulting in blurry predictions. In this paper, we
develop a dual motion Generative Adversarial Net (GAN) architecture, which
learns to explicitly enforce future-frame predictions to be consistent with the
pixel-wise flows in the video through a dual-learning mechanism. The primal
future-frame prediction and dual future-flow prediction form a closed loop,
generating informative feedback signals to each other for better video
prediction. To make both synthesized future frames and flows indistinguishable
from reality, a dual adversarial training method is proposed to ensure that the
future-flow prediction is able to help infer realistic future-frames, while the
future-frame prediction in turn leads to realistic optical flows. Our dual
motion GAN also handles natural motion uncertainty in different pixel locations
with a new probabilistic motion encoder, which is based on variational
autoencoders. Extensive experiments demonstrate that the proposed dual motion
GAN significantly outperforms state-of-the-art approaches on synthesizing new
video frames and predicting future flows. Our model generalizes well across
diverse visual scenes and shows superiority in unsupervised video
representation learning.Comment: ICCV 17 camera read
Recurrent Flow-Guided Semantic Forecasting
Understanding the world around us and making decisions about the future is a
critical component to human intelligence. As autonomous systems continue to
develop, their ability to reason about the future will be the key to their
success. Semantic anticipation is a relatively under-explored area for which
autonomous vehicles could take advantage of (e.g., forecasting pedestrian
trajectories). Motivated by the need for real-time prediction in autonomous
systems, we propose to decompose the challenging semantic forecasting task into
two subtasks: current frame segmentation and future optical flow prediction.
Through this decomposition, we built an efficient, effective, low overhead
model with three main components: flow prediction network, feature-flow
aggregation LSTM, and end-to-end learnable warp layer. Our proposed method
achieves state-of-the-art accuracy on short-term and moving objects semantic
forecasting while simultaneously reducing model parameters by up to 95% and
increasing efficiency by greater than 40x.Comment: 10 pages, 5 figures, 8 tables, Accepted to WACV 201
Deep Learning for Physical Processes: Incorporating Prior Scientific Knowledge
We consider the use of Deep Learning methods for modeling complex phenomena
like those occurring in natural physical processes. With the large amount of
data gathered on these phenomena the data intensive paradigm could begin to
challenge more traditional approaches elaborated over the years in fields like
maths or physics. However, despite considerable successes in a variety of
application domains, the machine learning field is not yet ready to handle the
level of complexity required by such problems. Using an example application,
namely Sea Surface Temperature Prediction, we show how general background
knowledge gained from physics could be used as a guideline for designing
efficient Deep Learning models. In order to motivate the approach and to assess
its generality we demonstrate a formal link between the solution of a class of
differential equations underlying a large family of physical phenomena and the
proposed model. Experiments and comparison with series of baselines including a
state of the art numerical approach is then provided
A Survey on Deep Learning Methods for Robot Vision
Deep learning has allowed a paradigm shift in pattern recognition, from using
hand-crafted features together with statistical classifiers to using
general-purpose learning procedures for learning data-driven representations,
features, and classifiers together. The application of this new paradigm has
been particularly successful in computer vision, in which the development of
deep learning methods for vision applications has become a hot research topic.
Given that deep learning has already attracted the attention of the robot
vision community, the main purpose of this survey is to address the use of deep
learning in robot vision. To achieve this, a comprehensive overview of deep
learning and its usage in computer vision is given, that includes a description
of the most frequently used neural models and their main application areas.
Then, the standard methodology and tools used for designing deep-learning based
vision systems are presented. Afterwards, a review of the principal work using
deep learning in robot vision is presented, as well as current and future
trends related to the use of deep learning in robotics. This survey is intended
to be a guide for the developers of robot vision systems
Handcrafted Local Features are Convolutional Neural Networks
Image and video classification research has made great progress through the
development of handcrafted local features and learning based features. These
two architectures were proposed roughly at the same time and have flourished at
overlapping stages of history. However, they are typically viewed as distinct
approaches. In this paper, we emphasize their structural similarities and show
how such a unified view helps us in designing features that balance efficiency
and effectiveness. As an example, we study the problem of designing efficient
video feature learning algorithms for action recognition.
We approach this problem by first showing that local handcrafted features and
Convolutional Neural Networks (CNNs) share the same convolution-pooling network
structure. We then propose a two-stream Convolutional ISA (ConvISA) that adopts
the convolution-pooling structure of the state-of-the-art handcrafted video
feature with greater modeling capacities and a cost-effective training
algorithm. Through custom designed network structures for pixels and optical
flow, our method also reflects distinctive characteristics of these two data
sources.
Our experimental results on standard action recognition benchmarks show that
by focusing on the structure of CNNs, rather than end-to-end training methods,
we are able to design an efficient and powerful video feature learning
algorithm
- …