Semi-Supervised Domain Adaptation for Weakly Labeled Semantic Video Object Segmentation
Deep convolutional neural networks (CNNs) have been immensely successful in
many high-level computer vision tasks given large labeled datasets. However,
for video semantic object segmentation, a domain where labels are scarce,
effectively exploiting the representation power of CNN with limited training
data remains a challenge. Simply borrowing an existing pretrained CNN image
recognition model for the video segmentation task can severely hurt performance. We
propose a semi-supervised approach that adapts a CNN image recognition model
trained on labeled image data to the target domain, exploiting both the semantic
evidence learned by the CNN and the intrinsic structures of video data. By
explicitly modeling and compensating for the domain shift from the source
domain to the target domain, this proposed approach underpins a robust semantic
object segmentation method against the changes in appearance, shape and
occlusion in natural videos. We present extensive experiments on challenging
datasets that demonstrate the superior performance of our approach compared
with state-of-the-art methods.
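As a concrete illustration of this kind of adaptation, here is a minimal sketch (assumed names, not the authors' exact method) that combines a supervised loss on labeled source images with an unsupervised consistency term exploiting the temporal structure of unlabeled target video:

```python
# A minimal sketch of semi-supervised adaptation: supervised loss on the
# labeled source domain plus a temporal-consistency loss on unlabeled
# target frames. All names here are illustrative assumptions.
import torch
import torch.nn.functional as F

def adaptation_loss(model, src_imgs, src_masks, tgt_frame_t, tgt_frame_t1,
                    lam=0.1):
    # Supervised segmentation loss in the labeled source domain.
    src_logits = model(src_imgs)                  # (B, C, H, W)
    sup = F.cross_entropy(src_logits, src_masks)  # src_masks: (B, H, W)

    # Unsupervised term on unlabeled target video: predictions on
    # consecutive frames should agree, one simple way to exploit the
    # intrinsic temporal structure of video data.
    p_t  = F.softmax(model(tgt_frame_t),  dim=1)
    p_t1 = F.softmax(model(tgt_frame_t1), dim=1)
    consistency = F.mse_loss(p_t, p_t1)

    return sup + lam * consistency
```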
Retinal Vessel Segmentation under Extreme Low Annotation: A Generative Adversarial Network Approach
Contemporary deep learning based medical image segmentation algorithms
require hours of annotation labor by domain experts. These data-hungry deep
models perform sub-optimally in the presence of a limited amount of labeled data.
In this paper, we present a data efficient learning framework using the recent
concept of Generative Adversarial Networks; this allows a deep neural network
to perform significantly better than its fully supervised counterpart in low
annotation regime. The proposed method is an extension of our previous work
with the addition of a new unsupervised adversarial loss and a structured
prediction based architecture. To the best of our knowledge, this work is the
first demonstration of an adversarial framework based structured prediction
model for medical image segmentation. Though generic, we apply our method for
segmentation of blood vessels in retinal fundus images. We experiment with an
extremely low annotation budget (0.8-1.6% of the contemporary annotation size). On
DRIVE and STARE datasets, the proposed method outperforms our previous method
and other fully supervised benchmark models by significant margins, especially
with a very small number of annotated examples. In addition, our systematic
ablation studies suggest some key recipes for successfully training GAN based
semi-supervised algorithms with an encoder-decoder style network architecture.
Comment: First 3 authors contributed equally.
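To make the training recipe concrete, the following is a minimal sketch of a GAN-based semi-supervised segmentation objective of the kind described above; the segmenter S, discriminator D, loss weighting, and all names are illustrative assumptions, not the paper's exact formulation:

```python
# Semi-supervised adversarial segmentation sketch: a discriminator scores
# vessel maps; on unlabeled images the segmenter is trained so that its
# predictions are indistinguishable from expert annotations.
import torch
import torch.nn.functional as F

def discriminator_loss(D, real_masks, fake_masks):
    # D learns to tell expert annotations from segmenter outputs.
    r = D(real_masks)
    f = D(fake_masks.detach())  # do not backprop into the segmenter here
    return (F.binary_cross_entropy_with_logits(r, torch.ones_like(r))
            + F.binary_cross_entropy_with_logits(f, torch.zeros_like(f)))

def segmenter_loss(S, D, labeled_x, labeled_y, unlabeled_x, lam=0.05):
    # Supervised term on the few annotated images.
    sup = F.binary_cross_entropy_with_logits(S(labeled_x), labeled_y)
    # Unsupervised adversarial term: on unlabeled images, push the
    # segmenter's predictions toward the manifold of real annotations.
    pred_u = torch.sigmoid(S(unlabeled_x))
    d_out = D(pred_u)
    adv = F.binary_cross_entropy_with_logits(d_out, torch.ones_like(d_out))
    return sup + lam * adv
```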
Self-Learning for Player Localization in Sports Video
This paper introduces a novel self-learning framework that automates the
label acquisition process for improving models for detecting players in
broadcast footage of sports games. Unlike most previous self-learning
approaches for improving appearance-based object detectors from videos, we
allow an unknown, unconstrained number of target objects in a more generalized
video sequence with non-static camera views. Our self-learning approach uses a
latent SVM learning algorithm with deformable part models to represent the shape
and colour information of players, constrains their motions, and learns the
colour of the playing field with a gentle AdaBoost algorithm. We combine these
image cues to discover additional labels automatically from unlabelled data.
In our experiments, our approach exploits both labelled and unlabelled data in
sparsely labelled videos of sports games, yielding a mean improvement of over
20% in average precision for detecting sports players, as well as improved
tracking, when videos contain very few labelled images.
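The general self-learning loop can be sketched as follows; the detector interface and the playing-field test stand in for the paper's latent SVM/DPM detector and gentle AdaBoost colour model, and are assumptions for illustration only:

```python
# Generic self-training loop: mine confident detections from unlabelled
# frames, filter them with auxiliary cues, and retrain the detector.
def on_playing_field(frame, box):
    # Placeholder for the gentle-AdaBoost field-colour test; a real
    # implementation would classify the pixels around `box`.
    return True

def self_learning(detector, labeled, unlabeled_frames, rounds=3, thr=0.8):
    # `detector` is assumed to expose train() and detect() methods,
    # standing in for the latent SVM / DPM player model.
    detector.train(labeled)
    for _ in range(rounds):
        new_labels = []
        for frame in unlabeled_frames:
            for box, score in detector.detect(frame):
                # Keep confident detections that agree with auxiliary
                # cues (motion constraints, playing-field colour model).
                if score > thr and on_playing_field(frame, box):
                    new_labels.append((frame, box))
        # Retrain on the union of manual and discovered labels.
        detector.train(labeled + new_labels)
    return detector
```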
Multi-Stream Dynamic Video Summarization
With vast amounts of video content being uploaded to the Internet every
minute, video summarization becomes critical for efficient browsing, searching,
and indexing of visual content. Nonetheless, the spread of social and
egocentric cameras creates an abundance of sparse scenarios captured by several
devices that ultimately need to be jointly summarized. In this paper, we
discuss the problem of summarizing videos recorded simultaneously by several
dynamic cameras that intermittently share the field of view. We present a
robust framework that (a) identifies a diverse set of important events among
moving cameras that often are not capturing the same scene, and (b) selects the
most representative view(s) at each event to be included in a universal
summary. Due to the lack of an applicable alternative, we collected a new
multi-view egocentric dataset, Multi-Ego. Our dataset is recorded
simultaneously by three cameras, covering a wide variety of real-life
scenarios. The footage is annotated by multiple individuals under various
summarization configurations, with a consensus analysis ensuring a reliable
ground truth. We conduct extensive experiments on the compiled dataset in
addition to three other standard benchmarks that show the robustness and the
advantage of our approach in both supervised and unsupervised settings.
Additionally, we show that our approach learns collectively from data with varying
numbers of views and is orthogonal to other summarization methods, making it
scalable and generic. Our materials are made publicly available.
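One way to read the two-stage recipe, as a rough sketch with assumed inputs (importance scores, event features for diversity, and per-view representativeness scores), is a greedy diverse-event selection followed by a per-event view choice:

```python
# Greedy sketch of (a) diverse important-event selection across cameras
# and (b) picking the most representative view per chosen event.
import numpy as np

def summarize(event_scores, event_feats, view_scores, k=5):
    # event_scores: (E,)   importance of each candidate event
    # event_feats:  (E, D) features used to enforce diversity
    # view_scores:  (E, V) representativeness of each camera per event
    chosen = []
    for _ in range(k):
        best, best_val = None, -np.inf
        for e in range(len(event_scores)):
            if e in chosen:
                continue
            # Importance minus redundancy with already-chosen events.
            red = max((float(event_feats[e] @ event_feats[c])
                       for c in chosen), default=0.0)
            val = event_scores[e] - red
            if val > best_val:
                best, best_val = e, val
        chosen.append(best)
    # Most representative view at each chosen event.
    return [(e, int(np.argmax(view_scores[e]))) for e in chosen]
```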
Unsupervised Category Discovery via Looped Deep Pseudo-Task Optimization Using a Large Scale Radiology Image Database
Obtaining semantic labels on a large scale radiology image database (215,786
key images from 61,845 unique patients) is a prerequisite yet bottleneck to
train highly effective deep convolutional neural network (CNN) models for image
recognition. Nevertheless, conventional methods for collecting image labels
(e.g., Google search followed by crowd-sourcing) are not applicable due to the
formidable difficulties of medical annotation tasks for those who are not
clinically trained. This type of image labeling task remains non-trivial even
for radiologists due to uncertainty and possible drastic inter-observer
variation or inconsistency.
In this paper, we present a looped deep pseudo-task optimization procedure
for automatic category discovery of visually coherent and clinically semantic
(concept) clusters. Our system can be initialized by domain-specific (CNN
trained on radiology images and text report derived labels) or generic
(ImageNet based) CNN models. Afterwards, a sequence of pseudo-tasks is
exploited, alternating looped deep image feature clustering (to refine image labels)
with deep CNN training/classification using the new labels (to obtain more task-
representative deep features). Our method is conceptually simple and based on
the hypothesized "convergence" of better labels leading to better trained CNN
models which in turn feed more effective deep image features to facilitate more
meaningful clustering/labels. We have empirically validated the convergence and
demonstrated promising quantitative and qualitative results. Category labels of
significantly higher quality than those in previous work are discovered. This
allows for further investigation of the hierarchical semantic nature of the
given large-scale radiology image database.
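The loop itself is compact enough to sketch; the CNN interface, the use of k-means, and the convergence test below are illustrative assumptions standing in for the paper's procedure:

```python
# Condensed sketch of looped pseudo-task optimization: alternate between
# clustering deep features (to refine labels) and fine-tuning the CNN on
# those labels (to get more task-representative features).
import numpy as np
from sklearn.cluster import KMeans

def looped_pseudo_task(cnn, images, n_clusters, max_loops=10, tol=0.01):
    labels = None
    for _ in range(max_loops):
        feats = cnn.extract_features(images)              # (N, D)
        new_labels = KMeans(n_clusters=n_clusters).fit_predict(feats)
        # Stop once assignments (hypothesized to converge) stabilize.
        # Note: k-means IDs are arbitrary, so a faithful test would first
        # match cluster IDs across iterations (e.g., Hungarian matching).
        if labels is not None and np.mean(new_labels != labels) < tol:
            break
        labels = new_labels
        cnn.finetune(images, labels)  # new pseudo-task: classify clusters
    return labels
```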
MaskRNN: Instance Level Video Object Segmentation
Instance level video object segmentation is an important technique for video
editing and compression. To capture the temporal coherence, in this paper, we
develop MaskRNN, a recurrent neural net approach which fuses in each frame the
output of two deep nets for each object instance -- a binary segmentation net
providing a mask and a localization net providing a bounding box. Due to the
recurrent component and the localization component, our method is able to take
advantage of long-term temporal structures of the video data as well as to
reject outliers. We validate the proposed algorithm on three challenging
benchmark datasets, the DAVIS-2016 dataset, the DAVIS-2017 dataset, and the
SegTrack v2 dataset, achieving state-of-the-art performance on all of them.
Comment: Accepted to NIPS 2017.
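A simplified picture of the per-frame fusion (an assumption-laden sketch, not the released model): the localization net's bounding box can gate the segmentation net's probability map so that outlier responses outside the box are suppressed:

```python
# Gate a binary-segmentation probability map with a predicted box.
import torch

def fuse(mask_prob, box):
    # mask_prob: (H, W) probability map from the binary segmentation net
    # box: (x1, y1, x2, y2) integer corners from the localization net
    x1, y1, x2, y2 = box
    gate = torch.zeros_like(mask_prob)
    gate[y1:y2, x1:x2] = 1.0
    return mask_prob * gate  # suppress responses outside the box
```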
Yes, we GAN: Applying Adversarial Techniques for Autonomous Driving
Generative Adversarial Networks (GANs) have gained a great deal of popularity
since their introduction in 2014. Research on GANs is growing rapidly, and
there are many variants of the original GAN focusing on various aspects of deep
learning. GANs are perceived as the most impactful direction in machine learning
of the last decade. This paper focuses on the application of GANs in autonomous
driving including topics such as advanced data augmentation, loss function
learning, semi-supervised learning, etc. We formalize and review key
applications of adversarial techniques and discuss challenges and open problems
to be addressed.
Comment: Accepted for publication in Electronic Imaging, Autonomous Vehicles and Machines 2019. arXiv admin note: text overlap with arXiv:1606.05908 by other authors.
WoodScape: A multi-task, multi-camera fisheye dataset for autonomous driving
Fisheye cameras are commonly employed for obtaining a large field of view in
surveillance, augmented reality and in particular automotive applications. In
spite of their prevalence, there are few public datasets for detailed
evaluation of computer vision algorithms on fisheye images. We release the
first extensive fisheye automotive dataset, WoodScape, named after Robert Wood
who invented the fisheye camera in 1906. WoodScape comprises four surround-view
cameras and nine tasks including segmentation, depth estimation, 3D
bounding box detection and soiling detection. Semantic annotation of 40 classes
at the instance level is provided for over 10,000 images, and annotations for
the other tasks are provided for over 100,000 images. With WoodScape, we would like
to encourage the community to adapt computer vision models for fisheye cameras
instead of relying on naive rectification.
Comment: Accepted for Oral Presentation at IEEE International Conference on Computer Vision (ICCV) 2019. Please refer to our website https://woodscape.valeo.com and https://github.com/valeoai/woodscape for release status and updates.
Human-Centered Autonomous Vehicle Systems: Principles of Effective Shared Autonomy
Building effective, enjoyable, and safe autonomous vehicles is a lot harder
than has historically been considered. The reason is that, simply put, an
autonomous vehicle must interact with human beings. This interaction is not a
robotics problem nor a machine learning problem nor a psychology problem nor an
economics problem nor a policy problem. It is all of these problems put into
one. It challenges our assumptions about the limitations of human beings at
their worst and the capabilities of artificial intelligence systems at their
best. This work proposes a set of principles for designing and building
autonomous vehicles in a human-centered way that does not run away from the
complexity of human nature but instead embraces it. We describe our development
of the Human-Centered Autonomous Vehicle (HCAV) as an illustrative case study
of implementing these principles in practice.
Facial Landmark Detection: a Literature Survey
The locations of the fiducial facial landmark points around facial components
and facial contour capture the rigid and non-rigid facial deformations due to
head movements and facial expressions. They are hence important for various
facial analysis tasks. Many facial landmark detection algorithms have been
developed to automatically detect those key points over the years, and in this
paper, we perform an extensive review of them. We classify the facial landmark
detection algorithms into three major categories: holistic methods, Constrained
Local Model (CLM) methods, and regression-based methods. They differ in the
ways to utilize the facial appearance and shape information. The holistic
methods explicitly build models to represent the global facial appearance and
shape information. The CLMs explicitly leverage the global shape model but
build the local appearance models. The regression-based methods implicitly
capture facial shape and appearance information. For algorithms within each
category, we discuss their underlying theories as well as their differences. We
also compare their performances on both controlled and in-the-wild benchmark
datasets, under varying facial expressions, head poses, and occlusions. Based on
the evaluations, we point out their respective strengths and weaknesses. There
is also a separate section to review the latest deep learning-based algorithms.
The survey also includes a listing of the benchmark databases and existing
software. Finally, we identify future research directions, including combining
methods in different categories to leverage their respective strengths to solve
landmark detection "in-the-wild".
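As a flavour of the regression-based family, here is a toy cascade in the spirit of methods such as the Supervised Descent Method; the feature extractor and regressors are assumed callables, not any specific paper's implementation:

```python
# Cascaded regression for landmarks: each learned stage maps
# shape-indexed features to an update of the current shape estimate.
import numpy as np

def cascaded_regression(extract_feats, init_shape, regressors):
    # init_shape: (L, 2) array of initial landmarks (e.g., a mean shape)
    # extract_feats: callable mapping a shape to shape-indexed features
    # regressors: learned stages, each mapping features to an (L, 2) update
    shape = np.asarray(init_shape, dtype=float).copy()
    for R in regressors:
        phi = extract_feats(shape)  # features re-indexed at current shape
        shape = shape + R(phi)      # each stage refines the estimate
    return shape
```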