997 research outputs found
Semi-Supervised Domain Adaptation for Weakly Labeled Semantic Video Object Segmentation
Deep convolutional neural networks (CNNs) have been immensely successful in
many high-level computer vision tasks given large labeled datasets. However,
for video semantic object segmentation, a domain where labels are scarce,
effectively exploiting the representation power of CNNs with limited training
data remains a challenge. Simply borrowing an existing pretrained CNN image
recognition model for the video segmentation task can severely hurt performance. We
propose a semi-supervised approach to adapting a CNN image recognition model
trained on labeled image data to the target domain, exploiting both the semantic
evidence learned by the CNN and the intrinsic structure of video data. By
explicitly modeling and compensating for the domain shift from the source
domain to the target domain, the proposed approach underpins a semantic
object segmentation method that is robust to changes in appearance, shape, and
occlusion in natural videos. We present extensive experiments on challenging
datasets that demonstrate the superior performance of our approach compared
with state-of-the-art methods.
Universal Semi-Supervised Semantic Segmentation
In recent years, the need for semantic segmentation has arisen across several
different applications and environments. However, the expense and redundancy of
annotation often limits the quantity of labels available for training in any
domain, while deployment is easier if a single model works well across domains.
In this paper, we pose the novel problem of universal semi-supervised semantic
segmentation and propose a solution framework, to meet the dual needs of lower
annotation and deployment costs. In contrast to alternatives such as
fine-tuning, joint training, or unsupervised domain adaptation, universal
semi-supervised segmentation ensures that across all domains: (i) a single
model is deployed, (ii) unlabeled data is used, (iii) performance is improved,
(iv) only a few labels are needed and (v) label spaces may differ. To address
this, we minimize supervised as well as within and cross-domain unsupervised
losses, introducing a novel feature alignment objective based on pixel-aware
entropy regularization for the latter. We demonstrate quantitative advantages
over other approaches on several combinations of segmentation datasets across
different geographies (Germany, England, India) and environments (outdoors,
indoors), as well as qualitative insights on the aligned representations.
Comment: Accepted as poster presentation at ICCV 201
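The cross-domain feature alignment above rests on entropy regularization of per-pixel predictions. The core per-pixel entropy term can be sketched as follows; this is a minimal NumPy illustration, not the authors' implementation, and it omits the paper's pixel-aware weighting, which the abstract does not detail:

```python
import numpy as np

def pixelwise_entropy_loss(probs, eps=1e-8):
    """Mean per-pixel entropy of softmax outputs.

    probs: array of shape (H, W, C), rows summing to 1 over C.
    Minimizing this term pushes the network toward confident
    (low-entropy) predictions on unlabeled target-domain pixels.
    """
    entropy = -np.sum(probs * np.log(probs + eps), axis=-1)  # (H, W)
    return float(entropy.mean())

# Confident predictions yield a lower loss than uniform ones.
confident = np.zeros((2, 2, 3)); confident[..., 0] = 1.0
uniform = np.full((2, 2, 3), 1.0 / 3)
assert pixelwise_entropy_loss(confident) < pixelwise_entropy_loss(uniform)
```

For uniform predictions over C classes the per-pixel entropy is log C, so the loss here approaches log 3.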
Weakly Supervised Adversarial Domain Adaptation for Semantic Segmentation in Urban Scenes
Semantic segmentation, a pixel-level vision task, has developed rapidly with
convolutional neural networks (CNNs). Training CNNs requires a large
amount of labeled data, but manual annotation is laborious. To reduce the
annotation effort, several synthetic datasets have been released in recent
years. However, they still differ from real scenes, so a model trained on
synthetic data (the source domain) does not perform well on real urban scenes
(the target domain). In this paper, we propose a weakly supervised adversarial
domain adaptation method to improve segmentation performance from synthetic
data to real scenes, which consists of three deep
neural networks. Specifically, a detection and segmentation ("DS" for short)
model focuses on detecting objects and predicting a segmentation map; a
pixel-level domain classifier ("PDC" for short) tries to identify which domain
image features come from; an object-level domain classifier ("ODC" for short)
identifies which domain each object comes from and predicts its class. PDC and
ODC are treated as the discriminators, and DS is considered the generator.
Through adversarial learning, DS is encouraged to learn domain-invariant
features. In experiments, our proposed method sets a new state-of-the-art mIoU
on this problem.
Comment: To appear at TI
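The adversarial game between DS and the two domain classifiers reduces, at its core, to a binary cross-entropy over domain labels. A minimal sketch, assuming scalar classifier outputs and made-up values (the actual PDC/ODC operate on feature maps and objects):

```python
import numpy as np

def domain_bce(d_out, is_source):
    """Binary cross-entropy for a domain classifier output in (0, 1),
    where label 1 means "source domain"."""
    y = 1.0 if is_source else 0.0
    d = float(np.clip(d_out, 1e-7, 1 - 1e-7))
    return -(y * np.log(d) + (1 - y) * np.log(1 - d))

# Discriminator (PDC) step: label features by their true domain.
loss_pdc = domain_bce(0.9, is_source=True) + domain_bce(0.1, is_source=False)

# Generator (DS) step: target-domain features are trained to be
# classified as source, i.e. the domain label is flipped.
loss_ds = domain_bce(0.1, is_source=True)
```

Alternating these two updates is what drives DS toward domain-invariant features: the discriminators sharpen the domain boundary, and the generator is penalized whenever its target features remain distinguishable.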
Curriculum Domain Adaptation for Semantic Segmentation of Urban Scenes
During the last half decade, convolutional neural networks (CNNs) have
triumphed over semantic segmentation, which is one of the core tasks in many
applications such as autonomous driving. However, to train CNNs requires a
considerable amount of data, which is difficult to collect and laborious to
annotate. Recent advances in computer graphics make it possible to train CNNs
on photo-realistic synthetic imagery with computer-generated annotations.
Despite this, the domain mismatch between the real images and the synthetic
data cripples the models' performance. Hence, we propose a curriculum-style
learning approach to minimize the domain gap in urban scenery semantic
segmentation. The curriculum domain adaptation solves easy tasks first to infer
necessary properties about the target domain; in particular, the first task is
to learn global label distributions over images and local distributions over
landmark superpixels. These are easy to estimate because images of urban scenes
have strong idiosyncrasies (e.g., the size and spatial relations of buildings,
streets, cars, etc.). We then train a segmentation network while regularizing
its predictions in the target domain to follow those inferred properties. In
experiments, our method outperforms the baselines on two datasets and two
backbone networks. We also report extensive ablation studies about our
approach.
Comment: This is the extended version of the ICCV 2017 paper "Curriculum
Domain Adaptation for Semantic Segmentation of Urban Scenes" with additional
GTA experiments
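The first curriculum task, matching the global label distribution inferred for a target image, can be expressed as a divergence-based regularizer on the network's average prediction. A minimal sketch with hypothetical class proportions (the numbers and function name are illustrative, not the paper's):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) between two discrete label distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Global label distribution inferred for a target image by the easy
# task, e.g. road/building/car proportions (hypothetical numbers).
inferred = [0.5, 0.3, 0.2]
# Mean of the segmentation network's per-pixel predictions.
predicted = [0.2, 0.5, 0.3]
reg = kl_divergence(inferred, predicted)  # penalizes the mismatch
assert reg > kl_divergence(inferred, inferred)
```

Adding such a term to the segmentation loss regularizes the target-domain predictions to follow the properties inferred by the easier curriculum tasks.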
Large-Scale Object Discovery and Detector Adaptation from Unlabeled Video
We explore object discovery and detector adaptation based on unlabeled video
sequences captured from a mobile platform. We propose a fully automatic
approach for object mining from video which builds upon a generic object
tracking approach. By applying this method to three large video datasets from
autonomous driving and mobile robotics scenarios, we demonstrate its robustness
and generality. Based on the object mining results, we propose a novel approach
for unsupervised object discovery by appearance-based clustering. We show that
this approach successfully discovers interesting objects relevant to driving
scenarios. In addition, we perform self-supervised detector adaptation in order
to improve detection performance on the KITTI dataset for existing categories.
Our approach has direct relevance for enabling large-scale object learning for
autonomous driving.
Comment: CVPR'18 submission
Self-Transfer Learning for Fully Weakly Supervised Object Localization
Recent advances in deep learning have achieved remarkable performance in
various challenging computer vision tasks. Especially in object localization,
deep convolutional neural networks outperform traditional approaches by
extracting data- and task-driven features instead of hand-crafted features.
Although the location information of regions of interest (ROIs) gives a good
prior for object localization, it requires heavy annotation effort from human
annotators. Thus, a weakly supervised framework for object localization is
introduced. The term "weakly" means that this framework only uses image-level
labeled datasets to train a network. With the help of transfer learning which
adopts weight parameters of a pre-trained network, the weakly supervised
learning framework for object localization performs well because the
pre-trained network already has well-trained class-specific features. However,
those approaches cannot be used for applications that lack
pre-trained networks or large-scale images with localization labels. Medical
image analysis is representative of such applications, because it is
impossible to obtain such pre-trained networks there. In this work, we present
a "fully" weakly supervised framework for object localization ("semi"-weakly
is the counterpart that uses pre-trained filters for weakly supervised
localization), named self-transfer learning (STL). It jointly optimizes the
classification and localization networks. By controlling the supervision level
of the localization network, STL helps the localization network focus on
correct ROIs without any priors. We evaluate the proposed STL framework on two
medical image datasets, chest X-rays and mammograms, and achieve significantly
better localization performance than previous weakly supervised approaches.
Comment: 9 pages, 4 figures
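The key knob above, controlling the supervision level of the localization network, can be read as a weighting between the classification and localization losses. A hedged sketch; the weight `alpha`, its schedule, and the loss values are hypothetical, since the abstract does not spell them out:

```python
def stl_total_loss(cls_loss, loc_loss, alpha):
    """Joint objective for self-transfer learning, where alpha is a
    hypothetical knob shifting weight from the classification branch
    toward the localization branch."""
    return (1.0 - alpha) * cls_loss + alpha * loc_loss

# Early in training: rely mostly on image-level classification.
early = stl_total_loss(0.8, 2.0, alpha=0.1)
# Later: gradually hand more supervision to the localization branch.
late = stl_total_loss(0.8, 2.0, alpha=0.7)
```

Raising `alpha` over time lets the localization branch inherit the class-specific evidence the classifier has already learned before it dominates the objective.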
Label Efficient Learning of Transferable Representations across Domains and Tasks
We propose a framework that learns a representation transferable across
different domains and tasks in a label efficient manner. Our approach battles
domain shift with a domain adversarial loss, and generalizes the embedding to
novel tasks using a metric learning-based approach. Our model is simultaneously
optimized on labeled source data and unlabeled or sparsely labeled data in the
target domain. Our method shows compelling results on novel classes within a
new domain even when only a few labeled examples per class are available,
outperforming the prevalent fine-tuning approach. In addition, we demonstrate
the effectiveness of our framework on the transfer learning task from image
object recognition to video action recognition.
Comment: NIPS 201
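The metric learning-based generalization to novel classes with few labels can be sketched as nearest-prototype classification in the learned embedding space. This is a minimal stand-in, not the paper's exact head; the 2-D embeddings and class names are made up:

```python
import numpy as np

def nearest_prototype(query, prototypes):
    """Classify an embedding by its nearest class prototype, where
    each prototype is e.g. the mean embedding of a class's few
    labeled examples."""
    dists = {c: float(np.linalg.norm(query - p))
             for c, p in prototypes.items()}
    return min(dists, key=dists.get)

protos = {"cat": np.array([1.0, 0.0]), "dog": np.array([0.0, 1.0])}
assert nearest_prototype(np.array([0.9, 0.2]), protos) == "cat"
```

With a well-aligned embedding, a handful of labeled examples per novel class suffices to place usable prototypes, which is why this style of head outperforms fine-tuning in the low-label regime.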
Cross-Domain Complementary Learning Using Pose for Multi-Person Part Segmentation
Supervised deep learning with pixel-wise training labels has achieved great
success in multi-person part segmentation. However, data labeling at the pixel
level is very expensive. To address this, researchers have explored using
synthetic data to avoid manual labeling. Although it is easy to generate
labels for synthetic data, the results are much worse compared to those using
real data and manual labeling. The degradation of the performance is mainly due
to the domain gap, i.e., the discrepancy of the pixel value statistics between
real and synthetic data. In this paper, we observe that real and synthetic
humans both share a skeleton (pose) representation. We find that skeletons
can effectively bridge the synthetic and real domains during training. Our
proposed approach takes advantage of the rich and realistic variations of the
real data and the easily obtainable labels of the synthetic data to learn
multi-person part segmentation on real images without any human-annotated
labels. Through experiments, we show that without any human labeling, our
method performs comparably to several state-of-the-art approaches which require
human labeling on Pascal-Person-Parts and COCO-DensePose datasets. On the other
hand, if part labels are also available for the real images during training, our
method outperforms the supervised state-of-the-art methods by a large margin.
We further demonstrate the generalizability of our method on predicting novel
keypoints in real images where no real data labels are available for the novel
keypoints detection. Code and pre-trained models are available at
https://github.com/kevinlin311tw/CDCL-human-part-segmentation
Comment: To appear in IEEE Transactions on Circuits and Systems for Video
Technology; presented at the ICCV 2019 Demonstrations
OMNIA Faster R-CNN: Detection in the wild through dataset merging and soft distillation
Object detectors tend to perform poorly in new or open domains, and require
exhaustive yet costly annotations from fully labeled datasets. We aim at
benefiting from several datasets with different categories but without
additional labeling, not only to increase the number of categories detected,
but also to take advantage of transfer learning and to enhance domain
independence.
Our dataset merging procedure starts by training several initial Faster
R-CNN detectors on the different datasets while considering the complementary datasets'
images for domain adaptation. Similarly to self-training methods, the
predictions of these initial detectors mitigate the missing annotations on the
complementary datasets. The final OMNIA Faster R-CNN is trained with all
categories on the union of the datasets enriched by predictions. The joint
training handles unsafe targets with a new classification loss called SoftSig
in a softly supervised way.
Experimental results show that in the case of fashion detection for images in
the wild, merging Modanet with COCO increases the final performance from 45.5%
to 57.4% mAP. Applying our soft distillation to the task of detection with
domain shift between GTA and Cityscapes beats the state of the art by
5.3 points. Our methodology could unlock object detection for real-world
applications without immense datasets.
Comment: 9 pages, 5 figures, 4 tables
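SoftSig itself is not specified in the abstract, but the underlying idea of soft distillation, training the final detector against the soft predictions of the initial detectors, can be sketched as a soft-target cross-entropy. This is a generic distillation loss, not the paper's SoftSig, and the probabilities below are made up:

```python
import numpy as np

def soft_distillation_loss(student_probs, teacher_probs, eps=1e-8):
    """Cross-entropy of student class probabilities against the soft
    targets produced by an initial ("teacher") detector."""
    s = np.clip(np.asarray(student_probs, dtype=float), eps, 1.0)
    t = np.asarray(teacher_probs, dtype=float)
    return float(-np.sum(t * np.log(s)))

teacher = [0.7, 0.2, 0.1]        # prediction of an initial detector
good_student = [0.7, 0.2, 0.1]   # matches the soft target
bad_student = [0.1, 0.2, 0.7]    # contradicts it
assert soft_distillation_loss(good_student, teacher) < \
       soft_distillation_loss(bad_student, teacher)
```

Using soft targets rather than hard pseudo-labels lets uncertain predictions on the complementary datasets contribute gradient without being treated as ground truth.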
No More Discrimination: Cross City Adaptation of Road Scene Segmenters
Despite the recent success of deep-learning based semantic segmentation,
deploying a pre-trained road scene segmenter to a city whose images are not
present in the training set would not achieve satisfactory performance due to
dataset biases. Instead of collecting a large number of annotated images of
each city of interest to train or refine the segmenter, we propose an
unsupervised learning approach to adapt road scene segmenters across different
cities. By utilizing Google Street View and its time-machine feature, we can
collect unannotated images for each road scene at different times, so that the
associated static-object priors can be extracted accordingly. By advancing a
joint global and class-specific domain adversarial learning framework,
adaptation of pre-trained segmenters to that city can be achieved without the
need of any user annotation or interaction. We show that our method improves
the performance of semantic segmentation in multiple cities across continents,
while it performs favorably against state-of-the-art approaches requiring
annotated training data.
Comment: 13 pages, 10 figures