Curriculum Domain Adaptation for Semantic Segmentation of Urban Scenes
During the last half decade, convolutional neural networks (CNNs) have
triumphed over semantic segmentation, which is one of the core tasks in many
applications such as autonomous driving. However, training CNNs requires a
considerable amount of data, which is difficult to collect and laborious to
annotate. Recent advances in computer graphics make it possible to train CNNs
on photo-realistic synthetic imagery with computer-generated annotations.
Despite this, the domain mismatch between the real images and the synthetic
data cripples the models' performance. Hence, we propose a curriculum-style
learning approach to minimize the domain gap in urban scenery semantic
segmentation. The curriculum domain adaptation solves easy tasks first to infer
necessary properties about the target domain; in particular, the first task is
to learn global label distributions over images and local distributions over
landmark superpixels. These are easy to estimate because images of urban scenes
have strong idiosyncrasies (e.g., the size and spatial relations of buildings,
streets, cars, etc.). We then train a segmentation network while regularizing
its predictions in the target domain to follow those inferred properties. In
experiments, our method outperforms the baselines on two datasets and two
backbone networks. We also report extensive ablation studies of our approach.
Comment: This is the extended version of the ICCV 2017 paper "Curriculum Domain Adaptation for Semantic Segmentation of Urban Scenes" with additional GTA experiments.
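As a rough illustration of the regularization idea this abstract describes (not the authors' released code), the following PyTorch-style sketch penalizes the gap between a segmentation network's predicted class frequencies on a target-domain image and a label distribution inferred by the "easy" curriculum tasks; the function name, the KL form, and the way the distribution is supplied are assumptions.

```python
import torch
import torch.nn.functional as F

def label_distribution_regularizer(logits, inferred_dist, eps=1e-8):
    """Hypothetical sketch: push the predicted class frequencies of target-domain
    images toward label distributions inferred by the easy curriculum tasks.

    logits:        (B, C, H, W) segmentation scores on target-domain images
    inferred_dist: (B, C) per-image label distributions from the easy tasks
    """
    probs = F.softmax(logits, dim=1)          # per-pixel class probabilities
    pred_dist = probs.mean(dim=(2, 3))        # predicted class frequencies per image
    pred_dist = pred_dist.clamp_min(eps)
    inferred_dist = inferred_dist.clamp_min(eps)
    # KL(inferred || predicted), averaged over the batch
    return (inferred_dist * (inferred_dist.log() - pred_dist.log())).sum(dim=1).mean()

# usage sketch (lambda_reg and the source-domain loss are placeholders):
# loss = source_seg_loss + lambda_reg * label_distribution_regularizer(target_logits, target_dist)
```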
Improved Dropout for Shallow and Deep Learning
Dropout has achieved great success in training deep neural
networks by independently zeroing out the outputs of neurons at random. It has
also received a surge of interest for shallow learning, e.g., logistic
regression. However, the independent sampling for dropout could be suboptimal
for the sake of convergence. In this paper, we propose to use multinomial
sampling for dropout, i.e., sampling features or neurons according to a
multinomial distribution with different probabilities for different
features/neurons. To determine the optimal dropout probabilities, we analyze
shallow learning with multinomial dropout and establish the risk bound for
stochastic optimization. By minimizing a sampling dependent factor in the risk
bound, we obtain a distribution-dependent dropout with sampling probabilities
dependent on the second order statistics of the data distribution. To tackle
the issue of evolving distribution of neurons in deep learning, we propose an
efficient adaptive dropout (named evolutional dropout) that computes
the sampling probabilities on-the-fly from a mini-batch of examples. Empirical
studies on several benchmark datasets demonstrate that the proposed dropouts
achieve not only much faster convergence but also a smaller testing error than the standard dropout. For example, on the CIFAR-100 data, evolutional dropout achieves relative improvements of over 10% in prediction performance and over 50% in convergence speed compared to the standard dropout.
Comment: In NIPS 2016.
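A hedged sketch of the data-dependent multinomial dropout the abstract describes: sampling probabilities are computed on the fly from second-order statistics of a mini-batch, and the kept features are rescaled to stay unbiased in expectation. The NumPy code below is illustrative; the exact scaling and the choice of k are assumptions, not the paper's implementation.

```python
import numpy as np

def evolutional_dropout(X, k, rng=None):
    """Illustrative sketch of data-dependent multinomial dropout.

    X: (n, d) mini-batch of features or layer outputs (float)
    k: number of multinomial draws, controlling how many features survive
    The kept features are rescaled so the output is unbiased in expectation.
    """
    rng = rng or np.random.default_rng()
    # sampling probabilities from second-order statistics of the mini-batch
    p = np.sqrt((X ** 2).mean(axis=0))
    p = p / p.sum()
    # one vector of multinomial counts per example; count 0 means "dropped"
    counts = rng.multinomial(k, p, size=X.shape[0]).astype(X.dtype)
    return X * counts / (k * p + 1e-12)        # E[counts_i] = k * p_i, so E[output] = X

# usage sketch: H = evolutional_dropout(H, k=H.shape[1] // 2)
```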
PolarNet: An Improved Grid Representation for Online LiDAR Point Clouds Semantic Segmentation
The need for fine-grained perception in autonomous driving systems has
resulted in recently increased research on online semantic segmentation of
single-scan LiDAR. Despite the emerging datasets and technological
advancements, it remains challenging for three reasons: (1) the need for
near-real-time latency with limited hardware; (2) uneven or even long-tailed
distribution of LiDAR points across space; and (3) an increasing number of
extremely fine-grained semantic classes. In an attempt to jointly tackle all
the aforementioned challenges, we propose a new LiDAR-specific,
nearest-neighbor-free segmentation algorithm - PolarNet. Instead of using
common spherical or bird's-eye-view projection, our polar bird's-eye-view
representation balances the points across grid cells in a polar coordinate
system, indirectly aligning a segmentation network's attention with the
long-tailed distribution of the points along the radial axis. We find that our
encoding scheme greatly increases the mIoU in three drastically different
segmentation datasets of real urban LiDAR single scans while retaining near
real-time throughput.
Comment: Accepted by CVPR 2020; Code at
https://github.com/edwardzhou130/PolarSe
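To make the polar bird's-eye-view encoding concrete, here is a minimal NumPy sketch of assigning LiDAR points to a polar grid; the grid resolution and range below are placeholder assumptions rather than the paper's configuration.

```python
import numpy as np

def polar_bev_indices(points, r_max=50.0, grid=(480, 360)):
    """Assign LiDAR points (x, y, z, ...) to cells of a polar bird's-eye-view grid.

    Binning on (radius, azimuth) instead of (x, y) spreads the densely sampled
    near-range points over many cells and the sparse far-range points over few,
    roughly balancing points per cell. Grid size and range are placeholders.
    """
    x, y = points[:, 0], points[:, 1]
    r = np.sqrt(x ** 2 + y ** 2)               # radial distance from the sensor
    theta = np.arctan2(y, x)                   # azimuth in [-pi, pi)
    r_bins, a_bins = grid
    r_idx = np.clip((r / r_max * r_bins).astype(int), 0, r_bins - 1)
    a_idx = np.clip(((theta + np.pi) / (2 * np.pi) * a_bins).astype(int), 0, a_bins - 1)
    return np.stack([r_idx, a_idx], axis=1)    # (N, 2) cell indices per point
```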
Video Timeline Modeling For News Story Understanding
In this paper, we present a novel problem, namely video timeline modeling.
Our objective is to create a video-associated timeline from a set of videos
related to a specific topic, thereby facilitating the content and structure
understanding of the story being told. This problem has significant potential
in various real-world applications, such as news story summarization. To
bootstrap research in this area, we curate a realistic benchmark dataset,
YouTube-News-Timeline, consisting of over 12k timelines and 300k YouTube
news videos. Additionally, we propose a set of quantitative metrics as the
protocol to comprehensively evaluate and compare methodologies. With such a
testbed, we further develop and benchmark exploratory deep learning approaches
to tackle this problem. We anticipate that this exploratory work will pave the
way for further research in video timeline modeling. The assets are available via
https://github.com/google-research/google-research/tree/master/video_timeline_modeling.
Comment: Accepted as a spotlight by NeurIPS 2023, Track on Datasets and Benchmarks.
Unified Visual Relationship Detection with Vision and Language Models
This work focuses on training a single visual relationship detector
predicting over the union of label spaces from multiple datasets. Merging
labels spanning different datasets could be challenging due to inconsistent
taxonomies. The issue is exacerbated in visual relationship detection when
second-order visual semantics are introduced between pairs of objects. To
address this challenge, we propose UniVRD, a novel bottom-up method for Unified
Visual Relationship Detection by leveraging vision and language models (VLMs).
VLMs provide well-aligned image and text embeddings, where similar
relationships are optimized to be close to each other for semantic unification.
Our bottom-up design enables the model to enjoy the benefit of training with
both object detection and visual relationship datasets. Empirical results on
both human-object interaction detection and scene-graph generation demonstrate
the competitive performance of our model. UniVRD achieves 38.07 mAP on
HICO-DET, outperforming the current best bottom-up HOI detector by 14.26 mAP.
More importantly, we show that our unified detector performs as well as
dataset-specific models in mAP, and achieves further improvements when we scale
up the model. Our code will be made publicly available on GitHub.
Comment: Accepted to ICCV 2023. Code is available at
https://github.com/google-research/scenic/tree/main/scenic/projects/univr
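The core pattern of scoring relationships with aligned vision-and-language embeddings can be sketched as a cosine-similarity lookup between visual embeddings of object pairs and text embeddings of relationship phrases; the sketch below is a simplification with assumed shapes and a placeholder temperature, not the UniVRD architecture.

```python
import torch
import torch.nn.functional as F

def score_relationships(pair_embeddings, text_embeddings, temperature=0.07):
    """Cosine-similarity scores between visual embeddings of candidate
    (subject, object) pairs and text embeddings of relationship phrases
    such as "person riding bicycle".

    pair_embeddings: (P, D) visual embeddings, one per candidate pair
    text_embeddings: (R, D) text embeddings, one per relationship label
    Returns (P, R) logits; higher means a more likely relationship.
    """
    v = F.normalize(pair_embeddings, dim=-1)
    t = F.normalize(text_embeddings, dim=-1)
    return v @ t.T / temperature
```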
Spatiotemporally Discriminative Video-Language Pre-Training with Text Grounding
Most existing video-language pre-training methods focus on instance-level
alignment between video clips and captions via global contrastive learning but
neglect rich fine-grained local information, which is of importance to
downstream tasks requiring temporal localization and semantic reasoning. In
this work, we propose a simple yet effective video-language pre-training
framework, namely G-ViLM, to learn discriminative spatiotemporal features. Two
novel designs involving spatiotemporal grounding and temporal grouping promote
learning local region-noun alignment and temporal-aware features
simultaneously. Specifically, spatiotemporal grounding aggregates semantically
similar video tokens and aligns them with noun phrases extracted from the
caption to promote local region-noun correspondences. Moreover, temporal
grouping leverages cut-and-paste to manually create temporal scene changes and
then learns distinguishable features from different scenes. Comprehensive
evaluations demonstrate that G-ViLM performs favorably against existing
approaches on four representative downstream tasks, covering text-video
retrieval, video question answering, video action recognition and temporal
action localization. G-ViLM performs competitively on all evaluated tasks and
in particular achieves R@10 of 65.1 on zero-shot MSR-VTT retrieval, over 9%
higher than the state-of-the-art method.
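The cut-and-paste step behind temporal grouping can be sketched as splicing a segment from one video into another and recording per-frame group labels; the function below is an illustrative assumption about how such training clips might be constructed, not the authors' implementation.

```python
import torch

def cut_and_paste(video_a, video_b, t_start, t_len):
    """Splice a segment of video_b into video_a to fabricate a scene change.

    video_a, video_b: (T, C, H, W) frame tensors of the same length
    Returns the spliced clip and per-frame group labels (0 = from A, 1 = from B),
    which can supervise learning of temporally distinguishable features.
    """
    mixed = video_a.clone()
    mixed[t_start:t_start + t_len] = video_b[t_start:t_start + t_len]
    labels = torch.zeros(video_a.shape[0], dtype=torch.long)
    labels[t_start:t_start + t_len] = 1
    return mixed, labels
```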
Taming Encoder for Zero Fine-tuning Image Customization with Text-to-Image Diffusion Models
This paper proposes a method for generating images of customized objects
specified by users. The method is based on a general framework that bypasses
the lengthy optimization required by previous approaches, which often employ a
per-object optimization paradigm. Our framework adopts an encoder to capture
high-level identifiable semantics of objects, producing an object-specific
embedding with only a single feed-forward pass. The acquired object embedding
is then passed to a text-to-image synthesis model for subsequent generation. To
effectively blend an object-aware embedding space into a well-developed
text-to-image model under the same generation context, we investigate different
network designs and training strategies, and propose a simple yet effective
regularized joint training scheme with an object identity preservation loss.
Additionally, we propose a caption generation scheme that becomes a critical
piece in ensuring the object-specific embedding is faithfully reflected in the
generation process while preserving control and editing abilities. Once trained,
the network is able to produce diverse content and styles, conditioned on both
texts and objects. We demonstrate through experiments that our proposed method
is able to synthesize images with compelling output quality, appearance
diversity, and object fidelity, without the need for test-time optimization.
Systematic studies are also conducted to analyze our models, providing insights
for future work.
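The single-feed-forward conditioning pattern described above can be sketched as an encoder that maps an object image to one extra conditioning token appended to the text tokens of a text-to-image model; every module name and attribute below is hypothetical, not the paper's code.

```python
import torch
import torch.nn as nn

class ObjectConditioner(nn.Module):
    """Hypothetical sketch: map an object image to a single conditioning token
    and append it to the text tokens consumed by a text-to-image model."""

    def __init__(self, image_encoder, embed_dim):
        super().__init__()
        self.image_encoder = image_encoder                    # any frozen image backbone
        self.project = nn.Linear(image_encoder.out_dim, embed_dim)

    def forward(self, object_image, text_tokens):
        # a single feed-forward pass; no per-object test-time optimization
        obj_feat = self.image_encoder(object_image)           # (B, out_dim)
        obj_token = self.project(obj_feat).unsqueeze(1)       # (B, 1, embed_dim)
        return torch.cat([text_tokens, obj_token], dim=1)     # (B, L + 1, embed_dim)
```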
VideoGLUE: Video General Understanding Evaluation of Foundation Models
We evaluate existing foundation models' video understanding capabilities using
a carefully designed experiment protocol consisting of three hallmark tasks
(action recognition, temporal localization, and spatiotemporal localization),
eight datasets well received by the community, and four adaptation methods
tailoring a foundation model (FM) for a downstream task. Moreover, we propose a
scalar VideoGLUE score (VGS) to measure an FM's efficacy and efficiency when
adapting to general video understanding tasks. Our main findings are as
follows. First, task-specialized models significantly outperform the six FMs
studied in this work, in sharp contrast to what FMs have achieved in natural
language and image understanding. Second, video-native FMs, whose pretraining
data contains the video modality, are generally better than image-native FMs in
classifying motion-rich videos, localizing actions in time, and understanding a
video of more than one action. Third, the video-native FMs can perform well on
video tasks under light adaptations to downstream tasks (e.g., freezing the FM
backbones), while image-native FMs win in full end-to-end finetuning. The first
two observations reveal the need and tremendous opportunities to conduct
research on video-focused FMs, and the last confirms that both tasks and
adaptation methods matter when it comes to the evaluation of FMs.
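The abstract does not spell out how VGS is computed. Purely as a hypothetical illustration of folding efficacy and adaptation efficiency into one scalar (explicitly not the paper's definition), one could average per-task scores while discounting costly adaptations:

```python
def efficacy_efficiency_score(task_scores, adaptation_costs):
    """Hypothetical scalar combining efficacy and efficiency (NOT the paper's VGS).

    task_scores:      per-task metrics in [0, 1] for one foundation model
    adaptation_costs: matching adaptation costs in [0, 1], e.g. the fraction of
                      parameters tuned relative to full end-to-end finetuning
    """
    assert len(task_scores) == len(adaptation_costs)
    # reward high scores achieved with cheap adaptations
    weighted = [s * (1.0 - 0.5 * c) for s, c in zip(task_scores, adaptation_costs)]
    return sum(weighted) / len(weighted)
```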