Hierarchy Composition GAN for High-fidelity Image Synthesis
Despite the rapid progress of generative adversarial networks (GANs) in image synthesis in recent years, existing image synthesis approaches work in either the geometry domain or the appearance domain alone, which often introduces various synthesis artifacts. This paper presents an innovative Hierarchical Composition GAN (HIC-GAN) that incorporates image synthesis in both geometry and appearance domains into an end-to-end trainable network and achieves superior synthesis realism in both domains simultaneously. We design a hierarchical composition mechanism that is capable of learning realistic composition geometry and handling occlusions when multiple foreground objects are involved in image composition. In addition, we introduce a novel attention mask mechanism that guides the adaptation of foreground object appearance and also provides a better training reference for learning in the geometry domain. Extensive experiments on scene text image synthesis, portrait editing and indoor rendering tasks show that the proposed HIC-GAN achieves superior synthesis performance both qualitatively and quantitatively.
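
To make the geometry-plus-appearance idea concrete, below is a minimal PyTorch sketch of a composition module in that spirit: a spatial-transformer branch predicts an affine warp for the foreground (geometry), and an attention branch predicts a soft blending mask (appearance). The module, its layer sizes, and the single-object affine warp are illustrative assumptions, not HIC-GAN's actual architecture.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GeometryAppearanceComposer(nn.Module):
        # Illustrative toy module, not the HIC-GAN architecture: one branch
        # predicts an affine warp (geometry domain), the other a soft
        # attention mask for blending (appearance domain).
        def __init__(self):
            super().__init__()
            # Geometry branch: regress a 2x3 affine matrix from fg+bg.
            self.loc = nn.Sequential(
                nn.Conv2d(6, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 6),
            )
            # Initialise to the identity transform for stable training.
            self.loc[-1].weight.data.zero_()
            self.loc[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))
            # Appearance branch: predict a per-pixel blending mask.
            self.attn = nn.Sequential(
                nn.Conv2d(6, 16, 3, padding=1), nn.ReLU(),
                nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),
            )

        def forward(self, fg, bg):
            theta = self.loc(torch.cat([fg, bg], 1)).view(-1, 2, 3)
            grid = F.affine_grid(theta, fg.size(), align_corners=False)
            fg_warped = F.grid_sample(fg, grid, align_corners=False)
            mask = self.attn(torch.cat([fg_warped, bg], 1))
            return mask * fg_warped + (1 - mask) * bg  # composed image

In the full method such a composer would be trained adversarially against a discriminator, and the hierarchical mechanism would stack compositions so that later foreground objects can occlude earlier ones.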
Spectral Unsupervised Domain Adaptation for Visual Recognition
Unsupervised domain adaptation (UDA) aims to learn a well-performing model in an unlabeled target domain by leveraging labeled data from one or multiple related source domains. It remains a great challenge due to 1) the lack of annotations in the target domain and 2) the large discrepancy between the distributions of source and target data. We propose Spectral UDA (SUDA), an efficient yet effective UDA technique that works in the spectral space and is generic across different visual recognition tasks in detection, classification and segmentation. SUDA addresses the UDA challenges from two perspectives. First, it mitigates inter-domain discrepancies with a spectrum transformer (ST) that maps source and target images into the spectral space and learns to enhance domain-invariant spectra while suppressing domain-variant spectra simultaneously. To this end, we design a novel adversarial multi-head spectrum attention mechanism that leverages contextual information to identify domain-variant and domain-invariant spectra effectively. Second, it mitigates the lack of annotations in the target domain by introducing multi-view spectral learning, which aims to learn comprehensive yet confident target representations by maximizing the mutual information among multiple ST augmentations that capture different spectral views of each target sample. Extensive experiments over different visual tasks (e.g., detection, classification and segmentation) show that SUDA achieves superior accuracy, and that it is complementary to state-of-the-art UDA methods, yielding consistent performance boosts with little extra computation.
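
As a rough illustration of working in the spectral space, the sketch below moves an image into the frequency domain with an FFT and re-weights its amplitude spectrum with a learnable gate, so that training can suppress domain-variant frequencies and keep domain-invariant ones. This is a deliberate simplification: the gate stands in for the paper's spectrum transformer and multi-head spectrum attention, which this sketch does not implement.

    import torch
    import torch.nn as nn

    class SpectrumGate(nn.Module):
        # Simplified stand-in for a spectrum transformer: a learnable
        # per-frequency gate over the amplitude spectrum.
        def __init__(self, h, w):
            super().__init__()
            # One gate per rfft2 frequency bin; sigmoid(0) * 2 = 1, so the
            # module starts out as an identity mapping.
            self.gate = nn.Parameter(torch.zeros(h, w // 2 + 1))

        def forward(self, x):  # x: (B, C, H, W)
            spec = torch.fft.rfft2(x, norm="ortho")
            amp, phase = spec.abs(), spec.angle()
            amp = amp * torch.sigmoid(self.gate) * 2.0  # re-weight frequencies
            spec = torch.polar(amp, phase)              # rebuild complex spectrum
            return torch.fft.irfft2(spec, s=x.shape[-2:], norm="ortho")

Applying several differently-initialised gates to the same target image would yield multiple spectral "views" of the kind over which the mutual information objective could be maximised.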
Vision-Language Models for Vision Tasks: A Survey
Most visual recognition studies rely heavily on crowd-labelled data for training deep neural networks (DNNs), and they usually train a separate DNN for each visual recognition task, leading to a laborious and time-consuming visual recognition paradigm. To address these two challenges, Vision-Language Models (VLMs) have been intensively investigated recently; they learn rich vision-language correlations from web-scale image-text pairs that are almost infinitely available on the Internet, and they enable zero-shot predictions on various visual recognition tasks with a single VLM. This paper provides a systematic review of vision-language models for various visual recognition tasks, including: (1) the background that introduces the development of visual recognition paradigms; (2) the foundations of VLMs that summarize the widely-adopted network architectures, pre-training objectives, and downstream tasks; (3) the widely-adopted datasets in VLM pre-training and evaluations; (4) the review and categorization of existing VLM pre-training methods, VLM transfer learning methods, and VLM knowledge distillation methods; (5) the benchmarking, analysis and discussion of the reviewed methods; (6) several research challenges and potential research directions that could be pursued in future VLM studies for visual recognition. A project associated with this survey has been created at https://github.com/jingyi0000/VLM_survey
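
For readers new to the paradigm, the snippet below shows the canonical zero-shot recognition recipe with a public CLIP checkpoint via Hugging Face Transformers: class names are turned into text prompts, and the image is assigned to the prompt with the highest image-text similarity. The checkpoint name and the file example.jpg are placeholders; any CLIP-style VLM covered by the survey follows the same pattern.

    import torch
    from PIL import Image
    from transformers import CLIPModel, CLIPProcessor

    # Load a public CLIP checkpoint (one family of VLMs covered by the survey).
    model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
    processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

    labels = ["cat", "dog", "car"]
    prompts = [f"a photo of a {c}" for c in labels]  # simple prompt template
    image = Image.open("example.jpg")                # placeholder test image

    inputs = processor(text=prompts, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_image holds scaled image-text similarities; softmax turns
    # them into zero-shot class probabilities.
    probs = outputs.logits_per_image.softmax(dim=-1)
    print(dict(zip(labels, probs[0].tolist())))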
Domain Generalization via Balancing Training Difficulty and Model Capability
Domain generalization (DG) aims to learn domain-generalizable models from one or multiple source domains that can perform well in unseen target domains. Despite its recent progress, most existing work suffers from a misalignment between the difficulty level of training samples and the capability of the model at its current stage of training, leading to over-fitting or under-fitting in the trained generalization model. We design MoDify, a Momentum Difficulty framework that tackles the misalignment by balancing the seesaw between the model's capability and the samples' difficulty throughout the training process. MoDify consists of two novel designs that collaborate to combat the misalignment while learning domain-generalizable models. The first is MoDify-based Data Augmentation, which exploits an RGB Shuffle technique to generate difficulty-aware training samples on the fly. The second is MoDify-based Network Optimization, which dynamically schedules the training samples for balanced and smooth learning at appropriate difficulty. Without bells and whistles, a simple implementation of MoDify achieves superior performance across multiple benchmarks. In addition, MoDify can complement existing methods as a plug-in, and it is generic enough to work across different visual recognition tasks.
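
The two ingredients are simple enough to sketch. Below, rgb_shuffle is the augmentation named in the abstract, while MomentumDifficulty is a toy scheduler of my own devising that tracks an exponential moving average of the loss as a proxy for current model capability and defers samples that are currently too hard; the momentum and band values are illustrative, not the paper's.

    import torch

    def rgb_shuffle(img):
        # RGB Shuffle: randomly permute the colour channels of a (C, H, W)
        # tensor, altering appearance while preserving semantic content.
        return img[torch.randperm(3)]

    class MomentumDifficulty:
        # Toy difficulty scheduler (illustrative, not the paper's design):
        # an EMA of recent losses estimates what the model can currently
        # handle, and samples far above that level are deferred.
        def __init__(self, momentum=0.99, band=1.5):
            self.momentum, self.band = momentum, band
            self.ema = None  # running estimate of a "comfortable" loss

        def accept(self, sample_loss):
            if self.ema is None:
                self.ema = sample_loss
            keep = sample_loss <= self.band * self.ema  # not too hard yet?
            self.ema = (self.momentum * self.ema
                        + (1 - self.momentum) * sample_loss)
            return keep

During training one would apply rgb_shuffle to build augmented views, compute each sample's loss, and only back-propagate the samples the scheduler accepts.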
MLAN: Multi-Level Adversarial Network for Domain Adaptive Semantic Segmentation
Recent progress in domain adaptive semantic segmentation demonstrates the effectiveness of adversarial learning (AL) in unsupervised domain adaptation. However, most adversarial-learning-based methods align source and target distributions at a global image level but neglect the inconsistency around local image regions. This paper presents a novel multi-level adversarial network (MLAN) that aims to address inter-domain inconsistency at both the global image level and the local region level optimally. MLAN has two novel designs, namely, region-level adversarial learning (RL-AL) and co-regularized adversarial learning (CR-AL). Specifically, RL-AL models prototypical regional context-relations explicitly in the feature space of a labelled source domain and transfers them to an unlabelled target domain via adversarial learning. CR-AL fuses region-level AL and image-level AL optimally via mutual regularization. In addition, we design a multi-level consistency map that can guide domain adaptation effectively in both the input space (i.e., image-to-image translation) and the output space (i.e., self-training). Extensive experiments show that MLAN outperforms the state-of-the-art by a large margin consistently across multiple datasets.
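
To illustrate the difference between image-level and region-level alignment, the sketch below uses one fully-convolutional discriminator with heavy downsampling (one score per image-sized receptive field) and one with light downsampling (one score per local region), then mixes the two adversarial losses with a fixed weight. The networks and the scalar lam are my own simplifications; MLAN's co-regularization learns the fusion rather than fixing it.

    import torch
    import torch.nn as nn

    class ConvDiscriminator(nn.Module):
        # Fully-convolutional discriminator; `depth` controls the receptive
        # field: deep = roughly image-level scores, shallow = region-level.
        def __init__(self, in_ch, depth):
            super().__init__()
            layers, ch = [], in_ch
            for _ in range(depth):
                layers += [nn.Conv2d(ch, 64, 4, stride=2, padding=1),
                           nn.LeakyReLU(0.2)]
                ch = 64
            layers += [nn.Conv2d(ch, 1, 3, padding=1)]  # per-location score
            self.net = nn.Sequential(*layers)

        def forward(self, x):
            return self.net(x)

    bce = nn.BCEWithLogitsLoss()

    def multi_level_adv_loss(d_img, d_region, target_feats, lam=0.5):
        # Push target features to look "source-like" (label 1) at both
        # levels; lam is a fixed stand-in for learned co-regularization.
        img_logits = d_img(target_feats)
        reg_logits = d_region(target_feats)
        loss_img = bce(img_logits, torch.ones_like(img_logits))
        loss_reg = bce(reg_logits, torch.ones_like(reg_logits))
        return (1 - lam) * loss_img + lam * loss_reg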