Hierarchy Composition GAN for High-fidelity Image Synthesis
Despite the rapid progress of generative adversarial networks (GANs) in image
synthesis in recent years, existing image synthesis approaches work in either
the geometry domain or the appearance domain alone, which often introduces
various synthesis artifacts. This paper presents an innovative Hierarchical
Composition GAN (HIC-GAN) that incorporates image synthesis in geometry and
appearance domains into an end-to-end trainable network and achieves superior
synthesis realism in both domains simultaneously. We design a hierarchical
composition mechanism that learns realistic composition geometry and handles
occlusions when multiple foreground objects are involved in image composition.
In addition, we introduce a novel attention mask mechanism that guides the
adaptation of foreground object appearance and also provides a better training
reference for learning in the geometry
domain. Extensive experiments on scene text image synthesis, portrait editing
and indoor rendering tasks show that the proposed HIC-GAN achieves superior
synthesis performance qualitatively and quantitatively.
Comment: 11 pages, 8 figures
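The composition pipeline described above, geometry alignment of a foreground object followed by attention-guided appearance blending, can be pictured with a small PyTorch sketch. All module names, shapes, and the affine-warp parameterization below are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch: a geometry branch predicts an affine warp that places a
# foreground into the background; an appearance branch predicts an attention
# mask plus a residual colour adjustment and blends the result.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometryBranch(nn.Module):
    """Predicts a 2x3 affine warp from the concatenated foreground/background pair."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(6, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 6),
        )
        # Initialise to the identity transform so training starts from "no warp".
        self.encoder[-1].weight.data.zero_()
        self.encoder[-1].bias.data.copy_(torch.tensor([1., 0., 0., 0., 1., 0.]))

    def forward(self, fg, bg):
        theta = self.encoder(torch.cat([fg, bg], dim=1)).view(-1, 2, 3)
        grid = F.affine_grid(theta, fg.size(), align_corners=False)
        return F.grid_sample(fg, grid, align_corners=False)

class AppearanceBranch(nn.Module):
    """Predicts an attention mask and a residual adjustment for the warped foreground."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 4, 3, padding=1),  # 1 mask channel + 3 residual channels
        )

    def forward(self, warped_fg, bg):
        out = self.net(torch.cat([warped_fg, bg], dim=1))
        mask = torch.sigmoid(out[:, :1])
        adapted_fg = torch.tanh(warped_fg + out[:, 1:])
        return mask * adapted_fg + (1 - mask) * bg  # composite image

fg = torch.randn(2, 3, 64, 64)
bg = torch.randn(2, 3, 64, 64)
composite = AppearanceBranch()(GeometryBranch()(fg, bg), bg)
print(composite.shape)  # torch.Size([2, 3, 64, 64])
```

In the full method both branches would be trained adversarially end to end; the sketch only shows how the geometry and appearance domains can share one forward pass.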
Spatial-Aware GAN for Unsupervised Person Re-identification
Recent person re-identification research has achieved great success by
learning from a large number of labeled person images. However, the
learned models often experience significant performance drops when applied to
images collected in a different environment. Unsupervised domain adaptation
(UDA) has been investigated to mitigate this constraint, but most existing
systems adapt images at the pixel level only and ignore obvious discrepancies
at the spatial level. This paper presents an innovative UDA-based person
re-identification network that is capable of adapting images at both spatial
and pixel levels simultaneously. A novel disentangled cycle-consistency loss is
designed that guides the learning of spatial-level and pixel-level adaptation
in a collaborative manner. In addition, a multi-modal mechanism is incorporated
that generates images of different geometry views and augments training images
effectively. Extensive experiments over a number
of public datasets show that the proposed UDA network achieves superior person
re-identification performance as compared with the state-of-the-art.
Comment: Accepted to ICPR202
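As one concrete reading of the disentangled cycle-consistency idea, the toy sketch below splits the cycle into a spatial-level term (a learned warp and its inverse) and a pixel-level term (an appearance translator and its inverse). The interfaces W, W_inv, G, and G_inv are assumptions made for illustration, not the paper's actual networks.

```python
# Toy sketch of a cycle-consistency objective disentangled into spatial-level
# and pixel-level terms, under assumed module interfaces.
import torch
import torch.nn.functional as F

def disentangled_cycle_loss(x_src, W, W_inv, G, G_inv):
    """W/W_inv adapt layout (spatial level); G/G_inv adapt appearance (pixel level)."""
    spatially_adapted = W(x_src)              # adapt geometry / layout only
    pixel_adapted = G(spatially_adapted)      # then adapt appearance (style)
    # Pixel-level cycle: undo the appearance translation first.
    appearance_recovered = G_inv(pixel_adapted)
    loss_pixel = F.l1_loss(appearance_recovered, spatially_adapted)
    # Spatial-level cycle: undo the warp to recover the source layout.
    layout_recovered = W_inv(appearance_recovered)
    loss_spatial = F.l1_loss(layout_recovered, x_src)
    return loss_spatial + loss_pixel

# Toy usage with identity stand-ins for the four learned modules.
identity = lambda x: x
x = torch.randn(4, 3, 256, 128)  # a typical re-ID crop size
print(disentangled_cycle_loss(x, identity, identity, identity, identity))  # tensor(0.)
```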
Scene Text Synthesis for Efficient and Effective Deep Network Training
A large number of annotated training images is critical for training accurate
and robust deep network models, but collecting such images is often
time-consuming and costly. Image synthesis alleviates this constraint by
generating annotated training images automatically, which has attracted
increasing interest in recent deep learning research. We develop an innovative
image synthesis technique that
composes annotated training images by realistically embedding foreground
objects of interest (OOI) into background images. The proposed technique
consists of two key components that in principle boost the usefulness of the
synthesized images in deep network training. The first is context-aware
semantic coherence, which ensures that the OOI are placed in semantically
coherent regions within the background image. The second is harmonious
appearance adaptation, which ensures that the embedded OOI blend with the
surrounding background in terms of both geometry alignment and appearance
realism. The
proposed technique has been evaluated over two related but very different
computer vision challenges, namely, scene text detection and scene text
recognition. Experiments over a number of public datasets demonstrate the
effectiveness of the proposed image synthesis technique: training deep networks
on our synthesized images achieves similar or even better scene text detection
and recognition performance compared with using real images.
Comment: 8 pages, 5 figures
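The two components can be pictured with a toy NumPy sketch: choose a placement window whose pixels all carry an allowed semantic label, then roughly match the foreground's intensity statistics to the local background before pasting. The helper name, the seg_map convention, and the statistics-matching step are hypothetical simplifications of the technique described above.

```python
# Toy sketch of (1) context-aware semantic placement and (2) a crude appearance
# adaptation step before compositing a foreground object of interest (OOI).
import numpy as np

def place_ooi(background, seg_map, ooi, valid_labels, rng=None):
    rng = rng or np.random.default_rng()
    h, w = ooi.shape[:2]
    # 1) Context-aware semantic coherence: candidate top-left corners whose
    #    window is entirely covered by an allowed semantic label.
    ys, xs = np.where(np.isin(seg_map, valid_labels))
    candidates = [(y, x) for y, x in zip(ys, xs)
                  if y + h <= seg_map.shape[0] and x + w <= seg_map.shape[1]
                  and np.isin(seg_map[y:y + h, x:x + w], valid_labels).all()]
    if not candidates:
        return background, None
    y, x = candidates[rng.integers(len(candidates))]
    # 2) Harmonious appearance adaptation: match the foreground's mean/std of
    #    intensity to the local background patch (a crude colour-transfer proxy).
    patch = background[y:y + h, x:x + w].astype(np.float32)
    fg = ooi.astype(np.float32)
    fg = (fg - fg.mean()) / (fg.std() + 1e-6) * patch.std() + patch.mean()
    out = background.copy()
    out[y:y + h, x:x + w] = np.clip(fg, 0, 255).astype(background.dtype)
    return out, (y, x, h, w)   # composite image and annotation box
```

The returned box doubles as the automatic annotation, which is what makes such composites usable as labeled training data.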
General Neural Gauge Fields
Recent advances in neural fields, such as neural radiance fields, have
significantly pushed the boundary of scene representation learning. Aiming to
boost the computation efficiency and rendering quality of 3D scenes, a popular
line of research maps the 3D coordinate system to another measuring system,
e.g., 2D manifolds and hash tables, for modeling neural fields. The conversion
between coordinate systems is typically dubbed a gauge transformation, which is
usually a pre-defined mapping function, e.g., an orthogonal projection or a
spatial hash function. This raises a question: can we directly learn a desired
gauge
transformation along with the neural field in an end-to-end manner? In this
work, we extend this problem to a general paradigm with a taxonomy of discrete
& continuous cases, and develop an end-to-end learning framework to jointly
optimize the gauge transformation and neural fields. To counter the problem
that the learning of gauge transformations can collapse easily, we derive a
general regularization mechanism from the principle of information conservation
during the gauge transformation. To circumvent the high computation cost in
gauge learning with regularization, we directly derive an information-invariant
gauge transformation that preserves scene information inherently and yields
superior performance.
Comment: ICLR 202
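A hedged PyTorch sketch of the end-to-end setup: a small MLP plays the role of the learned gauge transformation (3D points to 2D coordinates), a neural field is defined over the transformed coordinates, and a reconstruction term stands in as a simple proxy for the anti-collapse, information-conservation regularizer that the paper derives. Everything below is an illustrative assumption, not the authors' formulation.

```python
# Jointly optimise a learned gauge transformation and a neural field, with a
# reconstruction term as a simple proxy regularizer against gauge collapse.
import torch
import torch.nn as nn

class LearnedGauge(nn.Module):
    def __init__(self, in_dim=3, out_dim=2, hidden=64):
        super().__init__()
        self.to_gauge = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                      nn.Linear(hidden, out_dim))
        self.from_gauge = nn.Sequential(nn.Linear(out_dim, hidden), nn.ReLU(),
                                        nn.Linear(hidden, in_dim))

    def forward(self, x):
        u = self.to_gauge(x)              # gauge transformation: 3D -> 2D
        x_rec = self.from_gauge(u)        # used only by the regularizer
        return u, x_rec

gauge = LearnedGauge()
field = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 4))  # e.g. RGB + density

x = torch.rand(1024, 3)                   # sampled 3D points
target = torch.rand(1024, 4)              # supervision from rendering, assumed given
u, x_rec = gauge(x)
render_loss = nn.functional.mse_loss(field(u), target)
anti_collapse = nn.functional.mse_loss(x_rec, x)   # information-conservation proxy
loss = render_loss + 0.1 * anti_collapse
loss.backward()
```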
Regularized Vector Quantization for Tokenized Image Synthesis
Quantizing images into discrete representations has been a fundamental
problem in unified generative modeling. Predominant approaches learn the
discrete representation either in a deterministic manner by selecting the
best-matching token or in a stochastic manner by sampling from a predicted
distribution. However, deterministic quantization suffers from severe codebook
collapse and misalignment with the inference stage, while stochastic
quantization suffers from low codebook utilization and a perturbed
reconstruction objective. This paper presents a regularized vector quantization
framework that mitigates the above issues effectively by applying
regularization from two
perspectives. The first is a prior distribution regularization which measures
the discrepancy between a prior token distribution and the predicted token
distribution to avoid codebook collapse and low codebook utilization. The
second is a stochastic mask regularization that introduces stochasticity during
quantization to strike a good balance between inference-stage misalignment and
an unperturbed reconstruction objective. In addition, we design a probabilistic
contrastive loss which serves as a calibrated metric to further mitigate the
perturbed reconstruction objective. Extensive experiments show that the
proposed quantization framework outperforms prevailing vector quantization
methods consistently across different generative models including
auto-regressive models and diffusion models.
Comment: Accepted to CVPR 202
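To make the two regularizers concrete, the sketch below adds them on top of a plain nearest-code quantization step: a KL term between a uniform prior and the average predicted token distribution, and a stochastic mask that quantizes a random subset of positions by sampling rather than argmax. This is my own illustration under assumed shapes, not the paper's implementation.

```python
# Sketch of regularized vector quantization: prior-distribution regularization
# plus a stochastic mask over deterministic vs. sampled token selection.
import torch
import torch.nn.functional as F

def regularized_quantize(z, codebook, mask_ratio=0.3, temperature=1.0):
    """z: (N, D) encoder features, codebook: (K, D) code embeddings."""
    logits = -torch.cdist(z, codebook)            # closer code -> higher logit
    probs = F.softmax(logits / temperature, dim=-1)

    # Prior distribution regularization: KL(uniform || mean predicted distribution),
    # pushing average token usage toward the uniform prior to avoid collapse.
    mean_usage = probs.mean(dim=0)
    uniform = torch.full_like(mean_usage, 1.0 / codebook.size(0))
    prior_reg = F.kl_div(mean_usage.log(), uniform, reduction="sum")

    # Deterministic (argmax) vs. stochastic (sampled) token indices.
    det_idx = logits.argmax(dim=-1)
    sto_idx = torch.multinomial(probs, 1).squeeze(-1)

    # Stochastic mask regularization: a random subset of positions uses sampling.
    use_stochastic = torch.rand(z.size(0), device=z.device) < mask_ratio
    idx = torch.where(use_stochastic, sto_idx, det_idx)

    z_q = codebook[idx]
    z_q = z + (z_q - z).detach()                  # straight-through estimator
    return z_q, idx, prior_reg

z = torch.randn(16, 64)
codebook = torch.randn(512, 64)
z_q, idx, prior_reg = regularized_quantize(z, codebook)
```

The mask_ratio knob is the lever the abstract alludes to: a higher ratio reduces the train/inference mismatch of purely deterministic selection, while a lower ratio keeps the reconstruction objective closer to unperturbed.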