DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text Spotting
End-to-end text spotting aims to integrate scene text detection and
recognition into a unified framework. Dealing with the relationship between the
two sub-tasks plays a pivotal role in designing effective spotters. Although
transformer-based methods eliminate the heuristic post-processing, they still
suffer from the synergy issue between the sub-tasks and low training
efficiency. In this paper, we present DeepSolo, a simple detection transformer
baseline that lets a single Decoder with Explicit Points Solo for text
detection and recognition simultaneously. Technically, for each text instance,
we represent the character sequence as ordered points and model them with
learnable explicit point queries. After passing a single decoder, the point
queries have encoded requisite text semantics and locations and thus can be
further decoded to the center line, boundary, script, and confidence of text
via very simple prediction heads in parallel, solving the sub-tasks in text
spotting in a unified framework. Besides, we also introduce a text-matching
criterion to deliver more accurate supervisory signals, thus enabling more
efficient training. Quantitative experiments on public benchmarks demonstrate
that DeepSolo outperforms previous state-of-the-art methods and achieves better
training efficiency. In addition, DeepSolo is also compatible with line
annotations, which require much less annotation cost than polygons. The code
will be released.
Comment: The code will be available at https://github.com/ViTAE-Transformer/DeepSol
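The abstract's core idea, ordered point queries for each text instance passing through a single decoder and then feeding several simple prediction heads in parallel, can be sketched roughly as below. This is a minimal illustration, not the authors' implementation: the module names, the number of points, and the head designs are all assumptions.

```python
# Hypothetical sketch of DeepSolo-style explicit point queries with
# parallel prediction heads. All names and sizes here are illustrative
# assumptions, not the paper's actual implementation.
import torch
import torch.nn as nn

class PointQuerySpotter(nn.Module):
    def __init__(self, num_points=25, d_model=256, num_chars=97):
        super().__init__()
        # One learnable query per ordered point along the text center line.
        self.point_queries = nn.Parameter(torch.randn(num_points, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        # Very simple heads decoding each point query in parallel.
        self.center_head = nn.Linear(d_model, 2)        # (x, y) on center line
        self.boundary_head = nn.Linear(d_model, 4)      # offsets to text boundary
        self.char_head = nn.Linear(d_model, num_chars)  # character classification
        self.conf_head = nn.Linear(d_model, 1)          # instance confidence

    def forward(self, image_memory):
        # image_memory: (B, HW, d_model) flattened encoder features.
        b = image_memory.size(0)
        q = self.point_queries.unsqueeze(0).expand(b, -1, -1)
        h = self.decoder(q, image_memory)  # (B, num_points, d_model)
        return {
            "center": self.center_head(h).sigmoid(),   # normalized coordinates
            "boundary": self.boundary_head(h),
            "chars": self.char_head(h),
            "conf": self.conf_head(h).mean(dim=1),     # pooled over points
        }

model = PointQuerySpotter()
mem = torch.randn(2, 64, 256)  # dummy encoder memory for two images
out = model(mem)
print(out["center"].shape, out["chars"].shape)
```

Because every head reads the same decoded point queries, detection (center line, boundary) and recognition (character classes) are solved in one shared pass, which is the "solo" decoder idea the abstract describes.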
Hierarchy Composition GAN for High-fidelity Image Synthesis
Despite the rapid progress of generative adversarial networks (GANs) in image
synthesis in recent years, the existing image synthesis approaches work in
either geometry domain or appearance domain alone which often introduces
various synthesis artifacts. This paper presents an innovative Hierarchical
Composition GAN (HIC-GAN) that incorporates image synthesis in geometry and
appearance domains into an end-to-end trainable network and achieves superior
synthesis realism in both domains simultaneously. We design an innovative
hierarchical composition mechanism that is capable of learning realistic
composition geometry and handling occlusions while multiple foreground objects
are involved in image composition. In addition, we introduce a novel attention
mask mechanism that guides the adaptation of foreground object appearance and
also provides a better training reference for learning in the geometry domain.
Extensive experiments on scene text image synthesis, portrait editing
and indoor rendering tasks show that the proposed HIC-GAN achieves superior
synthesis performance qualitatively and quantitatively.Comment: 11 pages, 8 figure