Decoupled Attention Network for Text Recognition
Text recognition has attracted considerable research interest because of its
various applications. The cutting-edge text recognition methods are based on
attention mechanisms. However, most attention methods suffer from serious
alignment problems because of their recurrent alignment operation, in which the
alignment relies on historical decoding results. To remedy this issue, we
propose a decoupled attention network (DAN), which decouples the alignment
operation from using historical decoding results. DAN is an effective, flexible
and robust end-to-end text recognizer, which consists of three components: 1) a
feature encoder that extracts visual features from the input image; 2) a
convolutional alignment module that performs the alignment operation based on
visual features from the encoder; and 3) a decoupled text decoder that makes
the final prediction by jointly using the feature map and attention maps.
Experimental results show that DAN achieves state-of-the-art performance on
multiple text recognition tasks, including offline handwritten text recognition
and regular/irregular scene text recognition.
Comment: 9 pages, 8 figures, 6 tables, accepted by AAAI-202
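The abstract says the decoder jointly uses the feature map and the attention maps but does not give the formula. A minimal sketch of one common way to combine them follows, assuming each decoding step takes an attention-weighted sum of the feature map. The function name `context_vectors`, the shapes, and the toy values are illustrative assumptions, not the authors' implementation.

```python
from typing import List

Vector = List[float]
FeatureMap = List[List[Vector]]    # H x W x C visual features from the encoder
AttentionMap = List[List[float]]   # H x W weights for one decoding step


def context_vectors(feature_map: FeatureMap,
                    attention_maps: List[AttentionMap]) -> List[Vector]:
    """Return one context vector per decoding step (assumed weighted sum).

    The attention maps are taken as given inputs: they depend only on visual
    features, so no step needs the previously decoded characters.
    """
    channels = len(feature_map[0][0])
    contexts = []
    for attn in attention_maps:
        ctx = [0.0] * channels
        for y, row in enumerate(feature_map):
            for x, feat in enumerate(row):
                weight = attn[y][x]
                for c in range(channels):
                    ctx[c] += weight * feat[c]
        contexts.append(ctx)
    return contexts


# Toy 1 x 3 feature map with 2 channels, and two decoding steps.
features = [[[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]]
attention = [
    [[1.0, 0.0, 0.0]],  # step 0 attends only to the first column
    [[0.0, 0.5, 0.5]],  # step 1 splits attention over the last two columns
]
contexts = context_vectors(features, attention)
```

Because every attention map is available before decoding starts, all context vectors can be computed in one pass; a text decoder would then consume them step by step.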
Independent language modeling architecture for end-to-end ASR
The attention-based end-to-end (E2E) automatic speech recognition (ASR)
architecture allows for joint optimization of acoustic and language models
within a single network. However, in a vanilla E2E ASR architecture, the
decoder sub-network (subnet), which incorporates the role of the language model
(LM), is conditioned on the encoder output. This means that the acoustic
encoder and the language model are entangled, which prevents the language model
from being trained separately on external text data. To address this problem, in
this work, we propose a new architecture that separates the decoder subnet from
the encoder output. In this way, the decoupled subnet becomes an independently
trainable LM subnet, which can easily be updated using the external text data.
We study two strategies for updating the new architecture. Experimental results
show that 1) the independent LM architecture benefits from external text data,
achieving 9.3% and 22.8% relative character and word error rate reductions on
the Mandarin HKUST and English NSC datasets, respectively; and 2) the proposed
architecture works well with an external LM and generalizes to different
amounts of labelled data.
Hierarchical Cross-Modal Talking Face Generation with Dynamic Pixel-Wise Loss
We devise a cascade GAN approach to generate talking face video, which is
robust to different face shapes, view angles, facial characteristics, and noisy
audio conditions. Instead of learning a direct mapping from audio to video
frames, we propose first to transfer audio to high-level structure, i.e., the
facial landmarks, and then to generate video frames conditioned on the
landmarks. Compared to a direct audio-to-image approach, our cascade approach
avoids fitting spurious correlations between audiovisual signals that are
irrelevant to the speech content. Humans are sensitive to temporal
discontinuities and subtle artifacts in video. To avoid such pixel-jittering
problems and to force the network to focus on audiovisual-correlated regions,
we propose a novel dynamically adjustable pixel-wise loss with an attention
mechanism. Furthermore, to generate a sharper image with well-synchronized
facial movements, we propose a novel regression-based discriminator structure,
which considers sequence-level information along with frame-level information.
Experiments on several datasets and real-world samples demonstrate that our
method obtains significantly better results than state-of-the-art methods in
both quantitative and qualitative comparisons.
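The abstract does not spell out the form of the dynamically adjustable pixel-wise loss. As a rough illustration only, the sketch below assumes an L1 reconstruction error scaled per pixel by an attention weight, so that regions with higher attention contribute more. The name `attention_weighted_l1` and the toy images are hypothetical and not taken from the paper.

```python
from typing import List

Image = List[List[float]]  # H x W grayscale values


def attention_weighted_l1(pred: Image, target: Image, attn: Image) -> float:
    """Mean over pixels of attn * |pred - target| (an assumed loss form)."""
    total = 0.0
    count = 0
    for p_row, t_row, a_row in zip(pred, target, attn):
        for p, t, a in zip(p_row, t_row, a_row):
            total += a * abs(p - t)
            count += 1
    return total / count


pred = [[0.0, 1.0], [1.0, 0.0]]
target = [[0.0, 0.0], [0.0, 0.0]]

# Attention on the top row only: the error in the bottom row is ignored.
focused = attention_weighted_l1(pred, target, [[1.0, 1.0], [0.0, 0.0]])
# Uniform attention reduces to a plain mean absolute error.
uniform = attention_weighted_l1(pred, target, [[1.0, 1.0], [1.0, 1.0]])
```

With the focused map, only the error inside the attended region counts, which is the intuition behind steering the network toward audiovisual-correlated regions.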