AON: Towards Arbitrarily-Oriented Text Recognition
Recognizing text from natural images is an active research topic in computer
vision due to its wide range of applications. Despite decades of research on
optical character recognition (OCR), recognizing text in natural images
remains challenging, because scene text often appears in irregular
arrangements (e.g., curved, arbitrarily oriented, or severely distorted) that
have not yet been well addressed in the literature. Existing text recognition
methods mainly handle regular (horizontal and frontal) text and cannot be
trivially generalized to irregular text.
In this paper, we develop the arbitrary orientation network (AON) to directly
capture the deep features of irregular text, which are then fed into an
attention-based decoder to generate the character sequence. The whole network
can be trained end-to-end using only images and word-level annotations.
Extensive experiments on various benchmarks, including the CUTE80,
SVT-Perspective, IIIT5k, SVT, and ICDAR datasets, show that the proposed
AON-based method achieves state-of-the-art performance on irregular datasets
and is comparable to major existing methods on regular datasets.
Comment: Accepted by CVPR 2018
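To make the attention-based decoding concrete, here is a minimal PyTorch sketch of a decoder that attends over a sequence of visual features and emits one character per step. It is not the authors' implementation: the AON feature extraction is abstracted into a generic feature tensor, and the dimensions, GRU cell, and greedy decoding loop are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AttnDecoder(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, vocab=37):
        super().__init__()
        self.hidden = hidden
        self.embed = nn.Embedding(vocab, hidden)
        self.rnn = nn.GRUCell(feat_dim + hidden, hidden)
        self.score = nn.Linear(feat_dim + hidden, 1)  # additive attention score
        self.out = nn.Linear(hidden, vocab)

    def forward(self, feats, max_len=25, bos=0):
        # feats: (B, T, feat_dim), e.g. directional features already combined
        B, T, _ = feats.shape
        h = feats.new_zeros(B, self.hidden)
        y = torch.full((B,), bos, dtype=torch.long, device=feats.device)
        logits = []
        for _ in range(max_len):
            # attend over the T feature slices, using the decoder state as query
            q = h.unsqueeze(1).expand(-1, T, -1)
            a = torch.softmax(self.score(torch.cat([feats, q], -1)).squeeze(-1), -1)
            ctx = (a.unsqueeze(-1) * feats).sum(1)      # (B, feat_dim) context
            h = self.rnn(torch.cat([ctx, self.embed(y)], -1), h)
            step = self.out(h)
            logits.append(step)
            y = step.argmax(-1)                          # greedy decoding
        return torch.stack(logits, 1)                    # (B, max_len, vocab)

# usage: 2 images, 26 feature slices of width 512
out = AttnDecoder()(torch.randn(2, 26, 512))             # -> (2, 25, 37)
```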
LCB-net: Long-Context Biasing for Audio-Visual Speech Recognition
The growing prevalence of online conferences and courses presents a new
challenge in improving automatic speech recognition (ASR) with enriched textual
information from video slides. In contrast to rare phrase lists, the slides
within videos are synchronized in real-time with the speech, enabling the
extraction of long contextual bias. Therefore, we propose a novel long-context
biasing network (LCB-net) for audio-visual speech recognition (AVSR) to
leverage the long-context information available in videos effectively.
Specifically, we adopt a bi-encoder architecture to simultaneously model audio
and long-context biasing. In addition, we propose a biasing prediction module
that uses a binary cross-entropy (BCE) loss to explicitly identify biased
phrases within the long-context biasing input. Furthermore, we introduce
dynamic contextual phrase simulation to enhance the generalization and
robustness of our LCB-net. Experiments on SlideSpeech, a large-scale
audio-visual corpus enriched with slides, show that our proposed LCB-net
outperforms the general ASR model, with relative WER/U-WER/B-WER reductions of
9.4%/9.1%/10.9% on the test set, delivering strong performance on both
unbiased and biased words. Moreover, evaluating our model on the LibriSpeech
corpus yields relative WER/U-WER/B-WER reductions of 23.8%/19.2%/35.4% over
the ASR model.
Comment: Accepted by ICASSP 2024
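As a rough illustration of the bi-encoder idea, the sketch below encodes audio and long-context bias text separately, fuses them with cross-attention, and trains a biasing prediction head with BCE loss to flag which bias phrases are actually spoken. Module sizes, layer counts, and names are assumptions; this is not the released LCB-net code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiEncoderBiasing(nn.Module):
    def __init__(self, d=256, heads=4):
        super().__init__()
        self.audio_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, heads, batch_first=True), num_layers=2)
        self.bias_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d, heads, batch_first=True), num_layers=2)
        self.cross = nn.MultiheadAttention(d, heads, batch_first=True)
        self.detect = nn.Linear(d, 1)              # biasing prediction head

    def forward(self, audio_feats, bias_emb, bias_labels=None):
        a = self.audio_enc(audio_feats)            # (B, Ta, d) acoustic stream
        b = self.bias_enc(bias_emb)                # (B, Tb, d) long-context bias
        fused, _ = self.cross(b, a, a)             # bias tokens attend to audio
        logits = self.detect(fused).squeeze(-1)    # (B, Tb) per-token scores
        loss = None
        if bias_labels is not None:                # 1 = phrase occurs in speech
            loss = F.binary_cross_entropy_with_logits(logits, bias_labels)
        return a, logits, loss

# usage: 80 audio frames, 20 bias tokens, random 0/1 occurrence labels
m = BiEncoderBiasing()
_, scores, loss = m(torch.randn(2, 80, 256), torch.randn(2, 20, 256),
                    bias_labels=torch.randint(0, 2, (2, 20)).float())
```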
A Comparative Study on Multichannel Speaker-Attributed Automatic Speech Recognition in Multi-Party Meetings
Speaker-attributed automatic speech recognition (SA-ASR) in multi-party
meeting scenarios is one of the most valuable and challenging ASR tasks. It
has been
shown that single-channel frame-level diarization with serialized output
training (SC-FD-SOT), single-channel word-level diarization with SOT
(SC-WD-SOT), and joint training of single-channel target-speaker separation
and ASR (SC-TS-ASR) can be exploited to partially solve this problem.
SC-FD-SOT obtains speaker-attributed transcriptions by aligning the speaker
diarization results with the ASR hypotheses; SC-WD-SOT uses word-level
diarization to remove the alignment's dependence on timestamps; and SC-TS-ASR,
which achieves the best performance of the three, jointly trains the
target-speaker separation and ASR modules. In this paper, we propose three
corresponding multichannel
(MC) SA-ASR approaches, namely MC-FD-SOT, MC-WD-SOT and MC-TS-ASR. For
different tasks/models, different multichannel data fusion strategies are
considered, including channel-level cross-channel attention for MC-FD-SOT,
frame-level cross-channel attention for MC-WD-SOT and neural beamforming for
MC-TS-ASR. Experimental results on the AliMeeting corpus reveal that our
proposed multichannel SA-ASR models consistently outperform their
single-channel counterparts in terms of the speaker-dependent character error
rate (SD-CER).
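The frame-level cross-channel attention used for MC-WD-SOT can be sketched as follows: each frame of a reference channel queries the corresponding frame across all microphone channels, producing one fused feature stream. The shapes, the choice of the first channel as reference, and the module layout are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class FrameCrossChannelAttn(nn.Module):
    def __init__(self, d=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, x):
        # x: (B, C, T, d) -- C microphone channels, T frames per channel
        B, C, T, d = x.shape
        # fold time into the batch so attention runs over the channel axis
        xc = x.permute(0, 2, 1, 3).reshape(B * T, C, d)   # (B*T, C, d)
        ref = xc[:, :1, :]                 # first channel as the query (assumed)
        fused, _ = self.attn(ref, xc, xc)  # each frame attends across channels
        return fused.reshape(B, T, d)      # channel-fused feature stream

# usage: 2 utterances, 8 channels, 100 frames, 256-dim features
fused = FrameCrossChannelAttn()(torch.randn(2, 8, 100, 256))  # -> (2, 100, 256)
```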
BA-SOT: Boundary-Aware Serialized Output Training for Multi-Talker ASR
The recently proposed serialized output training (SOT) simplifies
multi-talker automatic speech recognition (ASR) by generating speaker
transcriptions separated by a special token. However, frequent speaker changes
can make speaker change prediction difficult. To address this, we propose
boundary-aware serialized output training (BA-SOT), which explicitly
incorporates boundary knowledge into the decoder via a speaker change detection
task and boundary constraint loss. We also introduce a two-stage connectionist
temporal classification (CTC) strategy that incorporates token-level SOT CTC to
restore temporal context information. Besides the typical character error rate
(CER), we introduce the utterance-dependent character error rate (UD-CER) to
further measure the precision of speaker change prediction. Compared to the
original SOT, BA-SOT reduces CER/UD-CER by 5.1%/14.0%, and leveraging a
pre-trained ASR model for BA-SOT initialization further reduces CER/UD-CER by
8.4%/19.9%.
Comment: Accepted by INTERSPEECH 2023
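For readers unfamiliar with SOT, the toy sketch below shows the serialization convention the paper builds on: overlapping speaker transcriptions are concatenated in first-in-first-out order, separated by a special speaker-change token. The token name <sc> and helper names are illustrative; this is not the paper's code.

```python
SC = "<sc>"  # assumed name for the special speaker-change token

def serialize_sot(utterances):
    """utterances: list of (start_time, text), one entry per speaker turn."""
    ordered = sorted(utterances, key=lambda u: u[0])  # FIFO by start time
    return f" {SC} ".join(text for _, text in ordered)

# usage: two overlapping turns become one serialized reference
ref = serialize_sot([(0.0, "hello how are you"), (1.2, "i am fine thanks")])
print(ref)  # hello how are you <sc> i am fine thanks
```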