SHAS: Approaching optimal Segmentation for End-to-End Speech Translation
Speech translation models cannot directly process long audio, such as TED
talks, which must be split into shorter segments. Speech translation
datasets provide manual segmentations of the audio, which are not available in
real-world scenarios, and existing automatic segmentation methods usually
reduce translation quality significantly at inference time. To bridge the gap
between the manual segmentation used in training and the automatic one used at
inference, we propose Supervised Hybrid Audio Segmentation (SHAS), a method
that can effectively learn the optimal segmentation from any manually
segmented speech corpus.
First, we train a classifier to identify the included frames in a segmentation,
using speech representations from a pre-trained wav2vec 2.0. The optimal
splitting points are then found by a probabilistic Divide-and-Conquer algorithm
that progressively splits at the frame of lowest probability until all segments
are below a pre-specified length. Experiments on MuST-C and mTEDx show that the
translation of the segments produced by our method approaches the quality of
the manual segmentation on five language pairs. Specifically, SHAS retains 95-98% of
the manual segmentation's BLEU score, compared to the 87-93% of the best
existing methods. Our method is additionally generalizable to different domains
and achieves high zero-shot performance in unseen languages.
Comment: Submitted to Interspeech 2022, 5 pages. The previous version (v1)
additionally has a 2-page Appendix.
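The probabilistic Divide-and-Conquer step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name, the `probs` input (per-frame probabilities from the classifier), and the frame-based `max_len` parameter are all assumptions.

```python
def split_segments(probs, max_len):
    """Recursively split [start, end) at the frame with the lowest
    probability of being inside a segment, until every resulting
    segment is at most max_len frames long (hypothetical sketch)."""
    segments = []

    def recurse(start, end):
        if end - start <= max_len:
            segments.append((start, end))
            return
        # split at the frame least likely to belong inside a segment
        cut = start + min(range(end - start), key=lambda i: probs[start + i])
        # keep both halves non-empty
        cut = max(start + 1, min(cut, end - 1))
        recurse(start, cut)
        recurse(cut, end)

    recurse(0, len(probs))
    return segments
```

The recursion always lowers the longest remaining segment length, so it terminates with contiguous segments that cover the whole input.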
A hypothesize-and-verify framework for Text Recognition using Deep Recurrent Neural Networks
Deep LSTMs are an ideal candidate for text recognition. However, text
recognition involves initial image-processing steps, such as segmentation of
lines and words, which can introduce errors into the recognition system.
Without segmentation, learning very long-range context is difficult and
becomes computationally intractable. Therefore, alternative soft decisions
are needed at the pre-processing level. This paper proposes a hybrid text
recognizer using
a deep recurrent neural network with multiple layers of abstraction and long
range context along with a language model to verify the performance of the deep
neural network. In this paper we construct a multi-hypotheses tree architecture
with candidate segments of line sequences from different segmentation
algorithms at its different branches. The deep neural network is trained on
perfectly segmented data and tests each of the candidate segments, generating
Unicode sequences. In the verification step, these Unicode sequences are
validated using a sub-string match against the language model, and best-first
search is used to find the best combination of alternative hypotheses
from the tree structure. Thus the verification framework using language models
eliminates wrong segmentation outputs and filters out recognition errors.
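The verification step can be sketched with a best-first search over candidate hypotheses. This is a simplified stand-in for the paper's method: it assumes a flat list of alternatives per line position rather than the full multi-hypotheses tree, and the `lm_score` callable (a language-model score, higher is better) is a hypothetical interface.

```python
import heapq

def best_first_combine(alternatives, lm_score):
    """Best-first search for a high-scoring combination of hypotheses.

    alternatives: one list of candidate strings per line position
                  (a flattened stand-in for the multi-hypotheses tree)
    lm_score:     callable returning a language-model score for a string

    Note: without an admissible bound on the remaining score, the first
    complete path popped is not guaranteed globally optimal; this is a
    greedy best-first sketch, not an exact search.
    """
    # heap entries: (negated partial score, position, hypotheses chosen so far)
    heap = [(0.0, 0, [])]
    while heap:
        neg_score, pos, chosen = heapq.heappop(heap)
        if pos == len(alternatives):
            return chosen, -neg_score
        for cand in alternatives[pos]:
            heapq.heappush(heap, (neg_score - lm_score(cand), pos + 1, chosen + [cand]))
    return [], 0.0
```

Because the heap always expands the highest-scoring partial combination first, language-model-preferred hypotheses (e.g. correctly segmented words) are explored before misrecognized alternatives.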