Channel-Aware Pretraining of Joint Encoder-Decoder Self-Supervised Model for Telephonic-Speech ASR
This paper proposes a novel technique to obtain better downstream ASR
performance from a joint encoder-decoder self-supervised model when trained
with speech pooled from two different channels (narrow and wide band). The
joint encoder-decoder self-supervised model extends the HuBERT model with a
Transformer decoder. HuBERT performs clustering of features and predicts the
class of every input frame. In simple pooling, which is our baseline, there is
no way to identify the channel information. To incorporate channel information,
we propose non-overlapping cluster IDs for speech from different
channels. Our method gives a relative improvement of ~4% over the joint
encoder-decoder self-supervised model built with simple pooling of data.
Comment: 5 pages, 5 figures
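One way to picture the non-overlapping cluster IDs is a per-channel offset on the frame-level pseudo-labels. The sketch below is an illustration under assumptions, not the paper's implementation: the abstract does not say how the disjoint ID ranges are built, and the function name, the channel names, and the cluster count used here are hypothetical.

```python
def channel_aware_labels(cluster_ids, channel, num_clusters):
    """Shift per-frame cluster IDs so each channel uses its own ID range.

    Assumption (not from the abstract): each channel is clustered into
    `num_clusters` classes, and wide-band IDs are offset by `num_clusters`
    so they never collide with narrow-band IDs.
    """
    offsets = {"narrowband": 0, "wideband": num_clusters}  # hypothetical names
    if channel not in offsets:
        raise ValueError(f"unknown channel: {channel!r}")
    if any(not 0 <= c < num_clusters for c in cluster_ids):
        raise ValueError("cluster ID outside [0, num_clusters)")
    offset = offsets[channel]
    return [c + offset for c in cluster_ids]


# Usage: the same raw IDs map to disjoint targets depending on the channel.
narrow = channel_aware_labels([0, 3, 99], "narrowband", num_clusters=100)
wide = channel_aware_labels([0, 3, 99], "wideband", num_clusters=100)
```

With this scheme the prediction head would need twice as many output classes as in simple pooling, since the two channels no longer share targets.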
Technology Pipeline for Large Scale Cross-Lingual Dubbing of Lecture Videos into Multiple Indian Languages
Cross-lingual dubbing of lecture videos requires the transcription of the
original audio, correction and removal of disfluencies, domain term discovery,
text-to-text translation into the target language, chunking of text using
target language rhythm, text-to-speech synthesis followed by isochronous
lipsyncing to the original video. This task becomes challenging when the source
and target languages belong to different language families, resulting in
differences in generated audio duration. This is further compounded by the
original speaker's rhythm, especially for extempore speech. This paper
describes the challenges in regenerating English lecture videos in Indian
languages semi-automatically. A prototype is developed for dubbing lectures
into 9 Indian languages. A mean-opinion-score (MOS) is obtained for two
languages, Hindi and Tamil, on two different courses. The output video is
compared with the original video in terms of MOS (1-5) and lip synchronisation
with scores of 4.09 and 3.74, respectively. Human effort is also reduced by
75%.
Domain Adaptation of low-resource Target-Domain models using well-trained ASR Conformer Models
In this paper, we investigate domain adaptation for low-resource Automatic
Speech Recognition (ASR) of target-domain data, when a well-trained ASR model
trained with a large dataset is available. We argue that in the encoder-decoder
framework, the decoder of the well-trained ASR model is largely tuned towards
the source-domain, hurting the performance of target-domain models in vanilla
transfer-learning. On the other hand, the encoder layers of the well-trained
ASR model mostly capture the acoustic characteristics. We, therefore, propose
to use the embeddings tapped from these encoder layers as features for a
downstream Conformer target-domain model and show that they provide significant
improvements. We run ablation studies on which encoder layer is optimal for
tapping the embeddings, as well as on the effect of freezing or updating the well-trained
ASR model's encoder layers. We further show that applying Spectral Augmentation
(SpecAug) on the proposed features (this is in addition to default SpecAug on
input spectral features) provides a further improvement on the target-domain
performance. With LibriSpeech-100-clean as the target domain and SPGI-5000
as the well-trained model, we get a 30% relative improvement over the baseline.
Similarly, with WSJ as the target domain and LibriSpeech-960 as the well-trained
model, we get a 50% relative improvement over the baseline.
Comment: 5 pages, 2 figures
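The idea of applying SpecAug to the tapped encoder embeddings can be sketched as masking along time and along the embedding dimension. This is a minimal illustration under assumptions: the abstract gives no mask counts or widths, so the function name, the default mask sizes, and the use of a single mask per axis are all hypothetical, and time warping is left out.

```python
import random


def spec_augment(features, max_time_mask=10, max_dim_mask=8, rng=None):
    """Return a masked copy of `features` (a list of frames, each a list of floats).

    Assumption (not from the abstract): one time mask and one
    feature-dimension mask are applied, each zeroing a random span.
    """
    rng = rng or random.Random()
    out = [list(frame) for frame in features]  # copy; input stays untouched
    num_frames = len(out)
    dim = len(out[0]) if out else 0
    if num_frames == 0 or dim == 0:
        return out

    # Time mask: zero a contiguous run of frames.
    t_width = rng.randint(0, min(max_time_mask, num_frames))
    t_start = rng.randint(0, num_frames - t_width)
    for t in range(t_start, t_start + t_width):
        out[t] = [0.0] * dim

    # Feature mask: zero a contiguous run of embedding dimensions in every frame.
    d_width = rng.randint(0, min(max_dim_mask, dim))
    d_start = rng.randint(0, dim - d_width)
    for frame in out:
        for d in range(d_start, d_start + d_width):
            frame[d] = 0.0
    return out


# Usage: 50 frames of 16-dimensional embeddings, seeded for repeatability.
embeddings = [[1.0] * 16 for _ in range(50)]
augmented = spec_augment(embeddings, rng=random.Random(0))
```

In the setting the abstract describes, this masking would be applied to the embeddings in addition to the default SpecAug on the input spectral features.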