Phoneme-BERT: Joint Language Modelling of Phoneme Sequence and ASR Transcript
Recent years have witnessed significant improvements in the ability of ASR
systems to recognize spoken utterances. However, transcription remains
challenging for noisy and out-of-domain data, where substitution and deletion
errors are prevalent in the transcribed text. These errors significantly
degrade the performance of
downstream tasks. In this work, we propose a BERT-style language model,
referred to as PhonemeBERT, that learns a joint language model with phoneme
sequence and ASR transcript to learn phonetic-aware representations that are
robust to ASR errors. We show that PhonemeBERT can be used on downstream tasks
using phoneme sequences as additional features, and also in low-resource setup
where we only have ASR-transcripts for the downstream tasks with no phoneme
information available. We evaluate our approach extensively by generating noisy
data for three benchmark datasets (Stanford Sentiment Treebank, TREC, and ATIS)
for sentiment, question, and intent classification tasks, respectively. The
proposed approach comprehensively beats the state-of-the-art baselines on each
dataset.

Comment: Accepted to Interspeech 2021 conference
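The abstract does not give implementation details, but one plausible way a BERT-style model could jointly consume both views is to concatenate the ASR transcript tokens and the phoneme tokens into a single input sequence. The sketch below is purely illustrative; the token names and input layout are assumptions, not the paper's actual scheme.

```python
# Hypothetical sketch: building one joint input sequence from an ASR
# transcript and its phoneme sequence, so a BERT-style encoder can attend
# across both views. [CLS]/[SEP] usage mirrors standard BERT conventions;
# the exact layout used by PhonemeBERT may differ.

def build_joint_input(transcript_tokens, phoneme_tokens):
    """Concatenate the word-level and phoneme-level views of an utterance."""
    return (["[CLS]"] + transcript_tokens
            + ["[SEP]"] + phoneme_tokens + ["[SEP]"])

tokens = build_joint_input(
    ["play", "some", "jazz"],
    ["P", "L", "EY", "S", "AH", "M", "JH", "AE", "Z"],
)
print(tokens)
```

In such a setup the phoneme segment can simply be dropped at inference time for the low-resource case the abstract mentions, where only ASR transcripts are available downstream.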
A systematic comparison of grapheme-based vs. phoneme-based label units for encoder-decoder-attention models
Following the rationale of end-to-end modeling, CTC, RNN-T, and
encoder-decoder-attention models for automatic speech recognition (ASR) use
graphemes or grapheme-based subword units derived via, e.g., byte-pair
encoding (BPE). The mapping from pronunciation to spelling is learned entirely
from
data. In contrast to this, classical approaches to ASR employ secondary
knowledge sources in the form of phoneme lists to define phonetic output labels
and pronunciation lexica. In this work, we do a systematic comparison between
grapheme- and phoneme-based output labels for an encoder-decoder-attention ASR
model. We investigate the use of single phonemes as well as BPE-based phoneme
groups as output labels of our model. To preserve a simplified and efficient
decoder design, we also extend the phoneme set by auxiliary units to be able to
distinguish homophones. Experiments performed on the Switchboard 300h and
LibriSpeech benchmarks show that phoneme-based modeling is competitive with
grapheme-based encoder-decoder-attention modeling.

Comment: submission to ICASSP 202
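The auxiliary units mentioned in the abstract address a concrete problem: with phoneme output labels, homophones such as "write" and "right" map to the same label sequence, so the decoder cannot distinguish them. A common remedy (sketched below under assumed conventions; the symbols and lexicon are illustrative, not taken from the paper) is to append a disambiguation symbol to each pronunciation shared by multiple words.

```python
# Hypothetical sketch: extending phoneme label sequences with auxiliary
# units (#1, #2, ...) so that homophones get distinct output sequences,
# keeping a simple one-to-one decoder mapping. Lexicon is illustrative.

def disambiguate_homophones(lexicon):
    """Append an auxiliary unit to each pronunciation, numbered per homophone."""
    counts = {}   # pronunciation -> how many words seen with it so far
    labels = {}   # word -> phoneme sequence plus auxiliary unit
    for word, phonemes in lexicon.items():
        key = tuple(phonemes)
        counts[key] = counts.get(key, 0) + 1
        labels[word] = list(phonemes) + [f"#{counts[key]}"]
    return labels

lex = {
    "write": ["R", "AY", "T"],
    "right": ["R", "AY", "T"],   # homophone of "write"
    "ride":  ["R", "AY", "D"],
}
labels = disambiguate_homophones(lex)
```

After this step, "write" and "right" share the same phonemes but end in different auxiliary units, so the output label sequences are unique per word while the phoneme inventory stays small.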