26 research outputs found
Retraining-free Customized ASR for Enharmonic Words Based on a Named-Entity-Aware Model and Phoneme Similarity Estimation
End-to-end automatic speech recognition (E2E-ASR) has the potential to
improve performance, but a specific issue that needs to be addressed is the
difficulty it has in handling enharmonic words: named entities (NEs) with the
same pronunciation and part of speech that are spelled differently. This often
occurs with Japanese personal names that have the same pronunciation but
different Kanji characters. Since such NE words tend to be important keywords,
ASR easily loses user trust if it misrecognizes them. To solve these problems,
this paper proposes a novel retraining-free customized method for E2E-ASRs
based on a named-entity-aware E2E-ASR model and phoneme similarity estimation.
Experimental results show that the proposed method improves the target NE
character error rate by 35.7% on average relative to the conventional E2E-ASR
model when selecting personal names as a target NE.
Comment: accepted by INTERSPEECH202
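The abstract does not spell out how the phoneme similarity estimation is used, but a minimal sketch of one plausible use, matching a recognized named entity to a user-supplied NE list by normalized phoneme edit distance, is shown below. Function names, the threshold, and the example entries are illustrative assumptions, not taken from the paper.

# Hypothetical sketch: match a recognized NE against a customized NE list
# by phoneme-sequence similarity (normalized edit distance). Not the
# paper's actual algorithm; names and thresholds are illustrative only.

def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    dp = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, pb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (pa != pb))  # substitution
    return dp[-1]

def phoneme_similarity(a, b):
    """1.0 for identical phoneme sequences, lower for dissimilar ones."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

def rewrite_ne(recognized_phonemes, ne_list, threshold=0.8):
    """Replace a recognized NE with the closest entry from a user NE list."""
    best = max(ne_list, key=lambda e: phoneme_similarity(recognized_phonemes, e["phonemes"]))
    if phoneme_similarity(recognized_phonemes, best["phonemes"]) >= threshold:
        return best["surface"]
    return None

# Example: two Japanese personal names with similar pronunciations.
ne_list = [{"surface": "斉藤", "phonemes": ["s", "a", "i", "t", "o", "o"]},
           {"surface": "佐藤", "phonemes": ["s", "a", "t", "o", "o"]}]
print(rewrite_ne(["s", "a", "i", "t", "o", "o"], ne_list))  # -> 斉藤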
Improvement of DOA Estimation by using Quaternion Output in Sound Event Localization and Detection
This paper describes the improvement of Direction of Arrival (DOA) estimation performance using quaternion output in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 Task 3. DCASE 2019 Task 3 focuses on sound event localization and detection (SELD), a task that simultaneously estimates the sound source direction in addition to conventional sound event detection (SED). In the baseline method, the sound source direction angle is directly regressed. However, the angle is periodic and has discontinuities, which may make learning unstable. Specifically, even though -180 deg and 180 deg are the same direction, a large loss is calculated. Estimating DOA angles with a classification approach instead of regression can avoid this instability, but it limits the resolution. In this paper, we propose to introduce the quaternion, which is a continuous representation, into the output layer of the neural network instead of directly estimating the sound source direction angle. This method can be implemented simply by changing the output of an existing neural network, and thus does not significantly increase the number of parameters in the middle layers. Experimental results show that the proposed method improves DOA estimation without significantly increasing the number of parameters.
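The discontinuity described above, and why a quaternion target avoids it, can be illustrated numerically. The sketch below uses an azimuth-only rotation about the vertical axis and a simple quaternion-dot-product loss; the paper's exact parameterization and loss function are not reproduced here.

# Illustrative sketch (not the paper's exact formulation): an azimuth of
# -180 deg and +180 deg point in the same direction, yet naive angle
# regression sees a huge error, whereas a quaternion target does not.
import math

def angle_loss(pred_deg, true_deg):
    """Naive squared error on raw angles; discontinuous at +/-180 deg."""
    return (pred_deg - true_deg) ** 2

def azimuth_to_quaternion(deg):
    """Unit quaternion for a rotation of `deg` about the vertical (z) axis."""
    half = math.radians(deg) / 2.0
    return (math.cos(half), 0.0, 0.0, math.sin(half))

def quaternion_loss(pred_deg, true_deg):
    """1 - |<q_pred, q_true>|: 0 for identical rotations, continuous everywhere
    (the absolute value handles q and -q describing the same rotation)."""
    qp = azimuth_to_quaternion(pred_deg)
    qt = azimuth_to_quaternion(true_deg)
    dot = sum(p * t for p, t in zip(qp, qt))
    return 1.0 - abs(dot)

print(angle_loss(-180.0, 180.0))       # 129600.0 -- large loss, same direction
print(quaternion_loss(-180.0, 180.0))  # ~0.0     -- same rotation, no penalty

In an actual network the quaternion would presumably be a direct four-dimensional output normalized to unit length rather than being derived from a predicted angle; the conversion above is only for the numerical comparison.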
DPHuBERT: Joint Distillation and Pruning of Self-Supervised Speech Models
Self-supervised learning (SSL) has achieved notable success in many speech
processing tasks, but the large model size and heavy computational cost hinder
the deployment. Knowledge distillation trains a small student model to mimic
the behavior of a large teacher model. However, the student architecture
usually needs to be manually designed and will remain fixed during training,
which requires prior knowledge and can lead to suboptimal performance. Inspired
by recent success of task-specific structured pruning, we propose DPHuBERT, a
novel task-agnostic compression method for speech SSL based on joint
distillation and pruning. Experiments on SUPERB show that DPHuBERT outperforms
pure distillation methods in almost all tasks. Moreover, DPHuBERT requires
little training time and performs well with limited training data, making it
suitable for resource-constrained applications. Our method can also be applied
to various speech SSL models. Our code and models will be publicly available.
Comment: Accepted at INTERSPEECH 2023. Code will be available at:
https://github.com/pyf98/DPHuBER
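As a rough illustration of what "joint distillation and pruning" can mean in code, the sketch below combines a hidden-state distillation loss with a sparsity penalty on learnable channel gates. The actual DPHuBERT objective, gate parameterization, and sparsity schedule are more involved (e.g., L0-style hard-concrete gates with a target sparsity); everything here is a simplified assumption.

# Simplified sketch of joint distillation and structured pruning.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrunableStudentLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)
        self.gate = nn.Parameter(torch.ones(dim))  # per-channel gate; ~0 => prunable

    def forward(self, x):
        return F.relu(self.linear(x)) * self.gate

def distill_and_prune_loss(student_hidden, teacher_hidden, gates, sparsity_weight=0.01):
    """Match teacher hidden states (L1 + cosine) and push gates toward zero."""
    distill = F.l1_loss(student_hidden, teacher_hidden) \
        + (1.0 - F.cosine_similarity(student_hidden, teacher_hidden, dim=-1)).mean()
    sparsity = sum(g.abs().mean() for g in gates)
    return distill + sparsity_weight * sparsity

# Toy usage with random tensors standing in for real teacher/student features.
layer = PrunableStudentLayer(dim=768)
x = torch.randn(4, 100, 768)               # (batch, frames, dim)
teacher_hidden = torch.randn(4, 100, 768)  # frozen teacher features
loss = distill_and_prune_loss(layer(x), teacher_hidden, [layer.gate])
loss.backward()
print(loss.item())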
4D ASR: Joint modeling of CTC, Attention, Transducer, and Mask-Predict decoders
The network architecture of end-to-end (E2E) automatic speech recognition
(ASR) can be classified into several models, including connectionist temporal
classification (CTC), recurrent neural network transducer (RNN-T), attention
mechanism, and non-autoregressive mask-predict models. Since each of these
network architectures has pros and cons, a typical use case is to switch these
separate models depending on the application requirement, resulting in the
increased overhead of maintaining all models. Several methods for integrating
two of these complementary models to mitigate the overhead issue have been
proposed; however, if we integrate more models, we will further benefit from
these complementary models and realize broader applications with a single
system. This paper proposes four-decoder joint modeling (4D) of CTC, attention,
RNN-T, and mask-predict, which has the following three advantages: 1) The four
decoders are jointly trained so that they can be easily switched depending on
the application scenarios. 2) Joint training may bring model regularization and
improve the model robustness thanks to their complementary properties. 3) Novel
one-pass joint decoding methods using CTC, attention, and RNN-T further
improve the performance. The experimental results showed that the proposed
model consistently reduced the WER.
Comment: Accepted by INTERSPEECH202
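The joint training described above presumably optimizes a weighted combination of the per-decoder losses over a shared encoder. The schematic below shows such a combination; the weights and the scalar stand-ins for each decoder's loss are placeholders, not values from the paper.

# Schematic of four-decoder joint training: one shared encoder, four decoder
# losses combined with interpolation weights (placeholder values only).
def joint_4d_loss(loss_ctc, loss_att, loss_rnnt, loss_maskpredict,
                  w_ctc=0.25, w_att=0.25, w_rnnt=0.25, w_mask=0.25):
    """Weighted sum of the four decoder losses sharing one encoder."""
    assert abs(w_ctc + w_att + w_rnnt + w_mask - 1.0) < 1e-6
    return (w_ctc * loss_ctc + w_att * loss_att
            + w_rnnt * loss_rnnt + w_mask * loss_maskpredict)

# Example with scalar stand-ins for the per-decoder losses of one batch.
print(joint_4d_loss(loss_ctc=42.0, loss_att=30.5, loss_rnnt=38.2,
                    loss_maskpredict=33.1))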
Contextualized Automatic Speech Recognition with Attention-Based Bias Phrase Boosted Beam Search
End-to-end (E2E) automatic speech recognition (ASR) methods exhibit
remarkable performance. However, since the performance of such methods is
intrinsically linked to the context present in the training data, E2E-ASR
methods do not perform as desired for unseen user contexts (e.g., technical
terms, personal names, and playlists). Thus, E2E-ASR methods must be easily
contextualized by the user or developer. This paper proposes an attention-based
contextual biasing method that can be customized using an editable phrase list
(referred to as a bias list). The proposed method can be trained effectively by
combining a bias phrase index loss and special tokens to detect the bias
phrases in the input speech data. In addition, to improve the contextualization
performance during inference further, we propose a bias phrase boosted (BPB)
beam search algorithm based on the bias phrase index probability. Experimental
results demonstrate that the proposed method consistently improves the word
error rate and the character error rate of the target phrases in the bias list
on the LibriSpeech-960 (English) and our in-house (Japanese) datasets,
respectively.
Comment: accepted by ICASSP2022
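The boosting step can be pictured as adding a bonus to a hypothesis score whenever its newest tokens extend a phrase from the bias list. The sketch below uses a simple prefix check and a fixed bonus scale, which are assumptions for illustration rather than the paper's bias phrase index probability.

# Illustrative sketch of boosting beam-search scores for bias phrases.
def bias_bonus(hypothesis_tokens, bias_phrases, bonus=2.0):
    """Return a score bonus if the hypothesis ends inside/on a bias phrase."""
    best = 0.0
    for phrase in bias_phrases:
        for n in range(1, len(phrase) + 1):
            if hypothesis_tokens[-n:] == phrase[:n]:
                # Larger bonus the more of the phrase is already covered.
                best = max(best, bonus * n / len(phrase))
    return best

def rescore_beam(beam, bias_phrases):
    """beam: list of (tokens, log_prob). Adds the bias bonus to each score."""
    return sorted(((tokens, score + bias_bonus(tokens, bias_phrases))
                   for tokens, score in beam),
                  key=lambda item: item[1], reverse=True)

bias_phrases = [["new", "york"], ["tokyo"]]
beam = [(["i", "live", "in", "new"], -4.2),
        (["i", "live", "in", "knew"], -4.0)]
print(rescore_beam(beam, bias_phrases))  # the "new" hypothesis is boosted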
Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data
Pre-training speech models on large volumes of data has achieved remarkable
success. OpenAI Whisper is a multilingual multitask model trained on 680k hours
of supervised speech data. It generalizes well to various speech recognition
and translation benchmarks even in a zero-shot setup. However, the full
pipeline for developing such models (from data collection to training) is not
publicly accessible, which makes it difficult for researchers to further
improve its performance and address training-related issues such as efficiency,
robustness, fairness, and bias. This work presents an Open Whisper-style Speech
Model (OWSM), which reproduces Whisper-style training using an open-source
toolkit and publicly available data. OWSM even supports more translation
directions and can be more efficient to train. We will publicly release all
scripts used for data preparation, training, inference, and scoring as well as
pre-trained models and training logs to promote open science.
Comment: Accepted at ASRU 202