26 research outputs found
Retraining-free Customized ASR for Enharmonic Words Based on a Named-Entity-Aware Model and Phoneme Similarity Estimation
End-to-end automatic speech recognition (E2E-ASR) has the potential to
improve performance, but a specific issue that needs to be addressed is the
difficulty it has in handling enharmonic words: named entities (NEs) with the
same pronunciation and part of speech that are spelled differently. This often
occurs with Japanese personal names that have the same pronunciation but
different Kanji characters. Since such NE words tend to be important keywords,
ASR easily loses user trust if it misrecognizes them. To solve these problems,
this paper proposes a novel retraining-free customized method for E2E-ASRs
based on a named-entity-aware E2E-ASR model and phoneme similarity estimation.
Experimental results show that the proposed method improves the target NE
character error rate by 35.7% on average relative to the conventional E2E-ASR
model when selecting personal names as a target NE.
Comment: accepted by INTERSPEECH202
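The abstract does not spell out how the phoneme similarity estimation is used, but a minimal sketch of one plausible use, matching a recognized named entity to a user-supplied NE list by normalized phoneme edit distance, is shown below. Function names, the threshold, and the example entries are illustrative assumptions, not taken from the paper.

# Hypothetical sketch: match a recognized NE against a customized NE list
# by phoneme-sequence similarity (normalized edit distance). Not the
# paper's actual algorithm; names and thresholds are illustrative only.

def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    dp = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, pb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,          # deletion
                                     dp[j - 1] + 1,      # insertion
                                     prev + (pa != pb))  # substitution
    return dp[-1]

def phoneme_similarity(a, b):
    """1.0 for identical phoneme sequences, lower for dissimilar ones."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

def rewrite_ne(recognized_phonemes, ne_list, threshold=0.8):
    """Replace a recognized NE with the closest entry from a user NE list."""
    best = max(ne_list, key=lambda e: phoneme_similarity(recognized_phonemes, e["phonemes"]))
    if phoneme_similarity(recognized_phonemes, best["phonemes"]) >= threshold:
        return best["surface"]
    return None

# Example: two Japanese personal names with similar pronunciations.
ne_list = [{"surface": "斉藤", "phonemes": ["s", "a", "i", "t", "o", "o"]},
           {"surface": "佐藤", "phonemes": ["s", "a", "t", "o", "o"]}]
print(rewrite_ne(["s", "a", "i", "t", "o", "o"], ne_list))  # -> 斉藤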
Improvement of DOA Estimation by using Quaternion Output in Sound Event Localization and Detection
This paper describes the improvement of Direction of Arrival (DOA) estimation performance using quaternion output in the Detection and Classification of Acoustic Scenes and Events (DCASE) 2019 Task 3. DCASE 2019 Task 3 focuses on sound event localization and detection (SELD), a task that simultaneously estimates the sound source direction in addition to conventional sound event detection (SED). In the baseline method, the sound source direction angle is directly regressed. However, the angle is periodic and has discontinuities, which may make learning unstable. Specifically, even though -180 deg and 180 deg are the same direction, a large loss is calculated. Estimating DOA angles with a classification approach instead of regression can avoid this instability, but it limits the resolution. In this paper, we propose to introduce the quaternion, which is a continuous representation, into the output layer of the neural network instead of directly estimating the sound source direction angle. This method can be implemented simply by changing the output of an existing neural network, and thus does not significantly increase the number of parameters in the middle layers. Experimental results show that the proposed method improves DOA estimation without significantly increasing the number of parameters.
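The discontinuity described above, and why a quaternion target avoids it, can be illustrated numerically. The sketch below uses an azimuth-only rotation about the vertical axis and a simple quaternion-dot-product loss; the paper's exact parameterization and loss function are not reproduced here.

# Illustrative sketch (not the paper's exact formulation): an azimuth of
# -180 deg and +180 deg point in the same direction, yet naive angle
# regression sees a huge error, whereas a quaternion target does not.
import math

def angle_loss(pred_deg, true_deg):
    """Naive squared error on raw angles; discontinuous at +/-180 deg."""
    return (pred_deg - true_deg) ** 2

def azimuth_to_quaternion(deg):
    """Unit quaternion for a rotation of `deg` about the vertical (z) axis."""
    half = math.radians(deg) / 2.0
    return (math.cos(half), 0.0, 0.0, math.sin(half))

def quaternion_loss(pred_deg, true_deg):
    """1 - |<q_pred, q_true>|: 0 for identical rotations, continuous everywhere
    (the absolute value handles q and -q describing the same rotation)."""
    qp = azimuth_to_quaternion(pred_deg)
    qt = azimuth_to_quaternion(true_deg)
    dot = sum(p * t for p, t in zip(qp, qt))
    return 1.0 - abs(dot)

print(angle_loss(-180.0, 180.0))       # 129600.0 -- large loss, same direction
print(quaternion_loss(-180.0, 180.0))  # ~0.0     -- same rotation, no penalty

In an actual network the quaternion would presumably be a direct four-dimensional output normalized to unit length rather than being derived from a predicted angle; the conversion above is only for the numerical comparison.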
DPHuBERT: Joint Distillation and Pruning of Self-Supervised Speech Models
Self-supervised learning (SSL) has achieved notable success in many speech
processing tasks, but the large model size and heavy computational cost hinder
the deployment. Knowledge distillation trains a small student model to mimic
the behavior of a large teacher model. However, the student architecture
usually needs to be manually designed and will remain fixed during training,
which requires prior knowledge and can lead to suboptimal performance. Inspired
by recent success of task-specific structured pruning, we propose DPHuBERT, a
novel task-agnostic compression method for speech SSL based on joint
distillation and pruning. Experiments on SUPERB show that DPHuBERT outperforms
pure distillation methods in almost all tasks. Moreover, DPHuBERT requires
little training time and performs well with limited training data, making it
suitable for resource-constrained applications. Our method can also be applied
to various speech SSL models. Our code and models will be publicly available.
Comment: Accepted at INTERSPEECH 2023. Code will be available at:
https://github.com/pyf98/DPHuBER
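As a rough illustration of what "joint distillation and pruning" can mean in code, the sketch below combines a hidden-state distillation loss with a sparsity penalty on learnable channel gates. The actual DPHuBERT objective, gate parameterization, and sparsity schedule are more involved (e.g., L0-style hard-concrete gates with a target sparsity); everything here is a simplified assumption.

# Simplified sketch of joint distillation and structured pruning.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrunableStudentLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim)
        self.gate = nn.Parameter(torch.ones(dim))  # per-channel gate; ~0 => prunable

    def forward(self, x):
        return F.relu(self.linear(x)) * self.gate

def distill_and_prune_loss(student_hidden, teacher_hidden, gates, sparsity_weight=0.01):
    """Match teacher hidden states (L1 + cosine) and push gates toward zero."""
    distill = F.l1_loss(student_hidden, teacher_hidden) \
        + (1.0 - F.cosine_similarity(student_hidden, teacher_hidden, dim=-1)).mean()
    sparsity = sum(g.abs().mean() for g in gates)
    return distill + sparsity_weight * sparsity

# Toy usage with random tensors standing in for real teacher/student features.
layer = PrunableStudentLayer(dim=768)
x = torch.randn(4, 100, 768)               # (batch, frames, dim)
teacher_hidden = torch.randn(4, 100, 768)  # frozen teacher features
loss = distill_and_prune_loss(layer(x), teacher_hidden, [layer.gate])
loss.backward()
print(loss.item())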
4D ASR: Joint modeling of CTC, Attention, Transducer, and Mask-Predict decoders
The network architecture of end-to-end (E2E) automatic speech recognition
(ASR) can be classified into several models, including connectionist temporal
classification (CTC), recurrent neural network transducer (RNN-T), attention
mechanism, and non-autoregressive mask-predict models. Since each of these
network architectures has pros and cons, a typical use case is to switch these
separate models depending on the application requirement, resulting in the
increased overhead of maintaining all models. Several methods for integrating
two of these complementary models to mitigate the overhead issue have been
proposed; however, if we integrate more models, we will further benefit from
these complementary models and realize broader applications with a single
system. This paper proposes four-decoder joint modeling (4D) of CTC, attention,
RNN-T, and mask-predict, which has the following three advantages: 1) The four
decoders are jointly trained so that they can be easily switched depending on
the application scenarios. 2) Joint training may bring model regularization and
improve the model robustness thanks to their complementary properties. 3) Novel
one-pass joint decoding methods using CTC, attention, and RNN-T further
improve the performance. The experimental results showed that the proposed
model consistently reduced the WER.
Comment: Accepted by INTERSPEECH202
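The joint training described above presumably optimizes a weighted combination of the per-decoder losses over a shared encoder. The schematic below shows such a combination; the weights and the scalar stand-ins for each decoder's loss are placeholders, not values from the paper.

# Schematic of four-decoder joint training: one shared encoder, four decoder
# losses combined with interpolation weights (placeholder values only).
def joint_4d_loss(loss_ctc, loss_att, loss_rnnt, loss_maskpredict,
                  w_ctc=0.25, w_att=0.25, w_rnnt=0.25, w_mask=0.25):
    """Weighted sum of the four decoder losses sharing one encoder."""
    assert abs(w_ctc + w_att + w_rnnt + w_mask - 1.0) < 1e-6
    return (w_ctc * loss_ctc + w_att * loss_att
            + w_rnnt * loss_rnnt + w_mask * loss_maskpredict)

# Example with scalar stand-ins for the per-decoder losses of one batch.
print(joint_4d_loss(loss_ctc=42.0, loss_att=30.5, loss_rnnt=38.2,
                    loss_maskpredict=33.1))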
Contextualized Automatic Speech Recognition with Attention-Based Bias Phrase Boosted Beam Search
End-to-end (E2E) automatic speech recognition (ASR) methods exhibit
remarkable performance. However, since the performance of such methods is
intrinsically linked to the context present in the training data, E2E-ASR
methods do not perform as desired for unseen user contexts (e.g., technical
terms, personal names, and playlists). Thus, E2E-ASR methods must be easily
contextualized by the user or developer. This paper proposes an attention-based
contextual biasing method that can be customized using an editable phrase list
(referred to as a bias list). The proposed method can be trained effectively by
combining a bias phrase index loss and special tokens to detect the bias
phrases in the input speech data. In addition, to improve the contextualization
performance during inference further, we propose a bias phrase boosted (BPB)
beam search algorithm based on the bias phrase index probability. Experimental
results demonstrate that the proposed method consistently improves the word
error rate and the character error rate of the target phrases in the bias list
on the LibriSpeech-960 (English) and our in-house (Japanese) datasets,
respectively.
Comment: accepted by ICASSP2022
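The boosting step can be pictured as adding a bonus to a hypothesis score whenever its newest tokens extend a phrase from the bias list. The sketch below uses a simple prefix check and a fixed bonus scale, which are assumptions for illustration rather than the paper's bias phrase index probability.

# Illustrative sketch of boosting beam-search scores for bias phrases.
def bias_bonus(hypothesis_tokens, bias_phrases, bonus=2.0):
    """Return a score bonus if the hypothesis ends inside/on a bias phrase."""
    best = 0.0
    for phrase in bias_phrases:
        for n in range(1, len(phrase) + 1):
            if hypothesis_tokens[-n:] == phrase[:n]:
                # Larger bonus the more of the phrase is already covered.
                best = max(best, bonus * n / len(phrase))
    return best

def rescore_beam(beam, bias_phrases):
    """beam: list of (tokens, log_prob). Adds the bias bonus to each score."""
    return sorted(((tokens, score + bias_bonus(tokens, bias_phrases))
                   for tokens, score in beam),
                  key=lambda item: item[1], reverse=True)

bias_phrases = [["new", "york"], ["tokyo"]]
beam = [(["i", "live", "in", "new"], -4.2),
        (["i", "live", "in", "knew"], -4.0)]
print(rescore_beam(beam, bias_phrases))  # the "new" hypothesis is boosted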
Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data
Pre-training speech models on large volumes of data has achieved remarkable
success. OpenAI Whisper is a multilingual multitask model trained on 680k hours
of supervised speech data. It generalizes well to various speech recognition
and translation benchmarks even in a zero-shot setup. However, the full
pipeline for developing such models (from data collection to training) is not
publicly accessible, which makes it difficult for researchers to further
improve its performance and address training-related issues such as efficiency,
robustness, fairness, and bias. This work presents an Open Whisper-style Speech
Model (OWSM), which reproduces Whisper-style training using an open-source
toolkit and publicly available data. OWSM even supports more translation
directions and can be more efficient to train. We will publicly release all
scripts used for data preparation, training, inference, and scoring as well as
pre-trained models and training logs to promote open science.
Comment: Accepted at ASRU 202