193 research outputs found

    Harnessing the Zero-Shot Power of Instruction-Tuned Large Language Model in End-to-End Speech Recognition

    We present a novel integration of an instruction-tuned large language model (LLM) and end-to-end automatic speech recognition (ASR). Modern LLMs can perform a wide range of linguistic tasks in a zero-shot setting when provided with a precise instruction or prompt that guides the text generation process towards the desired task. We explore using this zero-shot capability of LLMs to extract linguistic information that can contribute to improving ASR performance. Specifically, we direct an LLM to correct grammatical errors in an ASR hypothesis and harness its embedded linguistic knowledge to conduct end-to-end ASR. The proposed model is built on the hybrid connectionist temporal classification (CTC) and attention architecture, where an instruction-tuned LLM (i.e., Llama2) is employed as a front-end of the decoder. An ASR hypothesis, subject to correction, is obtained from the encoder via CTC decoding and is then fed into the LLM along with an instruction. The decoder subsequently takes the LLM embeddings as input to perform sequence generation, incorporating acoustic information from the encoder output. Experimental results and analyses demonstrate that the proposed integration yields promising performance improvements, and our approach benefits largely from LLM-based rescoring.
    Comment: Submitted to ICASSP202
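    As a rough illustration of the dataflow described above, the following minimal PyTorch sketch performs greedy CTC decoding to obtain a hypothesis and feeds its embeddings, together with encoder states, to an attention-based decoder. This is not the authors' implementation: the LLMFrontEndDecoder class, its dimensions, and the plain embedding table standing in for Llama2 are illustrative assumptions.

        # Minimal sketch of the described dataflow; the embedding table is a
        # placeholder for the instruction-tuned LLM (the paper uses Llama2).
        import torch
        import torch.nn as nn

        BLANK = 0

        def ctc_greedy_decode(log_probs: torch.Tensor) -> list[int]:
            """Standard CTC best-path decoding: collapse repeats, drop blanks."""
            best = log_probs.argmax(dim=-1).tolist()   # frame-wise argmax
            collapsed = [t for i, t in enumerate(best) if i == 0 or t != best[i - 1]]
            return [t for t in collapsed if t != BLANK]

        class LLMFrontEndDecoder(nn.Module):
            """Toy stand-in: embeds the hypothesis tokens and attends to the
            encoder (acoustic) states, mirroring the hybrid CTC/attention setup."""
            def __init__(self, vocab: int, dim: int):
                super().__init__()
                self.embed = nn.Embedding(vocab, dim)  # stands in for LLM embeddings
                self.attn = nn.MultiheadAttention(dim, 4, batch_first=True)
                self.out = nn.Linear(dim, vocab)

            def forward(self, hyp_tokens, enc_states):
                q = self.embed(hyp_tokens)                       # LLM-side representations
                fused, _ = self.attn(q, enc_states, enc_states)  # inject acoustic info
                return self.out(fused)

        # usage with random tensors
        enc = torch.randn(1, 50, 64)                            # encoder output (B, T, D)
        ctc_logp = torch.log_softmax(torch.randn(50, 100), -1)  # frame posteriors
        hyp = torch.tensor([ctc_greedy_decode(ctc_logp)])       # hypothesis to correct
        logits = LLMFrontEndDecoder(100, 64)(hyp, enc)
        print(logits.shape)  # (1, U, vocab)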

    InterMPL: Momentum Pseudo-Labeling with Intermediate CTC Loss

    This paper presents InterMPL, a semi-supervised learning method for end-to-end automatic speech recognition (ASR) that performs pseudo-labeling (PL) with intermediate supervision. Momentum PL (MPL) trains a connectionist temporal classification (CTC)-based model on unlabeled data by continuously generating pseudo-labels on the fly and improving their quality. In contrast to autoregressive formulations, such as the attention-based encoder-decoder and transducer, CTC is well suited to MPL, and to PL-based semi-supervised ASR in general, owing to its simple and fast inference algorithm and its robustness against generating collapsed labels. However, CTC generally yields worse performance than autoregressive models due to its conditional independence assumption, which limits the performance of MPL. We propose to enhance MPL by introducing intermediate losses, inspired by recent advances in CTC-based modeling. Specifically, we focus on self-conditioned and hierarchical conditional CTC, which apply auxiliary CTC losses to intermediate layers such that the conditional independence assumption is explicitly relaxed. We also explore how pseudo-labels should be generated and used as supervision for the intermediate losses. Experimental results in different semi-supervised settings demonstrate that the proposed approach outperforms MPL and improves an ASR model by up to 12.1% absolute. In addition, our detailed analysis validates the importance of the intermediate loss.
    Comment: Submitted to ICASSP202
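    The auxiliary supervision at intermediate layers can be illustrated with a minimal PyTorch sketch: a shared CTC head is applied both to an intermediate encoder layer and to the final layer, and the two CTC losses are mixed. The layer choice, the 0.5 weight, and the module sizes are illustrative assumptions, not the paper's configuration.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class InterCTCEncoder(nn.Module):
            def __init__(self, dim=64, vocab=30, n_layers=6, inter_layer=3):
                super().__init__()
                self.layers = nn.ModuleList(
                    nn.TransformerEncoderLayer(dim, 4, 128, batch_first=True)
                    for _ in range(n_layers))
                self.head = nn.Linear(dim, vocab)      # shared CTC output head
                self.inter_layer = inter_layer

            def forward(self, x):
                inter_logits = None
                for i, layer in enumerate(self.layers, 1):
                    x = layer(x)
                    if i == self.inter_layer:
                        inter_logits = self.head(x)    # auxiliary CTC branch
                return self.head(x), inter_logits

        def ctc_nll(logits, targets, in_lens, tgt_lens):
            logp = logits.log_softmax(-1).transpose(0, 1)  # (T, B, V) for ctc_loss
            return F.ctc_loss(logp, targets, in_lens, tgt_lens, blank=0)

        enc = InterCTCEncoder()
        x = torch.randn(2, 40, 64)             # (B, T, D) acoustic features
        y = torch.randint(1, 30, (2, 10))      # labels (pseudo-labels under MPL)
        in_lens, tgt_lens = torch.full((2,), 40), torch.full((2,), 10)
        final, inter = enc(x)
        loss = ctc_nll(final, y, in_lens, tgt_lens) \
            + 0.5 * ctc_nll(inter, y, in_lens, tgt_lens)   # illustrative weight
        loss.backward()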

    BECTRA: Transducer-based End-to-End ASR with BERT-Enhanced Encoder

    We present BERT-CTC-Transducer (BECTRA), a novel end-to-end automatic speech recognition (E2E-ASR) model formulated as a transducer with a BERT-enhanced encoder. Integrating a large-scale pre-trained language model (LM) into E2E-ASR has been actively studied with the aim of utilizing versatile linguistic knowledge to generate accurate text. One crucial factor that makes this integration challenging is the vocabulary mismatch: the vocabulary constructed for a pre-trained LM is generally too large for E2E-ASR training and is likely to mismatch a target ASR domain. To overcome this issue, we propose BECTRA, an extended version of our previous BERT-CTC, which realizes BERT-based E2E-ASR using a vocabulary of interest. BECTRA is a transducer-based model that adopts BERT-CTC as its encoder and trains an ASR-specific decoder using a vocabulary suitable for the target task. Combining the transducer and BERT-CTC, we also propose a novel inference algorithm that takes advantage of both autoregressive and non-autoregressive decoding. Experimental results on several ASR tasks, varying in amount of data, speaking style, and language, demonstrate that BECTRA outperforms BERT-CTC by effectively dealing with the vocabulary mismatch while exploiting BERT knowledge.
    Comment: Submitted to ICASSP202
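    A minimal transducer sketch in the spirit of this description: encoder states (here random tensors standing in for the BERT-CTC encoder output) are combined with a prediction network over a task-specific vocabulary via a joint network. All names and dimensions are illustrative assumptions, not the paper's architecture.

        import torch
        import torch.nn as nn

        class TinyTransducer(nn.Module):
            def __init__(self, dim=64, vocab=100):   # vocab chosen for the target task
                super().__init__()
                self.embed = nn.Embedding(vocab, dim)
                self.pred = nn.LSTM(dim, dim, batch_first=True)   # prediction network
                self.joint = nn.Sequential(nn.Linear(2 * dim, dim), nn.Tanh(),
                                           nn.Linear(dim, vocab))

            def forward(self, enc, labels):
                dec, _ = self.pred(self.embed(labels))            # (B, U, D)
                # joint over all (t, u) pairs: broadcast encoder and decoder states
                t = enc.unsqueeze(2)                              # (B, T, 1, D)
                u = dec.unsqueeze(1)                              # (B, 1, U, D)
                lattice = torch.cat([t.expand(-1, -1, u.size(2), -1),
                                     u.expand(-1, t.size(1), -1, -1)], dim=-1)
                return self.joint(lattice)                        # (B, T, U, vocab)

        enc = torch.randn(2, 30, 64)        # stand-in for BERT-CTC encoder output
        labels = torch.randint(0, 100, (2, 8))
        logits = TinyTransducer()(enc, labels)
        print(logits.shape)                 # torch.Size([2, 30, 8, 100])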

    Mask-CTC-based Encoder Pre-training for Streaming End-to-End Speech Recognition

    Achieving high accuracy with low latency has always been a challenge in streaming end-to-end automatic speech recognition (ASR) systems. By attending to more future context, a streaming ASR model achieves higher accuracy at the cost of larger latency, which hurts streaming performance. In the Mask-CTC framework, an encoder network is trained to learn feature representations that anticipate long-term context, which is desirable for streaming ASR. Mask-CTC-based encoder pre-training has been shown to be beneficial in achieving low latency and high accuracy for triggered attention-based ASR. However, the effectiveness of this method has not been demonstrated for various model architectures, nor has it been verified that the encoder has the expected look-ahead capability to reduce latency. This study therefore examines the effectiveness of Mask-CTC-based pre-training for models with different architectures, such as the Transformer-Transducer and contextual block streaming ASR. We also discuss the effect of the proposed pre-training method on obtaining accurate output spike timing.
    Comment: Accepted to EUSIPCO 202
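    The masked-token objective underlying Mask-CTC-style training can be sketched as follows: ground-truth tokens are randomly replaced with a <mask> id, and a decoder conditioned on encoder states is trained to recover them. The masking ratio and module sizes below are assumptions for illustration only.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        VOCAB, MASK, DIM = 100, 99, 64   # reserve the last id as <mask>

        decoder_layer = nn.TransformerDecoderLayer(DIM, 4, 128, batch_first=True)
        decoder = nn.TransformerDecoder(decoder_layer, num_layers=2)
        embed, out = nn.Embedding(VOCAB, DIM), nn.Linear(DIM, VOCAB)

        def masked_predict_loss(enc, labels, mask_ratio=0.3):
            mask = torch.rand(labels.shape) < mask_ratio    # which tokens to hide
            masked = labels.masked_fill(mask, MASK)
            hidden = decoder(embed(masked), enc)            # attend to acoustics
            logits = out(hidden)
            # loss only on masked positions, as in conditional masked LM training
            return F.cross_entropy(logits[mask], labels[mask])

        enc = torch.randn(2, 40, DIM)           # encoder output to be pre-trained
        labels = torch.randint(0, 99, (2, 12))
        loss = masked_predict_loss(enc, labels)
        loss.backward()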

    Issues and Solutions in Introducing Western Systems to the Pre-hospital Care System in Japan

    Objective: This report aims to illustrate the history and current status of Japanese emergency medical services (EMS), including the development of the specialty and characteristics adapted from the U.S. and European models. In addition, recommendations are made for improvement of the current systems.
    Methods: Government reports and academic papers were reviewed, along with the collective experiences of the authors. Literature searches were performed in PubMed (English) and Ichushi (Japanese), using keywords such as emergency medicine and pre-hospital care. More recent and peer-reviewed articles were given priority in the selection process.
    Results: The pre-hospital care system in Japan has developed as a mixture of the U.S. and European systems. Other countries undergoing economic and industrial development similar to Japan's may benefit from emulating the Japanese EMS model.
    Discussion: Currently, the Japanese system is in transition, searching for the most suitable and efficient way of providing quality pre-hospital care.
    Conclusion: Japan has the potential to enhance its current pre-hospital care system, but this will require greater collaboration between physicians and paramedics, an increased paramedic scope of medical practice, and greater Japanese societal recognition and support of paramedics. [WestJEM. 2008;9:166-170.]

    Conversation-oriented ASR with multi-look-ahead CBS architecture

    During conversations, humans are capable of inferring the intention of the speaker at any point in the speech to promptly prepare the following action. Such ability is also key for conversational systems to achieve rhythmic and natural conversation. To this end, the automatic speech recognition (ASR) used for transcribing the speech in real time must achieve high accuracy without delay. In streaming ASR, high accuracy is assured by attending to look-ahead frames, which leads to increased delay. To tackle this trade-off, we propose a multiple-latency streaming ASR system that achieves high accuracy with zero look-ahead. The proposed system contains two encoders that operate in parallel: a primary encoder generates accurate outputs utilizing look-ahead frames, and an auxiliary encoder recognizes the look-ahead portion of the primary encoder without look-ahead. The proposed system is constructed on the contextual block streaming (CBS) architecture, which leverages block processing and has a high affinity for the multiple-latency architecture. Various methods for architecting the system are also studied, including shifting the network to perform as the different encoders, as well as generating both encoders' outputs in one encoding pass.
    Comment: Submitted to ICASSP202
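    The latency trade-off at the heart of this design can be made concrete with a small sketch of streaming self-attention masks: the primary encoder uses a mask with look-ahead frames, while the auxiliary encoder uses a zero-look-ahead (causal) mask. Block processing and the other CBS details are omitted here; the mask construction is a generic illustration, not the paper's exact formulation.

        import torch

        def streaming_mask(T: int, lookahead: int) -> torch.Tensor:
            """True = attention allowed. Frame t may see frames <= t + lookahead."""
            idx = torch.arange(T)
            return idx.unsqueeze(0) <= idx.unsqueeze(1) + lookahead

        primary_mask = streaming_mask(6, lookahead=2)   # accurate but delayed output
        auxiliary_mask = streaming_mask(6, lookahead=0) # immediate, covers the gap
        print(primary_mask.int())
        print(auxiliary_mask.int())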

    BERT Meets CTC: New Formulation of End-to-End Speech Recognition with Pre-trained Masked Language Model

    This paper presents BERT-CTC, a novel formulation of end-to-end speech recognition that adapts BERT for connectionist temporal classification (CTC). Our formulation relaxes the conditional independence assumption used in conventional CTC and incorporates linguistic knowledge through the explicit output dependency obtained from BERT contextual embeddings. BERT-CTC attends to the full contexts of the input and hypothesized output sequences via the self-attention mechanism. This mechanism encourages a model to learn inner- and inter-dependencies between the audio and token representations while maintaining CTC's training efficiency. During inference, BERT-CTC combines a mask-predict algorithm with CTC decoding, which iteratively refines an output sequence. The experimental results reveal that BERT-CTC improves over conventional approaches across variations in speaking styles and languages. Finally, we show that the semantic representations in BERT-CTC are beneficial for downstream spoken language understanding tasks.
    Comment: v1: Accepted to Findings of EMNLP2022; v2: Minor corrections and a clearer derivation of Eq. (21)
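    The mask-predict inference loop described here can be sketched as follows: decoding starts from an all-<mask> sequence, fills every position, then re-masks the least confident predictions and repeats with a linearly decaying mask count. The scoring model below is a random placeholder standing in for the network; the actual BERT-CTC combines this loop with CTC decoding.

        import torch

        VOCAB, MASK = 100, 99

        def fill_fn(tokens):
            """Placeholder for the model: returns per-position log-probabilities."""
            return torch.log_softmax(torch.randn(tokens.size(0), VOCAB), dim=-1)

        def mask_predict(length=8, iterations=4):
            tokens = torch.full((length,), MASK)
            for it in range(iterations):
                logp = fill_fn(tokens)
                conf, pred = logp.max(dim=-1)          # best token per position
                tokens = pred
                n_mask = int(length * (iterations - 1 - it) / iterations)
                if n_mask > 0:                         # linearly decay the mask count
                    worst = conf.topk(n_mask, largest=False).indices
                    tokens[worst] = MASK               # re-mask low-confidence tokens
            return tokens

        print(mask_predict())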

    Effects of Coffee Intake on Oxidative Stress During Aging-related Alterations in Periodontal Tissue

    Background/aim: The purpose of this study was to determine the anti-aging effects of coffee intake on oxidative stress in rat periodontal tissue and on alveolar bone loss.
    Materials and methods: Male Fischer 344 rats (8 weeks old) were randomized to four groups: a baseline group sacrificed immediately, a control group fed normal powdered food for 8 weeks, and two experimental groups fed powdered food containing 0.62% or 1.36% coffee components for 8 weeks.
    Results: Alveolar bone loss and the gingival level of 8-hydroxydeoxyguanosine were significantly lower in the 1.36% coffee group than in the control group. Nuclear factor erythroid 2-related factor 2 (Nrf2) translocation to the nucleus was significantly higher in the 1.36% coffee group than in the control group.
    Conclusion: Continuous intake of 1.36% coffee could prevent age-related oxidative stress in the periodontal tissue and alveolar bone loss, possibly by up-regulating the Nrf2 signaling pathway.