Self-Remixing: Unsupervised Speech Separation via Separation and Remixing
We present Self-Remixing, a novel self-supervised speech separation method,
which refines a pre-trained separation model in an unsupervised manner. The
proposed method consists of a shuffler module and a solver module, and they
grow together through separation and remixing processes. Specifically, the
shuffler first separates observed mixtures and makes pseudo-mixtures by
shuffling and remixing the separated signals. The solver then separates the
pseudo-mixtures and remixes the separated signals back to the observed
mixtures. The solver is trained using the observed mixtures as supervision,
while the shuffler's weights are updated as a moving average of the solver's, so that it
generates pseudo-mixtures with fewer distortions. Our experiments demonstrate that
Self-Remixing outperforms existing remixing-based self-supervised methods at the same or
lower training cost in an unsupervised setup. Self-Remixing also outperforms baselines in
semi-supervised domain adaptation, showing its effectiveness in multiple setups.
Comment: Accepted by ICASSP 2023, 5 pages, 2 figures, 2 tables
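To make the shuffler/solver procedure described above concrete, the following is a minimal PyTorch-style sketch of one Self-Remixing step. The model interfaces, the L1 reconstruction loss, and the assumption that the solver's output channels stay aligned with the shuffler's are illustrative choices, not the authors' implementation.

```python
# Minimal sketch of one Self-Remixing training step (illustrative only).
import torch
import torch.nn.functional as F


@torch.no_grad()
def ema_update(shuffler, solver, decay=0.999):
    # Shuffler weights track a moving average of the solver's weights.
    for p_sh, p_so in zip(shuffler.parameters(), solver.parameters()):
        p_sh.mul_(decay).add_(p_so, alpha=1.0 - decay)


def self_remixing_step(solver, shuffler, mixtures, optimizer):
    # mixtures: (batch, samples); both models return (batch, n_src, samples).
    with torch.no_grad():
        sources = shuffler(mixtures)
        n_src = sources.size(1)
        # Shuffle each source slot across the batch and remix into pseudo-mixtures.
        perms = [torch.randperm(mixtures.size(0)) for _ in range(n_src)]
        pseudo = torch.stack(
            [sources[p, i] for i, p in enumerate(perms)], dim=1).sum(dim=1)

    estimates = solver(pseudo)  # separate the pseudo-mixtures
    # Undo the shuffling and remix; the result should reconstruct the observed
    # mixtures, which serve as supervision (assumes the solver's output
    # channels remain aligned with the shuffler's).
    remixed = torch.stack(
        [estimates[torch.argsort(p), i] for i, p in enumerate(perms)], dim=1).sum(dim=1)
    loss = F.l1_loss(remixed, mixtures)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    ema_update(shuffler, solver)
    return loss.item()
```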
Remixing-based Unsupervised Source Separation from Scratch
We propose an unsupervised approach for training separation models from
scratch using RemixIT and Self-Remixing, which are recently proposed
self-supervised learning methods for refining pre-trained models. They first
separate mixtures with a teacher model and create pseudo-mixtures by shuffling
and remixing the separated signals. A student model is then trained to separate
the pseudo-mixtures using either the teacher's outputs or the initial mixtures
as supervision. To refine the teacher's outputs, the teacher's weights are
updated with the student's weights. While these methods originally assume a
pre-trained teacher, we show that they are capable of training models from scratch.
We also introduce a simple remixing method that stabilizes training.
Experimental results demonstrate that the proposed approach outperforms mixture
invariant training, which is currently the only available approach for training
a monaural separation model from scratch.
Comment: Interspeech 2023, 5 pages, 2 figures, 2 tables
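For contrast with the Self-Remixing sketch above, a RemixIT-style step, in which the teacher's shuffled estimates supervise the student, can be pictured as follows. The loss choice and momentum value are assumptions, and a permutation-invariant loss would normally replace the aligned L1 loss shown here; this is not the paper's code.

```python
# Illustrative RemixIT-style step, shown for comparison (sketch only).
import torch
import torch.nn.functional as F


def remixit_step(student, teacher, mixtures, optimizer, decay=0.99):
    with torch.no_grad():
        targets = teacher(mixtures)                      # (batch, n_src, samples)
        n_src = targets.size(1)
        # Shuffle the teacher's estimates across the batch and remix them.
        perms = [torch.randperm(mixtures.size(0)) for _ in range(n_src)]
        shuffled = torch.stack([targets[p, i] for i, p in enumerate(perms)], dim=1)
        pseudo = shuffled.sum(dim=1)                     # pseudo-mixtures

    estimates = student(pseudo)
    # The teacher's shuffled estimates supervise the student; a permutation-
    # invariant loss would normally be used instead of this aligned L1 loss.
    loss = F.l1_loss(estimates, shuffled)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():  # teacher weights are updated toward the student's
        for p_t, p_s in zip(teacher.parameters(), student.parameters()):
            p_t.mul_(decay).add_(p_s, alpha=1.0 - decay)
    return loss.item()
```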
Deep Multi-stream Network for Video-based Calving Sign Detection
We have designed a deep multi-stream network for automatically detecting
calving signs from video. Calving sign detection from a camera, which is a
non-contact sensor, is expected to enable more efficient livestock management.
Since large-scale, well-curated data cannot generally be assumed when building
calving detection systems, the basis for each prediction needs to be presented to
farmers during operation, and black-box (end-to-end) modeling is therefore not
appropriate. For practical operation of
calving detection systems, the present study aims to incorporate expert
knowledge into a deep neural network. To this end, we propose a multi-stream
calving sign detection network in which calving-related attributes with different
characteristics, such as a cow's posture, rotation, and movement (known as calving
signs), are each extracted by a dedicated feature extraction network and then
integrated adaptively depending on the cow's situation. Experimental comparisons conducted using
videos of 15 cows demonstrated that our multi-stream system yielded a
significant improvement over the end-to-end system, and the multi-stream
architecture significantly contributed to a reduction in detection errors. In
addition, the distinctive mixture weights we observed helped make the system's
behavior interpretable.
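A minimal sketch of the multi-stream idea is given below. The stream and gating architectures (simple MLPs with a softmax gate) are illustrative assumptions rather than the network described in the paper, but they show how per-sample mixture weights expose which attribute drives each decision.

```python
# Minimal sketch of a multi-stream detector with interpretable mixture weights.
import torch
import torch.nn as nn


class MultiStreamCalvingDetector(nn.Module):
    def __init__(self, feat_dims, hidden=64):
        super().__init__()
        # One stream per calving-related attribute (e.g., posture, rotation, movement).
        self.streams = nn.ModuleList(
            nn.Sequential(nn.Linear(d, hidden), nn.ReLU(), nn.Linear(hidden, 1))
            for d in feat_dims)
        # Gating network produces per-sample mixture weights over the streams,
        # which can be inspected to see which attribute drove a detection.
        self.gate = nn.Sequential(nn.Linear(sum(feat_dims), len(feat_dims)),
                                  nn.Softmax(dim=-1))

    def forward(self, feats):
        # feats: list of per-attribute features, each of shape (batch, d_i).
        scores = torch.cat([s(f) for s, f in zip(self.streams, feats)], dim=-1)
        weights = self.gate(torch.cat(feats, dim=-1))
        return (weights * scores).sum(dim=-1), weights  # calving-sign score, weights
```

Inspecting the returned weights for detected segments is what would provide the kind of interpretability described above.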
Harnessing the Zero-Shot Power of Instruction-Tuned Large Language Model in End-to-End Speech Recognition
We present a novel integration of an instruction-tuned large language model
(LLM) and end-to-end automatic speech recognition (ASR). Modern LLMs can
perform a wide range of linguistic tasks in a zero-shot setting when provided
with a precise instruction or prompt that guides the text generation
process towards the desired task. We explore using this zero-shot capability of
LLMs to extract linguistic information that can contribute to improving ASR
performance. Specifically, we direct an LLM to correct grammatical errors in an
ASR hypothesis and harness the embedded linguistic knowledge to conduct
end-to-end ASR. The proposed model is built on the hybrid connectionist
temporal classification (CTC) and attention architecture, where an
instruction-tuned LLM (i.e., Llama2) is employed as a front-end of the decoder.
An ASR hypothesis, subject to correction, is obtained from the encoder via CTC
decoding, which is then fed into the LLM along with an instruction. The decoder
subsequently takes as input the LLM embeddings to perform sequence generation,
incorporating acoustic information from the encoder output. Experimental
results and analyses demonstrate that the proposed integration yields promising
performance improvements, and our approach largely benefits from LLM-based
rescoring.
Comment: Submitted to ICASSP202
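The decoding flow can be pictured with the following conceptual sketch. The module interfaces (encoder, CTC decoding, decoder) are hypothetical placeholders, and the LLM call mimics a HuggingFace-style forward pass as an assumption; only the overall data flow follows the description above.

```python
# Conceptual sketch of the decoding flow (module interfaces are hypothetical).
import torch


def llm_guided_asr(encoder, ctc_greedy_decode, llm, decoder, speech, instruction_ids):
    enc_out = encoder(speech)             # (batch, T, d_enc) acoustic representation
    hyp_ids = ctc_greedy_decode(enc_out)  # ASR hypothesis from CTC, subject to correction
    # The instruction and the hypothesis are fed to the instruction-tuned LLM;
    # its last hidden states carry the linguistic knowledge used downstream.
    llm_input = torch.cat([instruction_ids, hyp_ids], dim=1)
    llm_emb = llm(llm_input, output_hidden_states=True).hidden_states[-1]
    # The decoder generates the final transcript from the LLM embeddings while
    # attending to the acoustic encoder output.
    return decoder(llm_emb, enc_out)
```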
BECTRA: Transducer-based End-to-End ASR with BERT-Enhanced Encoder
We present BERT-CTC-Transducer (BECTRA), a novel end-to-end automatic speech
recognition (E2E-ASR) model formulated by the transducer with a BERT-enhanced
encoder. Integrating a large-scale pre-trained language model (LM) into E2E-ASR
has been actively studied, aiming to utilize versatile linguistic knowledge for
generating accurate text. One crucial factor that makes this integration
challenging lies in the vocabulary mismatch; the vocabulary constructed for a
pre-trained LM is generally too large for E2E-ASR training and is likely to
mismatch the target ASR domain. To overcome this issue, we
propose BECTRA, an extended version of our previous BERT-CTC, that realizes
BERT-based E2E-ASR using a vocabulary of interest. BECTRA is a transducer-based
model, which adopts BERT-CTC for its encoder and trains an ASR-specific decoder
using a vocabulary suitable for a target task. With the combination of the
transducer and BERT-CTC, we also propose a novel inference algorithm for taking
advantage of both autoregressive and non-autoregressive decoding. Experimental
results on several ASR tasks, varying in amounts of data, speaking styles, and
languages, demonstrate that BECTRA outperforms BERT-CTC by effectively dealing
with the vocabulary mismatch while exploiting BERT knowledge.
Comment: Submitted to ICASSP202
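The transducer structure on top of a BERT-enhanced encoder can be sketched as follows. The module interfaces are hypothetical, and the use of torchaudio's RNN-T loss is an illustrative assumption rather than the paper's training recipe; the point is only that the decoder side operates on the task vocabulary, independent of the BERT vocabulary used inside the encoder.

```python
# Schematic sketch of a transducer over a BERT-enhanced encoder (illustrative).
import torch
import torchaudio


def bectra_loss(bert_ctc_encoder, prediction_net, joiner,
                speech, speech_lens, targets, target_lens):
    # Encoder side: BERT-CTC-style encoder over the acoustic input; the BERT
    # vocabulary is only used internally here.
    enc_out, enc_lens = bert_ctc_encoder(speech, speech_lens)       # (B, T, d), (B,)
    # Decoder side: transducer prediction network over the *task* vocabulary
    # (assumed to prepend a start token, giving U + 1 steps).
    pred_out = prediction_net(targets)                               # (B, U + 1, d)
    # Joint network combines both and outputs task-vocabulary logits.
    logits = joiner(enc_out.unsqueeze(2) + pred_out.unsqueeze(1))    # (B, T, U + 1, V)
    return torchaudio.functional.rnnt_loss(
        logits, targets.int(), enc_lens.int(), target_lens.int(), blank=0)
```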
InterMPL: Momentum Pseudo-Labeling with Intermediate CTC Loss
This paper presents InterMPL, a semi-supervised learning method of end-to-end
automatic speech recognition (ASR) that performs pseudo-labeling (PL) with
intermediate supervision. Momentum PL (MPL) trains a connectionist temporal
classification (CTC)-based model on unlabeled data by continuously generating
pseudo-labels on the fly and improving their quality. In contrast to
autoregressive formulations, such as the attention-based encoder-decoder and
transducer, CTC is well suited for MPL, or PL-based semi-supervised ASR in
general, owing to its simple/fast inference algorithm and robustness against
generating collapsed labels. However, CTC generally performs worse than the
autoregressive models due to its conditional independence assumption, which in
turn limits the performance of MPL. We propose to enhance MPL by
introducing intermediate loss, inspired by the recent advances in CTC-based
modeling. Specifically, we focus on self-conditioned and hierarchical
conditional CTC, which apply auxiliary CTC losses to intermediate layers so
that the conditional independence assumption is explicitly relaxed. We also
explore how pseudo-labels should be generated and used as supervision for
intermediate losses. Experimental results in different semi-supervised settings
demonstrate that the proposed approach outperforms MPL and improves an ASR
model by up to 12.1% absolute. In addition, our detailed analysis validates the
importance of the intermediate loss.
Comment: Submitted to ICASSP202
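One way to picture the combination of momentum pseudo-labeling and intermediate CTC losses is the sketch below. The loss weighting, the momentum value, and the interfaces that return pseudo-labels and intermediate log-probabilities are assumptions; in particular, supervising the intermediate losses with the same pseudo-labels as the final loss is only one of the options the paper explores.

```python
# Minimal sketch of an InterMPL-style update (illustrative assumptions).
import torch
import torch.nn.functional as F


def intermpl_step(student, teacher, speech, optimizer,
                  inter_weight=0.5, momentum=0.999):
    with torch.no_grad():
        # The momentum teacher generates pseudo-labels on the fly; it is
        # assumed to return padded label sequences and their lengths.
        pseudo, pseudo_lens = teacher.greedy_ctc_decode(speech)

    # The student is assumed to return final and intermediate CTC log-probs
    # of shape (T, B, V), plus the encoder frame lengths.
    final_lp, inter_lps, frame_lens = student(speech)
    ctc = lambda lp: F.ctc_loss(lp, pseudo, frame_lens, pseudo_lens, blank=0)
    loss = (1.0 - inter_weight) * ctc(final_lp)
    loss = loss + inter_weight * sum(ctc(lp) for lp in inter_lps) / len(inter_lps)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    with torch.no_grad():  # momentum update: the teacher slowly tracks the student
        for p_t, p_s in zip(teacher.parameters(), student.parameters()):
            p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)
    return loss.item()
```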
Mask-CTC-based Encoder Pre-training for Streaming End-to-End Speech Recognition
Achieving high accuracy with low latency has always been a challenge in
streaming end-to-end automatic speech recognition (ASR) systems. By attending
to more future contexts, a streaming ASR model achieves higher accuracy but
results in larger latency, which hurts the streaming performance. In the
Mask-CTC framework, an encoder network is trained to learn the feature
representation that anticipates long-term contexts, which is desirable for
streaming ASR. Mask-CTC-based encoder pre-training has been shown to be beneficial in
achieving low latency and high accuracy for triggered attention-based ASR.
However, the effectiveness of this method has not been demonstrated for various
model architectures, nor has it been verified that the encoder has the expected
look-ahead capability to reduce latency. This study, therefore, examines the
effectiveness of Mask-CTC-based pre-training for models with different
architectures, such as Transformer-Transducer and contextual block streaming
ASR. We also discuss the effect of the proposed pre-training method on
obtaining accurate output spike timing.
Comment: Accepted to EUSIPCO 202
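The pre-training objective can be pictured with the following sketch, which combines a CTC loss with a mask-predict loss over partially masked token sequences. The interfaces, the loss weighting, and the convention of marking unmasked target positions with -100 are assumptions, not the exact Mask-CTC recipe.

```python
# Schematic sketch of a Mask-CTC-style pre-training objective (illustrative).
import torch
import torch.nn.functional as F


def mask_ctc_pretrain_loss(encoder, ctc_head, mlm_decoder,
                           speech, tokens, token_lens,
                           masked_tokens, mlm_targets, ctc_weight=0.3):
    # Full-context (non-streaming) encoder over the acoustic input.
    enc_out, frame_lens = encoder(speech)                              # (B, T, d), (B,)
    # CTC branch over the ground-truth token sequence.
    ctc_lp = F.log_softmax(ctc_head(enc_out), dim=-1).transpose(0, 1)  # (T, B, V)
    ctc = F.ctc_loss(ctc_lp, tokens, frame_lens, token_lens, blank=0)
    # Mask-predict branch: recover the ground-truth tokens at masked positions
    # conditioned on the encoder output, which pushes the encoder to learn
    # representations that anticipate long-term context.
    logits = mlm_decoder(masked_tokens, enc_out)                       # (B, L, V)
    mlm = F.cross_entropy(logits.transpose(1, 2), mlm_targets, ignore_index=-100)
    return ctc_weight * ctc + (1.0 - ctc_weight) * mlm
```

After such pre-training, the encoder weights would initialize the streaming model, e.g., a Transformer-Transducer or a contextual block streaming encoder, which is the setting examined in the paper.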