Deep Learning for Audio Signal Processing
Given the recent surge in developments of deep learning, this article
provides a review of the state-of-the-art deep learning techniques for audio
signal processing. Speech, music, and environmental sound processing are
considered side-by-side, in order to point out similarities and differences
between the domains, highlighting general methods, problems, key references,
and potential for cross-fertilization between areas. The dominant feature
representations (in particular, log-mel spectra and raw waveform) and deep
learning models are reviewed, including convolutional neural networks, variants
of the long short-term memory architecture, as well as more audio-specific
neural network models. Subsequently, prominent deep learning application areas
are covered, i.e. audio recognition (automatic speech recognition, music
information retrieval, environmental sound detection, localization and
tracking) and synthesis and transformation (source separation, audio
enhancement, generative models for speech, sound, and music synthesis).
Finally, key issues and future questions regarding deep learning applied to
audio signal processing are identified.
Comment: 15 pages, 2 pdf figures
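To make the dominant log-mel feature representation concrete, here is a minimal sketch of extracting log-mel features from a raw waveform with the librosa library; the file name, sampling rate, and filterbank parameters are illustrative assumptions, not values prescribed by the article.

import numpy as np
import librosa

# Load a waveform at 16 kHz (the path is a placeholder).
waveform, sr = librosa.load("example.wav", sr=16000)

# 25 ms windows, 10 ms hop, and 80 mel bands are common ASR choices.
mel = librosa.feature.melspectrogram(
    y=waveform, sr=sr, n_fft=400, hop_length=160, n_mels=80)

# Log compression gives the log-mel spectrum typically fed to CNN or
# LSTM front ends; raw-waveform models skip this step entirely.
log_mel = np.log(mel + 1e-6)
print(log_mel.shape)  # (80, num_frames)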
Unified End-to-End Speech Recognition and Endpointing for Fast and Efficient Speech Systems
Automatic speech recognition (ASR) systems typically rely on an external
endpointer (EP) model to identify speech boundaries. In this work, we propose a
method to jointly train the ASR and EP tasks in a single end-to-end (E2E)
multitask model, improving EP quality by optionally leveraging information from
the ASR audio encoder. We introduce a "switch" connection, which trains the EP
to consume either the audio frames directly or low-level latent representations
from the ASR model. This results in a single E2E model that can be used during
inference to perform frame filtering at low cost, and also make high quality
end-of-query (EOQ) predictions based on ongoing ASR computation. We present
results on a voice search test set showing that, compared to separate
single-task models, this approach reduces median endpoint latency by 120 ms
(30.8% reduction), and 90th percentile latency by 170 ms (23.0% reduction),
without regressing word error rate. For continuous recognition, WER improves by
10.6% (relative).
Comment: To be published in Spoken Language Technology Workshop (SLT) 202
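The following is a minimal PyTorch sketch of how such a "switch" connection between input branches could be wired; the module names, dimensions, and two-way branch selection are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn

class EndpointerWithSwitch(nn.Module):
    # Endpointer that can consume either raw audio frames or low-level
    # latent representations produced by an ASR audio encoder.
    def __init__(self, feat_dim=80, latent_dim=512, hidden_dim=128):
        super().__init__()
        self.from_audio = nn.Linear(feat_dim, hidden_dim)
        self.from_latent = nn.Linear(latent_dim, hidden_dim)
        self.rnn = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 2)  # speech vs. end-of-query per frame

    def forward(self, frames=None, asr_latents=None):
        # The "switch": choose the input branch at train or inference time.
        x = self.from_latent(asr_latents) if asr_latents is not None \
            else self.from_audio(frames)
        out, _ = self.rnn(x)
        return self.head(out)

ep = EndpointerWithSwitch()
# Cheap frame filtering from audio alone ...
cheap_logits = ep(frames=torch.randn(1, 100, 80))
# ... or higher-quality EOQ prediction reusing ASR encoder activations.
eoq_logits = ep(asr_latents=torch.randn(1, 100, 512))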
Semantic Segmentation with Bidirectional Language Models Improves Long-form ASR
We propose a method of segmenting long-form speech by separating semantically
complete sentences within the utterance. This prevents the ASR decoder from
needlessly processing faraway context while also preventing it from missing
relevant context within the current sentence. Semantically complete sentence
boundaries are typically demarcated by punctuation in written text; but
unfortunately, spoken real-world utterances rarely contain punctuation. We
address this limitation by distilling punctuation knowledge from a
bidirectional teacher language model (LM) trained on written, punctuated text.
We compare our segmenter, which is distilled from the LM teacher, against a
segmenter distilled from an acoustic-pause-based teacher used in other works, on
a streaming ASR pipeline. The pipeline with our segmenter achieves a 3.2%
relative WER gain along with a 60 ms median end-of-segment latency reduction on
a YouTube captioning task.
Comment: Interspeech 2023. First 3 authors contributed equally.
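As a sketch of the distillation targets, assuming punctuated written text from which the teacher's labels are derived, the snippet below converts punctuation marks into per-word end-of-sentence labels that a streaming segmenter could be trained to predict; the paper distills from a bidirectional LM, which this toy helper does not model.

import re

def boundary_labels_from_punctuated_text(punctuated):
    # 1 = a semantically complete sentence ends on this word.
    words, labels = [], []
    for token in punctuated.split():
        labels.append(1 if re.search(r"[.!?]$", token) else 0)
        words.append(re.sub(r"[.!?,;:]+$", "", token).lower())
    return words, labels

# The student segmenter sees the unpunctuated words (as in spoken
# transcripts) and learns to emit the teacher-derived boundary labels.
words, labels = boundary_labels_from_punctuated_text(
    "Turn left at the light. Then drive two miles.")
print(list(zip(words, labels)))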
Text Injection for Capitalization and Turn-Taking Prediction in Speech Models
Text injection for automatic speech recognition (ASR), wherein unpaired
text-only data is used to supplement paired audio-text data, has shown
promising improvements for word error rate. This study examines the use of text
injection for auxiliary tasks, which are the non-ASR tasks often performed by
an E2E model. In this work, we use joint end-to-end and internal language model
training (JEIT) as our text injection algorithm to train an ASR model which
performs two auxiliary tasks. The first is capitalization, which is a
de-normalization task. The second is turn-taking prediction, which attempts to
identify whether a user has completed their conversation turn in a digital
assistant interaction. We show results demonstrating that our text injection
method boosts capitalization performance for long-tail data, and improves
turn-taking detection recall.
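A minimal sketch of how the combined objective might be formed is shown below, assuming per-batch losses for paired ASR data, text-only internal-LM training, and the two auxiliary tasks; the weighting scheme and values are illustrative assumptions rather than the paper's configuration.

import torch

def jeit_style_loss(asr_loss, ilm_text_loss, caps_loss, turn_loss,
                    text_weight=0.3, aux_weight=0.1):
    # Paired audio-text ASR loss plus a text-only internal-LM loss
    # (the text-injection term), plus the two auxiliary-task losses.
    return (asr_loss
            + text_weight * ilm_text_loss
            + aux_weight * (caps_loss + turn_loss))

# Dummy scalar losses just to show the combination.
total = jeit_style_loss(torch.tensor(2.3), torch.tensor(4.1),
                        torch.tensor(0.7), torch.tensor(0.5))
print(float(total))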
The Blame Game in Meeting Room ASR: An Analysis of Feature versus Model Errors in Noisy and Mismatched Conditions
Given a test waveform, state-of-the-art ASR systems extract a sequence of MFCC features and decode them with a set of trained HMMs. When this test data is clean and matches the condition used for training the models, there are few errors. While it is known that ASR systems are brittle in noisy or mismatched conditions, there has been little work on quantitatively attributing the errors to features or to models. This paper attributes the sources of these errors in three conditions: (a) matched near-field, (b) matched far-field, and (c) mismatched. We undertake a series of diagnostic analyses employing the bootstrap method to probe a meeting room ASR system. Results show that when the conditions are matched (even if they are far-field), the model errors dominate; however, in mismatched conditions the features are neither invariant nor separable, and this causes as many errors as the model does.
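A hedged sketch of the kind of bootstrap analysis used for such error attribution is given below: paired resampling of per-utterance error and word counts yields a confidence interval on the WER difference between two conditions. The function name and the 95% interval are assumptions for illustration; the paper's exact diagnostic procedure may differ.

import random

def bootstrap_wer_difference(errors_a, errors_b, words,
                             n_resamples=1000, seed=0):
    # Paired bootstrap over utterances: resample indices, recompute the
    # WER of each condition, and collect the difference.
    rng = random.Random(seed)
    n = len(words)
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        total_words = sum(words[i] for i in idx)
        wer_a = sum(errors_a[i] for i in idx) / total_words
        wer_b = sum(errors_b[i] for i in idx) / total_words
        diffs.append(wer_a - wer_b)
    diffs.sort()
    return diffs[int(0.025 * n_resamples)], diffs[int(0.975 * n_resamples)]

# Toy per-utterance error counts under two conditions.
lo, hi = bootstrap_wer_difference([3, 1, 0, 5], [2, 1, 1, 3], [10, 12, 8, 15])
print(lo, hi)  # approximate 95% interval for the WER difference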
Towards General-Purpose Text-Instruction-Guided Voice Conversion
This paper introduces a novel voice conversion (VC) model, guided by text
instructions such as "articulate slowly with a deep tone" or "speak in a
cheerful boyish voice". Unlike traditional methods that rely on reference
utterances to determine the attributes of the converted speech, our model adds
versatility and specificity to voice conversion. The proposed VC model is a
neural codec language model which processes a sequence of discrete codes,
resulting in the code sequence of converted speech. It utilizes text
instructions as style prompts to modify the prosody and emotional information
of the given speech. In contrast to previous approaches, which often rely on
employing separate encoders like prosody and content encoders to handle
different aspects of the source speech, our model handles various information
of speech in an end-to-end manner. Experiments have demonstrated the impressive
capabilities of our model in comprehending instructions and delivering
reasonable results.
Comment: Accepted to ASRU 202
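The sketch below illustrates, in PyTorch, the general shape of a neural codec language model conditioned on an instruction prompt and the source-speech code sequence; the layer sizes, vocabulary sizes, and the omission of causal masking are simplifications assumed for illustration, not details of the proposed model.

import torch
import torch.nn as nn

class InstructionGuidedCodecLM(nn.Module):
    # Toy LM over discrete codec codes, conditioned on a text
    # instruction (style prompt) and the source-speech code sequence.
    def __init__(self, text_vocab=32000, codec_vocab=1024, d_model=256):
        super().__init__()
        self.text_emb = nn.Embedding(text_vocab, d_model)
        self.code_emb = nn.Embedding(codec_vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, codec_vocab)

    def forward(self, instruction_ids, source_codes, target_codes):
        # Concatenate instruction tokens, source codes, and the target
        # codes generated so far; predict logits for the next codec
        # tokens (causal masking omitted in this toy example).
        x = torch.cat([self.text_emb(instruction_ids),
                       self.code_emb(source_codes),
                       self.code_emb(target_codes)], dim=1)
        h = self.backbone(x)
        return self.head(h[:, -target_codes.size(1):])

model = InstructionGuidedCodecLM()
logits = model(torch.randint(0, 32000, (1, 8)),   # tokenized style prompt
               torch.randint(0, 1024, (1, 50)),   # source-speech codec codes
               torch.randint(0, 1024, (1, 10)))   # target codes so far
print(logits.shape)  # torch.Size([1, 10, 1024])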
Multilingual and Fully Non-Autoregressive ASR with Large Language Model Fusion: A Comprehensive Study
In the era of large models, the autoregressive nature of decoding often
results in latency serving as a significant bottleneck. We propose a
non-autoregressive LM-fused ASR system that effectively leverages the
parallelization capabilities of accelerator hardware. Our approach combines the
Universal Speech Model (USM) and the PaLM 2 language model in per-segment
scoring mode, achieving an average relative WER improvement across all
languages of 10.8% on FLEURS and 3.6% on YouTube captioning. Furthermore, our
comprehensive ablation study analyzes key parameters such as LLM size, context
length, vocabulary size, and fusion methodology. For instance, we explore the
impact of LLM size ranging from 128M to 340B parameters on ASR performance.
This study provides valuable insights into the factors influencing the
effectiveness of practical large-scale LM-fused speech recognition systems.
Comment: ICASSP 202
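As a sketch of per-segment scoring mode, the snippet below rescores an n-best list for one segment by interpolating ASR and LM log-probabilities; the interpolation weight, the toy unigram LM stand-in, and the function names are assumptions. In a real system the LM scoring of all candidates can be batched on the accelerator, which is what makes the fusion fully non-autoregressive.

def fuse_per_segment(hypotheses, lm_score_fn, lm_weight=0.3):
    # hypotheses: list of (text, asr_logprob) pairs for one segment.
    # All candidates can be scored by the LM in parallel, so there is
    # no autoregressive dependence across hypotheses.
    scored = [(text, asr_lp + lm_weight * lm_score_fn(text))
              for text, asr_lp in hypotheses]
    return max(scored, key=lambda pair: pair[1])[0]

TOY_UNIGRAMS = {"play": -2.0, "the": -1.5, "latest": -3.0, "news": -2.5}

def toy_lm_logprob(text):
    # Tiny unigram stand-in; a real system would query a large LM such
    # as PaLM 2 to score each candidate.
    return sum(TOY_UNIGRAMS.get(w, -8.0) for w in text.split())

best = fuse_per_segment(
    [("play the latest news", -12.3), ("play the lettuce news", -11.9)],
    toy_lm_logprob)
print(best)  # the LM score overrides the slightly better ASR score of the misrecognition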