Overlapped Speech Detection in Multi-Party Meetings
Detection of simultaneous speech in meeting recordings is a difficult problem, due both to the complexity of the meeting itself and to the surrounding acoustic environment. The proposed system uses gammatone-like spectrogram-based linear predictor coefficients computed on distant-microphone channel data for overlap detection. The framework uses the Augmented Multi-party Interaction (AMI) meeting corpus to assess model performance. The proposed system improves classification over baseline feature-set models.
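A minimal sketch of this feature pipeline, assuming scipy's gammatone filter design and a textbook Levinson-Durbin recursion; the band count, frame size, and LPC order are illustrative choices rather than the paper's settings:

```python
import numpy as np
from scipy.signal import gammatone, lfilter

def levinson_lpc(r, order):
    """LPC coefficients from an autocorrelation sequence (Levinson-Durbin)."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0] + 1e-10
    for i in range(1, order + 1):
        k = -(r[i] + np.dot(a[1:i], r[i - 1:0:-1])) / err
        a[:i + 1] = a[:i + 1] + k * a[:i + 1][::-1]
        err *= (1.0 - k * k)
    return a[1:]

def gammatone_lpc_features(y, fs=16000, n_bands=32, frame_len=400,
                           hop=160, lpc_order=12):
    """Gammatone sub-band envelopes, then per-frame LPC over the
    log-energy "spectrum" across bands (parameters are assumptions)."""
    centres = np.geomspace(100.0, 0.45 * fs, n_bands)  # ERB-like spacing
    env = np.stack([np.abs(lfilter(*gammatone(fc, 'iir', fs=fs), y))
                    for fc in centres])                # (n_bands, n_samples)
    feats = []
    for start in range(0, y.size - frame_len + 1, hop):
        spec = np.log(env[:, start:start + frame_len].mean(axis=1) + 1e-8)
        r = np.correlate(spec, spec, 'full')[n_bands - 1:n_bands + lpc_order]
        feats.append(levinson_lpc(r, lpc_order))
    return np.stack(feats)                             # (n_frames, lpc_order)
```

A frame-level overlap/non-overlap classifier can then be trained on these vectors with any standard sequence model.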
PP-MeT: a Real-world Personalized Prompt based Meeting Transcription System
Speaker-attributed automatic speech recognition (SA-ASR) improves the
accuracy and applicability of multi-speaker ASR systems in real-world scenarios
by assigning speaker labels to transcribed texts. However, SA-ASR poses unique
challenges due to factors such as speaker overlap, speaker variability,
background noise, and reverberation. In this study, we propose the PP-MeT system, a
real-world personalized prompt-based meeting transcription system, which
consists of a clustering system, target-speaker voice activity detection
(TS-VAD), and TS-ASR. Specifically, we use the target-speaker embedding as a prompt in the TS-VAD and TS-ASR modules of our proposed system. In contrast with previous systems, we fully leverage pre-trained models for system
initialization, thereby bestowing our approach with heightened generalizability
and precision. Experiments on the M2MeT2.0 Challenge dataset show that our system achieves a cp-CER of 11.27% on the test set, ranking first in both the fixed and open training conditions.
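A minimal sketch of the prompting idea, conditioning a frame-level VAD network on a target-speaker embedding; the concatenation-based fusion, the layer sizes, and the `PromptTSVAD` name are illustrative assumptions, not the PP-MeT implementation:

```python
import torch
import torch.nn as nn

class PromptTSVAD(nn.Module):
    """Per-frame speech activity for one target speaker, conditioned on
    that speaker's embedding used as a prompt (sketch; sizes assumed)."""
    def __init__(self, feat_dim=80, spk_dim=256, hidden=256):
        super().__init__()
        self.proj = nn.Linear(feat_dim + spk_dim, hidden)
        layer = nn.TransformerEncoderLayer(d_model=hidden, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(hidden, 1)

    def forward(self, feats, spk_emb):
        # feats: (B, T, feat_dim) acoustic features
        # spk_emb: (B, spk_dim) target-speaker embedding (the "prompt")
        prompt = spk_emb.unsqueeze(1).expand(-1, feats.size(1), -1)
        x = self.proj(torch.cat([feats, prompt], dim=-1))
        return torch.sigmoid(self.head(self.encoder(x))).squeeze(-1)

probs = PromptTSVAD()(torch.randn(2, 500, 80), torch.randn(2, 256))  # (2, 500)
```

The same conditioning pattern can, in principle, be fed to a TS-ASR encoder so that both modules share one embedding extractor.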
Frame-wise streaming end-to-end speaker diarization with non-autoregressive self-attention-based attractors
This work proposes a frame-wise online/streaming end-to-end neural
diarization (FS-EEND) method in a frame-in-frame-out fashion. To detect a flexible number of speakers frame by frame and to extract/update their corresponding attractors, we propose to leverage a causal speaker embedding encoder and an
online non-autoregressive self-attention-based attractor decoder. A look-ahead
mechanism is adopted to allow leveraging some future frames for effectively
detecting new speakers in real time and adaptively updating speaker attractors.
The proposed method processes the audio stream frame by frame, with a low inference latency induced only by the look-ahead frames. Experiments show that, compared with recently proposed block-wise online methods, FS-EEND achieves state-of-the-art diarization results with low inference latency and computational cost.
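A minimal sketch of the look-ahead mechanism expressed as an attention mask, where each frame may attend to itself, the past, and a fixed number of future frames; the window size and the PyTorch masking convention used here are illustrative assumptions:

```python
import torch

def lookahead_mask(T: int, lookahead: int) -> torch.Tensor:
    """Boolean (T, T) mask; True marks blocked positions (PyTorch
    convention): frame t may attend only to frames <= t + lookahead."""
    idx = torch.arange(T)
    return idx.unsqueeze(0) > idx.unsqueeze(1) + lookahead

attn = torch.nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
x = torch.randn(1, 6, 64)                       # 6 frames of embeddings
out, _ = attn(x, x, x, attn_mask=lookahead_mask(6, lookahead=2))
```

With such a mask the algorithmic latency is the look-ahead window times the frame hop; a full streaming implementation would also bound the left context it caches.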
Robust End-to-End Diarization with Domain Adaptive Training and Multi-Task Learning
Given the scarcity of publicly available diarization data, model performance can be improved by training a single model on data from multiple domains. In this work, we propose to incorporate domain information to train a
single end-to-end diarization model for multiple domains. First, we employ
domain adaptive training with parameter-efficient adapters for on-the-fly model
reconfiguration. Second, we introduce an auxiliary domain classification task
to make the diarization model more domain-aware. For seen domains, the
combination of our proposed methods reduces the absolute DER from 17.66% to
16.59% when compared with the baseline. During inference, adapters from
ground-truth domains are not available for unseen domains. We demonstrate that our model exhibits stronger generalizability to unseen domains when the adapters are removed. For two unseen domains, this improves the DER over the baseline from 39.91% to 23.09% and from 25.32% to 18.76%, respectively.
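A minimal sketch of the two ingredients, assuming a standard bottleneck adapter selected per domain and an auxiliary domain-classification head with an illustrative loss weight; the GRU backbone and all sizes are placeholders rather than the paper's architecture:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Adapter(nn.Module):
    """Residual bottleneck adapter (sketch; dimensions assumed)."""
    def __init__(self, dim=256, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

class DomainAwareDiarizer(nn.Module):
    def __init__(self, dim=256, n_domains=5, n_spk=4):
        super().__init__()
        self.backbone = nn.GRU(dim, dim, batch_first=True)
        self.adapters = nn.ModuleDict({str(d): Adapter(dim)
                                       for d in range(n_domains)})
        self.diar_head = nn.Linear(dim, n_spk)        # per-frame activities
        self.domain_head = nn.Linear(dim, n_domains)  # auxiliary task

    def forward(self, x, domain: int):
        h, _ = self.backbone(x)
        h = self.adapters[str(domain)](h)             # on-the-fly reconfiguration
        return self.diar_head(h), self.domain_head(h.mean(dim=1))

model = DomainAwareDiarizer()
diar_logits, dom_logits = model(torch.randn(2, 300, 256), domain=3)
targets = torch.randint(0, 2, diar_logits.shape).float()
loss = (F.binary_cross_entropy_with_logits(diar_logits, targets)
        + 0.1 * F.cross_entropy(dom_logits, torch.tensor([3, 3])))
```

Dropping the adapter branch at inference recovers the domain-agnostic path used for unseen domains.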
A Review of Deep Learning Techniques for Speech Processing
The field of speech processing has undergone a transformative shift with the
advent of deep learning. The use of multiple processing layers has enabled the
creation of models capable of extracting intricate features from speech data.
This development has paved the way for unparalleled advancements in speech recognition, text-to-speech synthesis, and emotion recognition, propelling the performance of these tasks to unprecedented
heights. The power of deep learning techniques has opened up new avenues for
research and innovation in the field of speech processing, with far-reaching
implications for a range of industries and applications. This review paper
provides a comprehensive overview of the key deep learning models and their
applications in speech-processing tasks. We begin by tracing the evolution of
speech processing research, from early approaches such as MFCC features and HMM-based models, to
more recent advances in deep learning architectures, such as CNNs, RNNs,
transformers, conformers, and diffusion models. We categorize the approaches
and compare their strengths and weaknesses for solving speech-processing tasks.
Furthermore, we extensively cover various speech-processing tasks, datasets,
and benchmarks used in the literature and describe how different deep-learning
networks have been utilized to tackle these tasks. Additionally, we discuss the
challenges and future directions of deep learning in speech processing,
including the need for more parameter-efficient, interpretable models and the
potential of deep learning for multimodal speech processing. By examining the
field's evolution, comparing and contrasting different approaches, and
highlighting future directions and challenges, we hope to inspire further
research in this exciting and rapidly advancing field.
Attention-based Encoder-Decoder End-to-End Neural Diarization with Embedding Enhancer
Deep neural network-based systems have significantly improved the performance
of speaker diarization tasks. However, end-to-end neural diarization (EEND)
systems often struggle to generalize to scenarios with an unseen number of
speakers, while target speaker voice activity detection (TS-VAD) systems tend
to be overly complex. In this paper, we propose a simple attention-based
encoder-decoder network for end-to-end neural diarization (AED-EEND). In our
training process, we introduce a teacher-forcing strategy to address the
speaker permutation problem, leading to faster model convergence. For
evaluation, we propose an iterative decoding method that outputs diarization
results for each speaker sequentially. Additionally, we propose an Enhancer
module to enhance the frame-level speaker embeddings, enabling the model to
handle scenarios with an unseen number of speakers. We also explore replacing
the transformer encoder with a Conformer architecture, which better models
local information. Furthermore, we discovered that commonly used simulation datasets for speaker diarization have a much higher overlap ratio than real data, and that using simulated training data more consistent with real data improves performance. Extensive
experimental validation demonstrates the effectiveness of our proposed
methodologies. Our best system achieved new state-of-the-art diarization error rate (DER) performance on the CALLHOME (10.08%), DIHARD II (24.64%),
and AMI (13.00%) evaluation benchmarks, when no oracle voice activity detection
(VAD) is used. Beyond speaker diarization, our AED-EEND system also shows
remarkable competitiveness as a speech type detection model.
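A rough sketch of the iterative decoding idea, emitting one speaker at a time; the stopping criterion, the frame-suppression step, and every dimension here are assumptions for illustration, not the authors' AED-EEND procedure:

```python
import torch
import torch.nn as nn

class IterativeAttractorDecoder(nn.Module):
    """Emits one speaker attractor per step via cross-attention over frame
    embeddings; stops when a new attractor claims no frames (sketch)."""
    def __init__(self, dim=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.query = nn.Parameter(torch.randn(1, 1, dim))

    @torch.no_grad()
    def decode(self, frames, max_spk=8, thresh=0.5):
        # frames: (1, T, dim) frame-level embeddings from the encoder
        attractors, activities = [], []
        for _ in range(max_spk):
            att, _ = self.attn(self.query, frames, frames)        # (1, 1, dim)
            act = torch.sigmoid(frames @ att.transpose(1, 2)).squeeze(-1)
            if (act > thresh).sum() == 0:   # no frames claimed: stop decoding
                break
            attractors.append(att)
            activities.append(act)
            frames = frames * (1 - act.unsqueeze(-1))  # suppress claimed frames
        return attractors, activities
```

During training, teacher forcing would instead feed the ground-truth speaker order, which is how the paper sidesteps the speaker permutation problem.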
An Experimental Review of Speaker Diarization methods with application to Two-Speaker Conversational Telephone Speech recordings
We performed an experimental review of current diarization systems for the
conversational telephone speech (CTS) domain. In detail, we considered a total
of eight different algorithms belonging to clustering-based, end-to-end neural
diarization (EEND), and speech separation guided diarization (SSGD) paradigms.
We studied the inference-time computational requirements and diarization
accuracy on four CTS datasets with different characteristics and languages. We
found that, among all methods considered, EEND-vector clustering (EEND-VC)
offers the best trade-off in terms of computing requirements and performance.
More generally, EEND models have been found to be lighter and faster in
inference compared to clustering-based methods. However, they also require a
large amount of diarization-oriented annotated data. In particular, EEND-VC performance in our experiments degraded when the dataset size was reduced,
whereas self-attentive EEND (SA-EEND) was less affected. We also found that
SA-EEND gives less consistent results among all the datasets compared to
EEND-VC, with its performance degrading on long conversations with high speech
sparsity. Clustering-based diarization systems, and in particular VBx, instead
have more consistent performance compared to SA-EEND but are outperformed by
EEND-VC. The gap with respect to the latter is reduced when overlap-aware clustering methods are considered. SSGD is the most computationally demanding
method, but it could be convenient if speech recognition has to be performed.
Its performance is close to that of SA-EEND but degrades significantly when the training and inference data characteristics are mismatched.
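The accuracy comparisons above are in terms of diarization error rate (DER). A minimal scoring example with pyannote.metrics, assuming that library is available; the segments are made up:

```python
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

reference = Annotation()
reference[Segment(0.0, 10.0)] = "spk_A"
reference[Segment(8.0, 15.0)] = "spk_B"      # 2 s of overlapped speech

hypothesis = Annotation()
hypothesis[Segment(0.0, 9.0)] = "s1"
hypothesis[Segment(9.0, 15.0)] = "s2"

# DER = (missed speech + false alarm + confusion) / total reference speech;
# a 0.25 s forgiveness collar is removed around reference boundaries.
metric = DiarizationErrorRate(collar=0.25)
print(f"DER = {metric(reference, hypothesis):.3f}")
```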
Energy-based Self-attentive Learning of Abstractive Communities for Spoken Language Understanding
Abstractive community detection is an important spoken language understanding
task, whose goal is to group utterances in a conversation according to whether
they can be jointly summarized by a common abstractive sentence. This paper
provides a novel approach to this task. We first introduce a neural contextual
utterance encoder featuring three types of self-attention mechanisms. We then
train it using the Siamese and triplet energy-based meta-architectures.
Experiments on the AMI corpus show that our system outperforms multiple state-of-the-art energy-based and non-energy-based baselines. Code and data are publicly available.
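A minimal sketch of the triplet variant of such training, with PyTorch's standard triplet margin loss standing in for the paper's energy functions; the feed-forward encoder and feature size are placeholders for the contextual self-attentive utterance encoder:

```python
import torch
import torch.nn as nn

# Placeholder encoder: utterances that can be summarized by the same
# abstractive sentence should map to nearby embeddings, others far apart.
encoder = nn.Sequential(nn.Linear(300, 128), nn.ReLU(), nn.Linear(128, 64))
triplet = nn.TripletMarginLoss(margin=1.0)

anchor, positive, negative = (torch.randn(8, 300) for _ in range(3))
loss = triplet(encoder(anchor), encoder(positive), encoder(negative))
loss.backward()
```

At inference, communities can be recovered by clustering the learned embeddings, e.g. with agglomerative clustering on pairwise distances.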