Towards High-Quality Neural TTS for Low-Resource Languages by Learning Compact Speech Representations
This paper aims to enhance low-resource TTS by reducing training data
requirements using compact speech representations. A Multi-Stage Multi-Codebook
(MSMC) VQ-GAN is trained to learn the representation, MSMCR, and decode it to
waveforms. Subsequently, we train the multi-stage predictor to predict MSMCRs
from the text for TTS synthesis. Moreover, we optimize the training strategy by
leveraging additional audio to learn better MSMCRs for low-resource languages.
This strategy selects audio from other languages using a speaker similarity
metric to augment the training set, and applies transfer learning to improve
training quality. In
MOS tests, the proposed system significantly outperforms FastSpeech and VITS in
standard and low-resource scenarios, showing lower data requirements. The
proposed training strategy effectively enhances MSMCRs for waveform
reconstruction and further improves TTS performance, winning 77% of the votes
in a preference test for low-resource TTS with only 15 minutes of paired data.
Comment: Submitted to ICASSP 202
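The multi-stage multi-codebook idea above can be made concrete with a toy quantizer. Below is a minimal numpy sketch, not the paper's implementation: each stage quantizes the residual left by the previous stage, and each stage splits the vector across several codebooks. In the actual MSMC-VQ-GAN the codebooks are learned jointly with a GAN decoder; here they are random stand-ins.

```python
# Toy multi-stage, multi-codebook quantization (illustrative only): each
# stage quantizes the residual of the previous stage, and splits the
# feature dimension across several codebooks within a stage.
import numpy as np

def quantize(x, codebook):
    """Map each row of x to its nearest codebook entry (L2 distance)."""
    d = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return codebook[d.argmin(axis=1)]

def msmc_quantize(x, stages):
    """stages: one list of codebooks per stage; codebooks within a stage
    each cover an equal slice of the feature dimension."""
    residual, quantized = x.copy(), np.zeros_like(x)
    for codebooks in stages:
        chunks = np.split(residual, len(codebooks), axis=-1)
        q = np.concatenate(
            [quantize(c, cb) for c, cb in zip(chunks, codebooks)], axis=-1)
        quantized += q
        residual -= q
    return quantized  # compact, MSMCR-style representation

rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 8))          # stand-in for encoder outputs
stages = [[rng.normal(size=(16, 4)) for _ in range(2)] for _ in range(2)]
print(msmc_quantize(feats, stages).shape)  # (100, 8)
```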
A Review of Deep Learning Techniques for Speech Processing
The field of speech processing has undergone a transformative shift with the
advent of deep learning. The use of multiple processing layers has enabled the
creation of models capable of extracting intricate features from speech data.
This development has paved the way for unparalleled advances in automatic
speech recognition, text-to-speech synthesis, and emotion recognition,
propelling the performance of these tasks to unprecedented
heights. The power of deep learning techniques has opened up new avenues for
research and innovation in the field of speech processing, with far-reaching
implications for a range of industries and applications. This review paper
provides a comprehensive overview of the key deep learning models and their
applications in speech-processing tasks. We begin by tracing the evolution of
speech processing research, from early approaches, such as MFCC and HMM, to
more recent advances in deep learning architectures, such as CNNs, RNNs,
transformers, conformers, and diffusion models. We categorize the approaches
and compare their strengths and weaknesses for solving speech-processing tasks.
Furthermore, we extensively cover various speech-processing tasks, datasets,
and benchmarks used in the literature and describe how different deep-learning
networks have been utilized to tackle these tasks. Additionally, we discuss the
challenges and future directions of deep learning in speech processing,
including the need for more parameter-efficient, interpretable models and the
potential of deep learning for multimodal speech processing. By examining the
field's evolution, comparing and contrasting different approaches, and
highlighting future directions and challenges, we hope to inspire further
research in this exciting and rapidly advancing field.
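As a pointer to the "early approaches" the review starts from, the classical MFCC front-end takes only a few lines with librosa; this is a generic sketch, not code from the review, and the example clip is librosa's own.

```python
# Classical MFCC front-end, as referenced among the early approaches.
import librosa

y, sr = librosa.load(librosa.ex("trumpet"))         # bundled example clip
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # (13, num_frames)
delta = librosa.feature.delta(mfcc)                 # first-order dynamics
print(mfcc.shape, delta.shape)
```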
HANSEN: Human and AI Spoken Text Benchmark for Authorship Analysis
Authorship Analysis, also known as stylometry, has been an essential aspect
of Natural Language Processing (NLP) for a long time. Moreover, the recent
advancement of Large Language Models (LLMs) has made authorship analysis
increasingly crucial for distinguishing between human-written and AI-generated
texts. However, these authorship analysis tasks have primarily been focused on
written texts, not considering spoken texts. Thus, we introduce the largest
benchmark for spoken texts - HANSEN (Human ANd ai Spoken tExt beNchmark).
HANSEN encompasses meticulous curation of existing speech datasets accompanied
by transcripts, alongside the creation of novel AI-generated spoken text
datasets. Together, it comprises 17 human datasets and AI-generated spoken
texts created using three prominent LLMs: ChatGPT, PaLM2, and Vicuna13B. To
evaluate and demonstrate the utility of HANSEN, we perform Authorship
Attribution (AA) & Author Verification (AV) on human-spoken datasets and
conduct human vs. AI spoken text detection using state-of-the-art (SOTA)
models. While SOTA methods, such as character n-gram and Transformer-based
models, exhibit AA & AV performance on human-spoken datasets comparable to
that on written ones, there is much room for improvement in AI-generated
spoken text
detection. The HANSEN benchmark is available at:
https://huggingface.co/datasets/HANSEN-REPO/HANSEN.
Comment: 9 pages, EMNLP-23 findings, 5 pages appendix, 6 figures, 17 tables
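For orientation, a character n-gram authorship-attribution baseline of the kind the abstract cites can be built with scikit-learn; the texts and author labels below are toy stand-ins, not HANSEN data.

```python
# Character n-gram authorship-attribution baseline (toy data).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["well you know I was thinking about it",
         "so basically it works like this"]
authors = ["speaker_a", "speaker_b"]  # stand-ins for HANSEN author labels

model = make_pipeline(
    TfidfVectorizer(analyzer="char", ngram_range=(2, 4)),  # char 2- to 4-grams
    LogisticRegression(max_iter=1000),
)
model.fit(texts, authors)
print(model.predict(["well you know what I mean"]))
```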
About Voice: A Longitudinal Study of Speaker Recognition Dataset Dynamics
Like face recognition, speaker recognition is widely used for voice-based
biometric identification in a broad range of industries, including banking,
education, recruitment, immigration, law enforcement, healthcare, and
well-being. However, while dataset evaluations and audits have improved data
practices in computer vision and face recognition, the data practices in
speaker recognition have gone largely unquestioned. Our research aims to
address this gap by exploring how dataset usage has evolved over time and what
implications this has on bias and fairness in speaker recognition systems.
Previous studies have demonstrated the presence of historical, representation,
and measurement biases in popular speaker recognition benchmarks. In this
paper, we present a longitudinal study of speaker recognition datasets used for
training and evaluation from 2012 to 2021. We survey close to 700 papers to
investigate community adoption of datasets and changes in usage over a crucial
time period where speaker recognition approaches transitioned to the widespread
adoption of deep neural networks. Our study identifies the most commonly used
datasets in the field, examines their usage patterns, and assesses their
attributes that affect bias, fairness, and other ethical concerns. Our findings
suggest areas for further research on the ethics and fairness of speaker
recognition technology.
Comment: 14 pages (23 with References and Appendix)
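The longitudinal tallying the study describes amounts to counting dataset mentions per year across the surveyed papers. A minimal pandas sketch with illustrative records (not the study's data) follows.

```python
# Illustrative tally of dataset usage per year, as in a longitudinal survey.
import pandas as pd

records = pd.DataFrame({  # one row per (paper, dataset) mention; toy values
    "year":    [2012, 2015, 2018, 2018, 2021, 2021],
    "dataset": ["NIST SRE", "NIST SRE", "VoxCeleb", "NIST SRE",
                "VoxCeleb", "VoxCeleb"],
})
usage = records.groupby(["year", "dataset"]).size().unstack(fill_value=0)
print(usage)  # papers per dataset per year -> adoption trend over time
```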
Adaptation Algorithms for Neural Network-Based Speech Recognition: An Overview
We present a structured overview of adaptation algorithms for neural
network-based speech recognition, considering both hybrid hidden Markov model /
neural network systems and end-to-end neural network systems, with a focus on
speaker adaptation, domain adaptation, and accent adaptation. The overview
characterizes adaptation algorithms as based on embeddings, model parameter
adaptation, or data augmentation. We present a meta-analysis of the performance
of speech recognition adaptation algorithms, based on relative error rate
reductions as reported in the literature.
Comment: Submitted to IEEE Open Journal of Signal Processing. 30 pages, 27
figures
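The meta-analysis metric mentioned above, relative error rate reduction, is simple to state; a short sketch assuming the standard definition (the abstract does not reproduce a formula):

```python
# Relative error rate reduction (RERR) between baseline and adapted WERs,
# assuming the standard definition.
def relative_error_reduction(wer_baseline: float, wer_adapted: float) -> float:
    """RERR = (baseline - adapted) / baseline, as a percentage."""
    return 100.0 * (wer_baseline - wer_adapted) / wer_baseline

# e.g. adaptation dropping WER from 12.0% to 10.5% gives a 12.5% RERR
print(relative_error_reduction(12.0, 10.5))  # 12.5
```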
End-to-end Online Speaker Diarization with Target Speaker Tracking
This paper proposes an online target speaker voice activity detection system
for speaker diarization tasks, which does not require a priori knowledge from
a clustering-based diarization system to obtain the target speaker
embeddings. By adapting the conventional target speaker voice activity
detection for real-time operation, this framework can identify speaker
activities using self-generated embeddings, resulting in consistent performance
without permutation inconsistencies in the inference phase. During the
inference process, we employ a front-end model to extract the frame-level
speaker embeddings for each incoming block of the signal. Next, we predict the
detection state of each speaker based on these frame-level speaker embeddings
and the previously estimated target speaker embedding. Then, the target speaker
embeddings are updated by aggregating these frame-level speaker embeddings
according to the predictions in the current block. Our model predicts the
results for each block and updates the target speakers' embeddings until
reaching the end of the signal. Experimental results show that the proposed
method outperforms the offline clustering-based diarization system on the
DIHARD III and AliMeeting datasets. The proposed method is further extended to
multi-channel data, achieving performance comparable to state-of-the-art
offline diarization systems.
Comment: Submitted to IEEE/ACM Transactions on Audio, Speech, and Language
Processing
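The block-wise loop the abstract walks through (extract frame-level embeddings, predict per-speaker activity, then update the target embeddings from the attributed frames) can be sketched in a few lines of numpy. The thresholded cosine detector and running-average update below are illustrative stand-ins for the paper's learned models.

```python
# Minimal sketch of block-wise online target-speaker tracking (illustrative;
# the embedding extractor and detector stand in for the paper's models).
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

def diarize_online(blocks, extract_embeddings, targets,
                   thresh=0.5, momentum=0.9):
    """blocks: iterable of signal blocks; targets: (num_spk, dim) running
    target-speaker embeddings, e.g. initialized from the first block."""
    results = []
    for block in blocks:
        frames = extract_embeddings(block)            # (num_frames, dim)
        # detection state: which speakers are active in which frames
        active = np.array([[cosine(f, t) > thresh for t in targets]
                           for f in frames])          # (num_frames, num_spk)
        results.append(active)
        # update each target from the frames attributed to that speaker
        for s in range(len(targets)):
            if active[:, s].any():
                new = frames[active[:, s]].mean(axis=0)
                targets[s] = momentum * targets[s] + (1 - momentum) * new
    return results

rng = np.random.default_rng(0)
targets = rng.normal(size=(2, 16))                  # initial speaker embeddings
blocks = [rng.normal(size=(50, 16)) for _ in range(3)]
out = diarize_online(blocks, lambda b: b, targets)  # identity "extractor"
print(out[0].shape)  # (50, 2) per-frame, per-speaker activity
```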