Improvement of speech recognition by nonlinear noise reduction
The success of nonlinear noise reduction applied to a single-channel
recording of a human voice is measured in terms of the recognition rate of a
commercial speech recognition program in comparison to the optimal linear
filter. The overall performance of the nonlinear method is shown to be
superior. We hence demonstrate that an algorithm which has its roots in the
theory of nonlinear deterministic dynamics possesses a large potential in a
realistic application.
Comment: see urbanowicz.org.p
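The abstract does not spell the algorithm out, but methods of this family typically perform locally projective noise reduction in a delay-embedding space. The sketch below is a minimal, illustrative Python implementation of that generic scheme, not the authors' code; the embedding dimension, subspace rank, and neighbourhood size are assumed parameters.

```python
import numpy as np

def delay_embed(x, m, tau=1):
    """Delay-coordinate embedding of a scalar series into R^m."""
    n = len(x) - (m - 1) * tau
    return np.stack([x[i * tau : i * tau + n] for i in range(m)], axis=1)

def local_projection_denoise(x, m=10, q=2, k=30):
    """One pass of locally projective noise reduction (Schreiber/Kantz style).

    Each delay vector is projected onto the q-dimensional subspace spanned
    by the leading principal directions of its k nearest neighbours; the
    correction is read back at the centre coordinate and averaged.
    """
    x = np.asarray(x, dtype=float)
    X = delay_embed(x, m)
    y = x.copy()
    corr = np.zeros_like(y)
    counts = np.zeros_like(y)
    c = m // 2                                # centre coordinate to correct
    for i in range(len(X)):
        d = np.sum((X - X[i]) ** 2, axis=1)
        nbr = np.argsort(d)[:k]               # k nearest neighbours (incl. self)
        mu = X[nbr].mean(axis=0)
        C = np.cov((X[nbr] - mu).T)
        _, V = np.linalg.eigh(C)              # eigenvectors, ascending order
        P = V[:, -q:]                         # leading q directions = signal subspace
        proj = mu + P @ (P.T @ (X[i] - mu))
        corr[i + c] += proj[c] - X[i, c]
        counts[i + c] += 1
    mask = counts > 0
    y[mask] += corr[mask] / counts[mask]
    return y
```

In practice such a pass is iterated a few times with a shrinking neighbourhood before the denoised signal is fed to the recognizer and compared against the optimal linear filter mentioned above.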
Exploring efficient neural architectures for linguistic-acoustic mapping in text-to-speech
Conversion from text to speech relies on an accurate mapping from linguistic to acoustic symbol sequences, for which current practice employs recurrent statistical models such as recurrent neural networks. Despite the good performance of such models (in terms of low distortion in the generated speech), their recursive structure with intermediate affine transformations tends to make them slow to train and to sample from. In this work, we explore two different mechanisms that enhance the operational efficiency of recurrent neural networks, and study their performance–speed trade-off. The first mechanism is based on the quasi-recurrent neural network, where expensive affine transformations are removed from temporal connections and placed only on feed-forward computational directions. The second mechanism includes a module based on the transformer decoder network, designed without recurrent connections but emulating them with attention and positional codes. Our results show that the proposed decoder networks are competitive in terms of distortion when compared to a recurrent baseline, whilst being significantly faster in terms of CPU and GPU inference time. The best performing model is the one based on the quasi-recurrent mechanism, reaching the same level of naturalness as the recurrent neural network based model with a speedup of 11.2x on CPU and 3.3x on GPU.
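To make the first mechanism concrete: in a quasi-recurrent layer, all affine transformations are causal convolutions computed for every time step in parallel, and the only operation left inside the time loop is an element-wise gate recursion ("fo-pooling"). The NumPy sketch below illustrates this structure under assumed shapes and names; it is a generic quasi-recurrent layer, not the paper's exact model.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def qrnn_layer(X, Wz, Wf, Wo, width=2):
    """Generic quasi-recurrent layer.

    X: (T, d_in) input sequence; Wz, Wf, Wo: (width * d_in, d_hid) filters.
    The affine maps run in parallel over time; the recurrence is element-wise.
    """
    T, d_in = X.shape
    pad = np.vstack([np.zeros((width - 1, d_in)), X])
    # Causal convolution context: each step sees the last `width` frames.
    ctx = np.stack([pad[t : t + width].ravel() for t in range(T)])
    Z = np.tanh(ctx @ Wz)      # candidate activations
    F = sigmoid(ctx @ Wf)      # forget gates
    O = sigmoid(ctx @ Wo)      # output gates
    # fo-pooling: no matrix multiplication inside the time loop.
    c = np.zeros(Z.shape[1])
    H = np.empty_like(Z)
    for t in range(T):
        c = F[t] * c + (1.0 - F[t]) * Z[t]
        H[t] = O[t] * c
    return H
```

Because the heavy matrix products sit outside the sequential loop, training parallelizes over time and sampling avoids a per-step affine transform, which is the source of the efficiency gain this mechanism targets.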
Voice Conversion Based on Cross-Domain Features Using Variational Auto Encoders
An effective approach to non-parallel voice conversion (VC) is to utilize
deep neural networks (DNNs), specifically variational auto encoders (VAEs), to
model the latent structure of speech in an unsupervised manner. A previous
study has confirmed the effectiveness of VAEs using the STRAIGHT spectra for VC. However, VAEs using other types of spectral features, such as mel-cepstral coefficients (MCCs), which are related to human perception and have been widely used in VC, have not been properly investigated. Instead of using one specific type of spectral feature, a VAE may be expected to benefit from using multiple types of spectral features simultaneously, thereby improving its capability for VC. To this end, we propose a novel VAE framework (called cross-domain VAE, CDVAE) for VC. Specifically, the proposed framework utilizes both STRAIGHT spectra and MCCs by explicitly regularizing multiple objectives in order to constrain the behavior of the learned encoder and decoder. Experimental results demonstrate that the proposed CDVAE framework outperforms the conventional VAE framework in terms of subjective tests.
Comment: Accepted to ISCSLP 201
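One plausible concretization of "explicitly regularizing multiple objectives" is the loss sketched below: a Gaussian VAE per feature domain, cross-domain reconstruction so one latent explains both STRAIGHT spectra and MCCs, and a latent-agreement term. The module names, the latent-agreement term, and the equal weighting are assumptions for illustration, not the authors' published formulation.

```python
import torch
import torch.nn.functional as F

def cdvae_loss(enc_sp, enc_mc, dec_sp, dec_mc, sp, mc):
    """Sketch of a cross-domain VAE objective.

    enc_* : feature -> (mu, logvar); dec_* : latent -> feature.
    sp, mc: batches of STRAIGHT spectra and MCC frames.
    """
    def kl(mu, logvar):
        # KL divergence to a standard normal prior, averaged over the batch.
        return -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

    def sample(mu, logvar):
        # Reparameterization trick.
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

    mu_sp, lv_sp = enc_sp(sp)
    mu_mc, lv_mc = enc_mc(mc)
    z_sp, z_mc = sample(mu_sp, lv_sp), sample(mu_mc, lv_mc)

    # Within-domain and cross-domain reconstruction terms.
    rec = (F.mse_loss(dec_sp(z_sp), sp) + F.mse_loss(dec_mc(z_mc), mc) +
           F.mse_loss(dec_mc(z_sp), mc) + F.mse_loss(dec_sp(z_mc), sp))
    # Encourage both encoders to agree on the latent code (assumed regularizer).
    latent_match = F.mse_loss(mu_sp, mu_mc)
    return rec + kl(mu_sp, lv_sp) + kl(mu_mc, lv_mc) + latent_match
```

The cross terms are what make the latent "cross-domain": a code inferred from one spectral representation must also reconstruct the other.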
CLN-VC: Text-Free Voice Conversion Based on Fine-Grained Style Control and Contrastive Learning with Negative Samples Augmentation
Better disentanglement of speech representations is essential to improve the quality of voice conversion. Recently, contrastive learning based on speaker labels has been applied successfully to voice conversion. However, model performance degrades when converting between similar speakers. Hence, we propose augmented negative sample selection to address this issue. Specifically, we create hard negative samples based on the proposed speaker fusion module to improve the learning ability of the speaker encoder. Furthermore, for fine-grained modeling of speaker style, we employ a reference encoder to extract fine-grained style and conduct the augmented contrastive learning on global style. The experimental results show that the proposed method outperforms previous work in voice conversion tasks.
Comment: Accepted by the 21st IEEE International Symposium on Parallel and Distributed Processing with Applications (IEEE ISPA 2023)
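A hedged sketch of the negative-augmentation idea follows: hard negatives are synthesized by fusing the anchor speaker's embedding with other speakers' embeddings, then used in an InfoNCE-style contrastive loss. The fusion rule (simple interpolation) and all names are assumptions; the paper's speaker fusion module may be learned rather than fixed.

```python
import torch
import torch.nn.functional as F

def fused_negatives(anchor, others, alpha=0.5):
    """Create hard negatives by mixing the anchor speaker embedding with
    other speakers' embeddings, so negatives lie close to the anchor.
    anchor: (d,); others: (N, d)."""
    return F.normalize(alpha * anchor.unsqueeze(0) + (1 - alpha) * others, dim=-1)

def contrastive_loss(anchor, positive, others, tau=0.1):
    """InfoNCE-style loss with augmented (fused) negative samples."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negs = fused_negatives(anchor, F.normalize(others, dim=-1))
    # Similarities of the positive (index 0) and all negatives to the anchor.
    logits = torch.cat([positive.unsqueeze(0), negs]) @ anchor / tau
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```

Because fused negatives resemble the anchor more than randomly drawn speakers do, the encoder is pushed to separate even similar voices, which is exactly the failure case the abstract identifies.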
On Feature Importance and Interpretability of Speaker Representations
Unsupervised speech disentanglement aims at separating fast-varying from
slowly varying components of a speech signal. In this contribution, we take a
closer look at the embedding vector representing the slowly varying signal
components, commonly named the speaker embedding vector. We ask which properties of a speaker's voice are captured, and investigate to what extent individual embedding vector components are responsible for them, using the concept of Shapley values. Our findings show that certain speaker-specific acoustic-phonetic properties can be fairly well predicted from the speaker embedding, while the more abstract voice quality features we investigated cannot.
Comment: Presented at the ITG conference on Speech Communication 202
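Shapley values over embedding dimensions can be estimated by Monte Carlo sampling of permutations, as in the sketch below. The background-vector fill-in for "absent" dimensions is a common approximation; the predictor, the background choice, and the estimator itself are assumptions, since the abstract does not specify them.

```python
import numpy as np

def shapley_values(model, x, background, n_perm=200, seed=0):
    """Monte Carlo Shapley estimate for one embedding vector.

    model:      callable mapping a (d,) vector to a scalar prediction
                (e.g. an acoustic-phonetic property regressor).
    x:          (d,) speaker embedding to explain.
    background: (d,) reference vector (e.g. the mean embedding) used to
                stand in for 'absent' dimensions.
    """
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    phi = np.zeros(d)
    for _ in range(n_perm):
        order = rng.permutation(d)
        z = background.copy()
        prev = model(z)
        for j in order:              # reveal dimensions one at a time
            z[j] = x[j]
            cur = model(z)
            phi[j] += cur - prev     # marginal contribution of dimension j
            prev = cur
    return phi / n_perm
```

Dimensions with consistently large |phi| are the ones "responsible" for the predicted property, which is the notion of component-level importance the abstract refers to.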
On the Impact of Voice Anonymization on Speech-Based COVID-19 Detection
With advances in deep learning, voice-based applications are burgeoning,
ranging from personal assistants and affective computing to remote disease
diagnostics. As the voice contains both linguistic and paralinguistic
information (e.g., vocal pitch, intonation, speech rate, loudness), there is
growing interest in voice anonymization to preserve speaker privacy and
identity. Voice privacy challenges have emerged over the last few years and
focus has been placed on removing speaker identity while keeping linguistic
content intact. For affective computing and disease monitoring applications,
however, the paralinguistic content may be more critical. Unfortunately, the
effects that anonymization may have on these systems are still largely unknown.
In this paper, we fill this gap and focus on one particular health monitoring
application: speech-based COVID-19 diagnosis. We test two popular anonymization
methods and their impact on five different state-of-the-art COVID-19 diagnostic
systems using three public datasets. We validate the effectiveness of the
anonymization methods, compare their computational complexity, and quantify the
impact across different testing scenarios for both within- and across-dataset
conditions. Lastly, we show the benefits of anonymization as a data
augmentation tool to help recover some of the COVID-19 diagnostic accuracy loss
seen with anonymized data.
Comment: 11 pages, 10 figure
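The augmentation idea in the last sentence can be sketched simply: train the diagnostic model on original utterances plus anonymized copies, so it relies on features that survive anonymization. The sketch below is illustrative; `anonymize` stands in for any voice anonymization system, and the mixing probability is an assumed knob.

```python
import random

def augment_with_anonymized(dataset, anonymize, p=0.5, seed=0):
    """Add an anonymized copy of each utterance with probability p.

    dataset:   iterable of (waveform, label) pairs.
    anonymize: callable waveform -> anonymized waveform (stand-in for any
               anonymization method, e.g. the two tested in the paper).
    """
    rng = random.Random(seed)
    augmented = []
    for wav, label in dataset:
        augmented.append((wav, label))       # keep the original
        if rng.random() < p:
            augmented.append((anonymize(wav), label))
    return augmented
```

Training on such a mixture is one way the recovered diagnostic accuracy on anonymized test data could be obtained; the paper's exact augmentation protocol may differ.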