8,406 research outputs found
Low Bit-Rate Speech Coding with VQ-VAE and a WaveNet Decoder
In order to efficiently transmit and store speech signals, speech codecs
create a minimally redundant representation of the input signal which is then
decoded at the receiver with the best possible perceptual quality. In this work
we demonstrate that a neural network architecture based on VQ-VAE with a
WaveNet decoder can be used to perform very low bit-rate speech coding with
high reconstruction quality. A prosody-transparent and speaker-independent
model trained on the LibriSpeech corpus coding audio at 1.6 kbps exhibits
perceptual quality which is around halfway between the MELP codec at 2.4 kbps
and AMR-WB codec at 23.05 kbps. In addition, when training on high-quality
recorded speech with the test speaker included in the training set, a model
coding speech at 1.6 kbps produces output of similar perceptual quality to that
generated by AMR-WB at 23.05 kbps. Comment: ICASSP 201
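As a rough illustration of the vector-quantisation step at the heart of VQ-VAE coding, the sketch below encodes frame vectors as codebook indices and decodes them by table lookup. The codebook size, code rate, and dimensions are hypothetical illustrations, not the paper's actual configuration:

```python
import numpy as np

def vq_encode(frames, codebook):
    """Return the index of the nearest codebook entry for each frame."""
    # frames: (T, D) encoder outputs; codebook: (K, D) learned codewords
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)  # (T,) integer codes to transmit

def vq_decode(codes, codebook):
    """Look the transmitted codes back up in the codebook."""
    return codebook[codes]

# Bit-rate arithmetic (hypothetical numbers): a 256-entry codebook costs
# log2(256) = 8 bits per code, so at 200 codes per second the stream is
# 200 * 8 = 1600 bits/s = 1.6 kbps.
rng = np.random.default_rng(0)
codebook = rng.standard_normal((256, 64))
frames = rng.standard_normal((10, 64))
codes = vq_encode(frames, codebook)
recon = vq_decode(codes, codebook)
```

In the full system the decoder is not a table lookup alone: the WaveNet conditions on the decoded codes to synthesise the waveform, which is what recovers high perceptual quality at such low rates.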
Access to recorded interviews: A research agenda
Recorded interviews form a rich basis for scholarly inquiry. Examples include oral histories, community memory projects, and interviews conducted for broadcast media. Emerging technologies offer the potential to radically transform the way in which recorded interviews are made accessible, but this vision will demand substantial investments from a broad range of research communities. This article reviews the present state of practice for making recorded interviews available and the state-of-the-art for key component technologies. A large number of important research issues are identified, and from that set of issues, a coherent research agenda is proposed.
Privacy-preserving Representation Learning for Speech Understanding
Existing privacy-preserving speech representation learning methods target a
single application domain. In this paper, we present a novel framework to
anonymize utterance-level speech embeddings generated by pre-trained encoders
and show its effectiveness for a range of speech classification tasks.
Specifically, given the representations from a pre-trained encoder, we train a
Transformer to estimate the representations for the same utterances spoken by
other speakers. During inference, the extracted representations can be
converted into different identities to preserve privacy. We compare the results
with the voice anonymization baselines from the VoicePrivacy 2022 challenge. We
evaluate our framework on speaker identification for privacy and emotion
recognition, depression classification, and intent classification for utility.
Our method outperforms the baselines on privacy and utility in paralinguistic
tasks and achieves comparable performance for intent classification. Comment: INTERSPEECH 202
Long-term Conversation Analysis: Exploring Utility and Privacy
The analysis of conversations recorded in everyday life requires privacy
protection. In this contribution, we explore a privacy-preserving feature
extraction method based on input feature dimension reduction, spectral
smoothing and the low-cost speaker anonymization technique based on McAdams
coefficient. We assess the utility of the feature extraction methods with a
voice activity detection and a speaker diarization system, while privacy
protection is determined with a speech recognition and a speaker verification
model. We show that the combination of McAdams coefficient and spectral
smoothing maintains the utility while improving privacy. Comment: Submitted to ITG Conference on Speech Communication, 202
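A minimal sketch of the pole-warping idea behind the McAdams-coefficient anonymizer: per analysis frame, LPC poles are estimated, their angles are warped, and the signal is resynthesised. Only the warping step is shown here, and alpha = 0.8 is an illustrative value, not one taken from this paper:

```python
import numpy as np

def mcadams_warp_poles(poles, alpha=0.8):
    """Warp each complex LPC pole angle: phi -> sign(phi) * |phi|**alpha.
    Pole magnitudes (hence formant bandwidths) are preserved; real poles
    are left untouched."""
    warped = []
    for p in poles:
        if np.isreal(p):
            warped.append(p)
        else:
            r, phi = np.abs(p), np.angle(p)
            warped.append(r * np.exp(1j * np.sign(phi) * np.abs(phi) ** alpha))
    return np.array(warped)
```

Because conjugate pole pairs are warped symmetrically, the resynthesised filter stays real-valued; shifting the pole angles moves the formant frequencies, which is what obscures speaker identity at low computational cost.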
Iteratively Improving Speech Recognition and Voice Conversion
Many existing works on voice conversion (VC) tasks use automatic speech
recognition (ASR) models for ensuring linguistic consistency between source and
converted samples. However, for the low-data resource domains, training a
high-quality ASR remains to be a challenging task. In this work, we propose a
novel iterative way of improving both the ASR and VC models. We first train an
ASR model which is used to ensure content preservation while training a VC
model. In the next iteration, the VC model is used as a data augmentation
method to further fine-tune the ASR model and generalize it to diverse
speakers. By iteratively leveraging the improved ASR model to train the VC
model and vice versa, we experimentally show improvement in both models. Our
proposed framework outperforms the ASR and one-shot VC baseline models on
English singing and Hindi speech in subjective and objective
evaluations under low-resource settings.
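Schematically, the alternating scheme can be written as the loop below. The callables are stand-ins for the actual neural training and conversion procedures, which the abstract does not specify:

```python
def iterative_training(train_vc, finetune_asr, convert, real_data,
                       asr, vc, n_rounds=3):
    """Alternately (1) train the VC model with the current ASR model as a
    content-preservation constraint, then (2) fine-tune the ASR model on
    real data plus VC-converted augmentations."""
    for _ in range(n_rounds):
        # VC training uses the current ASR to keep linguistic content intact.
        vc = train_vc(vc, real_data, content_model=asr)
        # The improved VC model synthesises speaker-diverse training data.
        augmented = [convert(vc, utt) for utt in real_data]
        asr = finetune_asr(asr, real_data + augmented)
    return asr, vc
```

Each round thus tightens the loop: a better ASR gives a cleaner content constraint for VC, and a better VC gives more diverse augmentation for ASR.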
Reimagining Speech: A Scoping Review of Deep Learning-Powered Voice Conversion
Research on deep learning-powered voice conversion (VC) in speech-to-speech
scenarios is getting increasingly popular. Although many of the works in the
field of voice conversion share a common global pipeline, there is a
considerable diversity in the underlying structures, methods, and neural
sub-blocks used across research efforts. Thus, obtaining a comprehensive
understanding of the reasons behind the choice of the different methods in the
voice conversion pipeline can be challenging, and the actual hurdles in the
proposed solutions are often unclear. To shed light on these aspects, this
paper presents a scoping review that explores the use of deep learning in
speech analysis, synthesis, and disentangled speech representation learning
within modern voice conversion systems. We screened 621 publications from more
than 38 different venues between the years 2017 and 2023, followed by an
in-depth review of a final database consisting of 123 eligible studies. Based
on the review, we summarise the most frequently used approaches to voice
conversion based on deep learning and highlight common pitfalls within the
community. Lastly, we condense the knowledge gathered, identify main challenges
and provide recommendations for future research directions.
Deep learning for speech enhancement : a thesis presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science at Massey University, Albany, New Zealand
Speech enhancement, aiming at improving the intelligibility and overall perceptual quality of a contaminated speech signal, is an effective way to improve speech communications. In this thesis, we propose three novel deep learning methods to improve speech enhancement performance.
Firstly, we propose an adversarial latent representation learning for latent space exploration of generative adversarial network based speech enhancement. Based on adversarial feature learning, this method employs an extra encoder to learn an inverse mapping from the generated data distribution to the latent space. The encoder establishes an inner connection with the generator and contributes to latent information learning.
Secondly, we propose an adversarial multi-task learning with inverse mappings method for effective speech representation. This speech enhancement method focuses on enhancing the generator's capability of speech information capture and representation learning. To implement this method, two extra networks are developed to learn the inverse mappings from the generated distribution to the input data domains.
Thirdly, we propose a self-supervised learning based phone-fortified method to improve specific speech characteristics learning for speech enhancement. This method explicitly imports phonetic characteristics into a deep complex convolutional network via a contrastive predictive coding model pre-trained with self-supervised learning.
The experimental results demonstrate that the proposed methods outperform previous speech enhancement methods and achieve state-of-the-art performance in terms of speech intelligibility and overall perceptual quality.
A Computational Model of Auditory Feature Extraction and Sound Classification
This thesis introduces a computer model that incorporates responses similar to
those found in the cochlea, in sub-cortical auditory processing, and in auditory
cortex. The principal aim of this work is to show that this can form the basis
for a biologically plausible mechanism of auditory stimulus classification. We will
show that this classification is robust to stimulus variation and time compression.
In addition, the response of the system is shown to support multiple, concurrent,
behaviourally relevant classifications of natural stimuli (speech).
The model incorporates transient enhancement, an ensemble of
spectro-temporal filters, and a simple measure analogous to the idea of visual salience
to produce a quasi-static description of the stimulus suitable either for classification
with an analogue artificial neural network or, using appropriate rate coding,
a classifier based on artificial spiking neurons. We also show that the spectro-temporal
ensemble can be derived from a limited class of 'formative' stimuli, consistent
with a developmental interpretation of ensemble formation. In addition,
ensembles chosen on information theoretic grounds consist of filters with relatively
simple geometries, which is consistent with reports of responses in mammalian
thalamus and auditory cortex.
A powerful feature of this approach is that the ensemble response, from
which salient auditory events are identified, amounts to a stimulus-ensemble-driven
method of segmentation that respects the envelope of the stimulus and leads
to a quasi-static representation of auditory events which is suitable for spike rate
coding.
We also present evidence that the encoded auditory events may form the
basis of a representation-of-similarity, or second order isomorphism, which implies
a representational space that respects similarity relationships between stimuli
including novel stimuli.
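The filter-ensemble stage described above can be sketched as a bank of 2-D patches correlated with a spectrogram along time. The sizes and random filters here are arbitrary placeholders, not the thesis's learned ensemble:

```python
import numpy as np

def ensemble_response(spec, filters):
    """spec: (F, T) spectrogram; filters: list of (F, W) spectro-temporal
    patches, all sharing the same width W. Each filter is slid along the
    time axis and correlated with the spectrogram region beneath it,
    giving one response trace per filter."""
    responses = []
    for filt in filters:
        W = filt.shape[1]
        trace = [float((spec[:, t:t + W] * filt).sum())
                 for t in range(spec.shape[1] - W + 1)]
        responses.append(trace)
    return np.array(responses)  # (n_filters, T - W + 1)
```

Peaks in these traces would then mark salient auditory events, from which a quasi-static description suitable for a downstream classifier can be assembled.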