Reimagining Speech: A Scoping Review of Deep Learning-Powered Voice Conversion
Research on deep learning-powered voice conversion (VC) in speech-to-speech
scenarios has become increasingly popular. Although many of the works in the
field of voice conversion share a common global pipeline, there is a
considerable diversity in the underlying structures, methods, and neural
sub-blocks used across research efforts. Thus, obtaining a comprehensive
understanding of the reasons behind the choice of the different methods in the
voice conversion pipeline can be challenging, and the actual hurdles in the
proposed solutions are often unclear. To shed light on these aspects, this
paper presents a scoping review that explores the use of deep learning in
speech analysis, synthesis, and disentangled speech representation learning
within modern voice conversion systems. We screened 621 publications from more
than 38 venues published between 2017 and 2023, followed by an in-depth review
of a final database of 123 eligible studies. Based
on the review, we summarise the most frequently used approaches to voice
conversion based on deep learning and highlight common pitfalls within the
community. Lastly, we condense the knowledge gathered, identify the main
challenges, and provide recommendations for future research directions.
A Comparison of Discrete and Soft Speech Units for Improved Voice Conversion
The goal of voice conversion is to transform source speech into a target
voice, keeping the content unchanged. In this paper, we focus on
self-supervised representation learning for voice conversion. Specifically, we
compare discrete and soft speech units as input features. We find that discrete
representations effectively remove speaker information but discard some
linguistic content, leading to mispronunciations. As a solution, we propose
soft speech units. To learn soft units, we predict a distribution over discrete
speech units. By modeling uncertainty, soft units capture more content
information, improving the intelligibility and naturalness of converted speech.
Samples available at https://ubisoft-laforge.github.io/speech/soft-vc/. Code
available at https://github.com/bshall/soft-vc/.
Comment: 5 pages, 2 figures, 2 tables. Accepted at ICASSP 2022.
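The core idea, training a model to predict a distribution over discrete units so
that the features feeding that prediction act as soft units, can be sketched in a
few lines. The following is a minimal illustration, not the authors' released
implementation: the class name SoftUnitHead, the dimensions, and the use of
HuBERT-style features with k-means unit labels are all assumptions.

    # Minimal sketch of learning soft speech units, assuming HuBERT-style
    # content features and per-frame k-means discrete unit labels exist.
    # Names and dimensions are illustrative, not the authors' exact code.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SoftUnitHead(nn.Module):
        def __init__(self, feature_dim=768, num_units=100, unit_dim=256):
            super().__init__()
            self.proj = nn.Linear(feature_dim, unit_dim)   # soft units live here
            self.logits = nn.Linear(unit_dim, num_units)   # distribution over discrete units

        def forward(self, features):
            soft_units = self.proj(features)               # (batch, frames, unit_dim)
            unit_logits = self.logits(soft_units)          # (batch, frames, num_units)
            return soft_units, unit_logits

    # Training step: predict the discrete unit id assigned to each frame.
    head = SoftUnitHead()
    features = torch.randn(8, 200, 768)         # e.g. HuBERT content features
    unit_ids = torch.randint(0, 100, (8, 200))  # k-means labels per frame
    soft, logits = head(features)
    loss = F.cross_entropy(logits.transpose(1, 2), unit_ids)
    loss.backward()
    # At conversion time the continuous `soft` features (not the argmax unit
    # ids) feed the synthesizer, retaining uncertainty about ambiguous frames.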
Voice Conversion With Just Nearest Neighbors
Any-to-any voice conversion aims to transform source speech into a target
voice with just a few examples of the target speaker as a reference. Recent
methods produce convincing conversions, but at the cost of increased complexity,
making results difficult to reproduce and build on. Instead, we keep it
simple. We propose k-nearest neighbors voice conversion (kNN-VC): a
straightforward yet effective method for any-to-any conversion. First, we
extract self-supervised representations of the source and reference speech. To
convert to the target speaker, we replace each frame of the source
representation with its nearest neighbor in the reference. Finally, a
pretrained vocoder synthesizes audio from the converted representation.
Objective and subjective evaluations show that kNN-VC improves speaker
similarity while achieving intelligibility scores comparable to existing
methods. Code, samples, and trained models: https://bshall.github.io/knn-vc
Comment: 5 pages, 1 table, 2 figures. Accepted at Interspeech 2023.
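The matching step described above is simple enough to sketch directly. Below is
a hedged illustration of frame-wise nearest-neighbour replacement under assumed
choices (cosine similarity, averaging k=4 neighbours, 1024-dimensional
WavLM-sized features); feature extraction and vocoding are left abstract, and
knn_convert is a hypothetical helper name, not the released code.

    # Sketch of the kNN-VC matching step: replace each source frame with the
    # mean of its k nearest neighbours among the reference frames.
    # The value k=4 and the cosine metric are assumptions for illustration;
    # feature extraction and vocoding are omitted.
    import torch
    import torch.nn.functional as F

    def knn_convert(source, reference, k=4):
        """source: (S, D) frames, reference: (R, D) frames -> (S, D)."""
        src = F.normalize(source, dim=-1)
        ref = F.normalize(reference, dim=-1)
        sim = src @ ref.T                    # (S, R) cosine similarities
        topk = sim.topk(k, dim=-1).indices   # (S, k) nearest reference frames
        return reference[topk].mean(dim=1)   # average neighbours per frame

    source_feats = torch.randn(500, 1024)      # e.g. features of source speech
    reference_feats = torch.randn(3000, 1024)  # features of target-speaker audio
    converted = knn_convert(source_feats, reference_feats)
    # `converted` would then be passed to a vocoder trained on the same
    # feature space to synthesize the output waveform.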