9 research outputs found
Reimagining Speech: A Scoping Review of Deep Learning-Powered Voice Conversion
Research on deep learning-powered voice conversion (VC) in speech-to-speech
scenarios is getting increasingly popular. Although many of the works in the
field of voice conversion share a common global pipeline, there is a
considerable diversity in the underlying structures, methods, and neural
sub-blocks used across research efforts. Thus, obtaining a comprehensive
understanding of the reasons behind the choice of the different methods in the
voice conversion pipeline can be challenging, and the actual hurdles in the
proposed solutions are often unclear. To shed light on these aspects, this
paper presents a scoping review that explores the use of deep learning in
speech analysis, synthesis, and disentangled speech representation learning
within modern voice conversion systems. We screened 621 publications from more
than 38 different venues between the years 2017 and 2023, followed by an
in-depth review of a final database consisting of 123 eligible studies. Based
on the review, we summarise the most frequently used approaches to voice
conversion based on deep learning and highlight common pitfalls within the
community. Lastly, we condense the knowledge gathered, identify main challenges
and provide recommendations for future research directions
An Overview of Affective Speech Synthesis and Conversion in the Deep Learning Era
Speech is the fundamental mode of human communication, and its synthesis has
long been a core priority in human-computer interaction research. In recent
years, machines have managed to master the art of generating speech that is
understandable by humans. But the linguistic content of an utterance
encompasses only a part of its meaning. Affect, or expressivity, has the
capacity to turn speech into a medium capable of conveying intimate thoughts,
feelings, and emotions -- aspects that are essential for engaging and
naturalistic interpersonal communication. While the goal of imparting
expressivity to synthesised utterances has so far remained elusive, following
recent advances in text-to-speech synthesis, a paradigm shift is well under way
in the fields of affective speech synthesis and conversion as well. Deep
learning, as the technology which underlies most of the recent advances in
artificial intelligence, is spearheading these efforts. In the present
overview, we outline ongoing trends and summarise state-of-the-art approaches
in an attempt to provide a comprehensive overview of this exciting field.Comment: Submitted to the Proceedings of IEE
Non-parallel Voice Conversion with Controllable Speaker Individuality using Variational Autoencoder
We propose a flexible non-parallel voice conversion (VC) system that is capable of both performing speaker adaptation and controlling speaker individuality. The proposed VC framework aims to tackle the inability to arbitrarily modify voice characteristics in the converted waveform of conventional VC model. To achieve this goal, we use the speaker embedding realized by a Variational Autoencoder (VAE) instead of one-hot encoded vectors to represent and modify the target voice's characteristics. Neither parallel training data, linguistic label nor time alignment procedure is required to train our system. After training on a multi-speaker speech database, the proposed VC system can adapt an arbitrary source speaker to any target speaker using only one sample from a target speaker. The speaker individuality of converted speech can be controlled by modifying the speaker embedding vectors; resulting in a fictitious speaker individuality. The experimental results showed that our proposed system is similar to conventional non-parallel VAE-based VC and better than the parallel Gaussian Mixture Model (GMM) in both perceived speech naturalness and speaker similarity; even when our system only uses one sample from target speaker. Moreover, our proposed system can convert a source voice to a fictitious target voice with well perceived speech naturalness of 3.1 MOS
Proceedings of the 19th Sound and Music Computing Conference
Proceedings of the 19th Sound and Music Computing Conference - June 5-12, 2022 - Saint-Étienne (France).
https://smc22.grame.f
KEER2022
Avanttítol: KEER2022. DiversitiesDescripció del recurs: 25 juliol 202
Fuelling the zero-emissions road freight of the future: routing of mobile fuellers
The future of zero-emissions road freight is closely tied to the sufficient availability of new and clean fuel options such as electricity and Hydrogen. In goods distribution using Electric Commercial Vehicles (ECVs) and Hydrogen Fuel Cell Vehicles (HFCVs) a major challenge in the transition period would pertain to their limited autonomy and scarce and unevenly distributed refuelling stations. One viable solution to facilitate and speed up the adoption of ECVs/HFCVs by logistics, however, is to get the fuel to the point where it is needed (instead of diverting the route of delivery vehicles to refuelling stations) using "Mobile Fuellers (MFs)". These are mobile battery swapping/recharging vans or mobile Hydrogen fuellers that can travel to a running ECV/HFCV to provide the fuel they require to complete their delivery routes at a rendezvous time and space. In this presentation, new vehicle routing models will be presented for a third party company that provides MF services. In the proposed problem variant, the MF provider company receives routing plans of multiple customer companies and has to design routes for a fleet of capacitated MFs that have to synchronise their routes with the running vehicles to deliver the required amount of fuel on-the-fly. This presentation will discuss and compare several mathematical models based on different business models and collaborative logistics scenarios