Non-Parallel Articulatory-to-Acoustic Conversion Using Multiview-based Time Warping
In this paper, we propose a novel algorithm called multiview temporal alignment by dependence maximisation in the latent space (TRANSIENCE) for aligning time series that consist of sequences of feature vectors differing in both length and dimensionality. The proposed algorithm, which is based on the theory of multiview learning, can be seen as an extension of the well-known dynamic time warping (DTW) algorithm that also allows the sequences to have different dimensionalities. Our algorithm attempts to find an optimal temporal alignment between pairs of nonaligned sequences by first projecting their feature vectors into a common latent space where both views are maximally similar. To do this, powerful nonlinear deep neural network (DNN) models are employed. Then, the resulting sequences of embedding vectors are aligned using DTW. Finally, the alignment paths obtained in the previous step are applied to the original sequences to align them. In the paper, we explore several variants of the algorithm that mainly differ in the way the DNNs are trained. We evaluated the proposed algorithm on an articulatory-to-acoustic (A2A) synthesis task involving the generation of audible speech from motion data captured from the lips and tongue of healthy speakers using a technique known as permanent magnet articulography (PMA). In this task, our algorithm is applied during the training stage to align pairs of nonaligned speech and PMA recordings that are later used to train DNNs able to synthesise speech from PMA data. Our results show that the quality of speech generated in the nonaligned scenario is comparable to that obtained in the parallel scenario.
This work was supported in part by the Spanish State Research Agency (SRA), grant number PID2019-108040RB-C22/SRA/10.13039/501100011033, and by the FEDER/Junta de Andalucía-Consejería de Transformación Económica, Industria, Conocimiento y Universidades, project no. B-SEJ-570-UGR20.
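The three steps described in the abstract (project both views into a shared latent space, run DTW on the embeddings, apply the path to the original sequences) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `dtw_path` and `align_views` are hypothetical names, and the encoders here are random linear projections standing in for the trained DNNs.

```python
import numpy as np

def dtw_path(a, b):
    """Classic DTW between two sequences of vectors with the same dimensionality.

    Returns the alignment path as a list of (index_in_a, index_in_b) pairs.
    """
    n, m = len(a), len(b)
    # Pairwise Euclidean distances between all frames of a and b.
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]
            )
    # Backtrack from the end of both sequences to the start.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        steps = [(i - 1, j - 1), (i - 1, j), (i, j - 1)]
        i, j = min(steps, key=lambda s: cost[s])
        path.append((i - 1, j - 1))
    return path[::-1]

def align_views(x, y, encode_x, encode_y):
    """Align two sequences whose feature vectors may differ in dimensionality.

    encode_x / encode_y map each view into a common latent space
    (assumption: in the paper these are trained DNNs).
    """
    path = dtw_path(encode_x(x), encode_y(y))
    ix, iy = zip(*path)
    # Apply the latent-space alignment path to the ORIGINAL sequences.
    return x[list(ix)], y[list(iy)]

# Toy usage: random data and random linear "encoders" (placeholders only).
rng = np.random.default_rng(0)
x = rng.standard_normal((50, 9))    # e.g. an articulatory-like sequence
y = rng.standard_normal((70, 25))   # e.g. an acoustic-like sequence
w_x = rng.standard_normal((9, 4))
w_y = rng.standard_normal((25, 4))
x_aligned, y_aligned = align_views(x, y, lambda s: s @ w_x, lambda s: s @ w_y)
```

After alignment both outputs have the same number of frames while keeping their original feature dimensionalities, which is what makes them usable as paired training data.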
Green Communication via Power-optimized HARQ Protocols
Recently, efficient use of energy has become an essential research topic for
green communication. This paper studies the effect of optimal power controllers
on the performance of delay-sensitive communication setups utilizing hybrid
automatic repeat request (HARQ). The results are obtained for repetition time
diversity (RTD) and incremental redundancy (INR) HARQ protocols. In all cases,
the optimal power allocation, minimizing the outage-limited average
transmission power, is obtained under both continuous and bursting
communication models. Also, we investigate the system throughput in different
conditions. The results indicate that the power efficiency is increased
substantially if adaptive power allocation is utilized. For instance, assume a
Rayleigh-fading channel, a maximum of two (re)transmission rounds with rates
nats-per-channel-use and an outage probability constraint
. Then, compared to uniform power allocation, optimal power
allocation in RTD reduces the average power by 9 and 11 dB in the bursting and
continuous communication models, respectively. In INR, these values are
obtained to be 8 and 9 dB, respectively.
Comment: Accepted for publication in IEEE Transactions on Vehicular Technology
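To make the two protocols concrete, the following Monte Carlo sketch estimates the outage probability of RTD and INR for a given per-round power allocation. It is only an illustration under stated assumptions: the function name `outage_probabilities` is hypothetical, the Rayleigh-fading gains are taken as independent across rounds with unit mean, noise power is normalised to one, and the powers and rate used below are arbitrary values, not the ones from the paper.

```python
import numpy as np

def outage_probabilities(powers, rate, n_trials=200_000, seed=0):
    """Estimate (RTD outage, INR outage) after all rounds have been used.

    powers : transmission power of each (re)transmission round
    rate   : initial code rate in nats-per-channel-use
    """
    rng = np.random.default_rng(seed)
    # Rayleigh fading: channel power gains are exponentially distributed.
    gains = rng.exponential(1.0, size=(n_trials, len(powers)))
    snr = gains * np.asarray(powers, dtype=float)
    # RTD: the same codeword is repeated and combined, so received SNRs add up.
    rtd_info = np.log1p(snr.sum(axis=1))
    # INR: each round sends new redundancy, so the mutual informations add up.
    inr_info = np.log1p(snr).sum(axis=1)
    return float(np.mean(rtd_info < rate)), float(np.mean(inr_info < rate))

# Uniform allocation over two rounds (illustrative numbers).
rtd_outage, inr_outage = outage_probabilities([1.0, 1.0], rate=1.0)
```

Because the product of the terms (1 + SNR) is never smaller than one plus their sum, INR accumulates at least as much information as RTD for the same channel realisations, so its estimated outage is never higher.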
Towards speech recognition using palato-lingual contact patterns for voice restoration.
The loss of speech following a laryngectomy presents substantial challenges, and a
number of devices have been developed to assist these patients. These devices range
from the electrolarynx to tracheoesophageal speech. However, all of these devices
and techniques have concentrated on producing sound from the patient’s vocal tract.
Research into a new type of artificial larynx is presented. This new device utilizes the
measurement of dynamic tongue-palate contact patterns to infer intended speech.
The dynamic tongue measurement is achieved with the use of an existing palatometer
and pseudopalate. These signals are then converted to 2-D Space-Time plots and
feature extraction methods (such as Principal Component Analysis, Fourier Descriptors
and Generic Fourier Descriptors) are used to extract suitable features for use as
input to neural network systems. Two types of neural network (Multi-layer Perceptrons
and Support Vector Machines) are investigated and a voting system is formed.
The final system can correctly identify fifty common English words 94.14% of the
time with a rejection rate of 17.74%.
Voice morphing is investigated as a technique to match the artificially synthesized
voice to the laryngectomy patient’s original voice. It is successfully implemented
thus creating a transfer function that can change one person’s voice to sound like
another’s. Once the voting system has correctly identified the word said by the
patient the word is then synthesized in the patient’s pre-laryngectomy voice.
The final artificial larynx system solves a number of the problems inherent in previous
artificial larynx designs (such as poor voice quality and invasiveness). This new
artificial larynx uses current technology in a new way to produce a viable solution
for alaryngeal patients.
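The abstract does not spell out how the voting system decides when to reject a word. One plausible scheme, shown here purely as a hypothetical sketch (the function `vote_with_rejection` and its `min_agreement` threshold are assumptions, not the thesis's method), is to accept the majority label only when enough classifiers agree and there is no tie:

```python
from collections import Counter

def vote_with_rejection(predictions, min_agreement=2):
    """Return the majority label, or None to signal a rejection.

    predictions   : word labels proposed by the individual classifiers
    min_agreement : minimum number of classifiers that must agree (assumed rule)
    """
    ranked = Counter(predictions).most_common(2)
    # Reject when the two most common labels are tied.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    label, count = ranked[0]
    return label if count >= min_agreement else None
```

A rule of this kind trades coverage for accuracy: rejected utterances are not synthesized, so the accuracy figure applies only to the words that pass the vote.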
An Introduction to Variational Autoencoders
Variational autoencoders provide a principled framework for learning deep
latent-variable models and corresponding inference models. In this work, we
provide an introduction to variational autoencoders and some important
extensions.
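As a small illustration of the inference-model side, the sketch below shows two standard building blocks of a variational autoencoder, assuming a diagonal-Gaussian encoder and a standard normal prior (a common choice, not the only one the framework allows): the closed-form KL term of the evidence lower bound and the reparameterization trick. The function names are illustrative.

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in nats, summed over latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I).

    Writing the sample this way keeps z a differentiable function of
    mu and log_var, which is what allows gradient-based training.
    """
    eps = rng.standard_normal(np.shape(mu))
    return mu + np.exp(0.5 * log_var) * eps
```

The training objective then combines an expected reconstruction log-likelihood, estimated from samples of `reparameterize`, with the `gaussian_kl` penalty.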
DMRN+18: Digital Music Research Network One-day Workshop 2023
Queen Mary University of London, Tuesday 19th December 2023. Keynote speaker: Stefan Bilbao.
The Digital Music Research Network (DMRN) aims to promote research in the area of digital music by bringing together researchers from UK and overseas universities, as well as industry, for its annual workshop. The workshop will include invited and contributed talks and posters, and will be an ideal opportunity for networking with other people working in the area.
Keynote title: Physics-based Audio: Sound Synthesis and Virtual Acoustics.
Abstract: Any acoustically produced sound must be the result of physical laws that describe the dynamics of a given system (always at least partly mechanical, and sometimes with an electronic element as well). One approach to the synthesis of natural acoustic timbres, thus, is through simulation, often referred to in this context as physical modelling, or physics-based audio. In this talk, the principles of physics-based audio and the various approaches to simulation are described, followed by a set of examples covering: various musical instrument types; the important related problem of the emulation of room acoustics or "virtual acoustics"; the embedding of instruments in a 3D virtual space; electromechanical effects; and also new modular instrument designs based on physical laws, but without a counterpart in the real world. Some more technical details follow, including the strengths, weaknesses and limitations of such methods, and pointers to some links to data-centred black-box approaches to sound generation and effects processing. The talk concludes with some musical examples and recent work on moving such algorithms to a real-time setting.
Bio: Stefan Bilbao is a full Professor at the Reid School of Music, University of Edinburgh, where he holds the Personal Chair of Acoustics and Audio Signal Processing. He currently works on computational acoustics, for applications in sound synthesis and virtual acoustics. Special topics of interest include: finite difference time domain methods, distributed nonlinear systems such as strings and plates, architectural acoustics, spatial audio in simulation, multichannel sound synthesis, and hardware and software realizations. More information: https://www.acoustics.ed.ac.uk/group-members/dr-stefan-bilbao/
DMRN+18 is sponsored by the UKRI Centre for Doctoral Training in Artificial Intelligence and Music (AIM), a leading PhD research programme aimed at the Music/Audio Technology and Creative Industries, based at Queen Mary University of London.
Text-Independent Voice Conversion
This thesis deals with text-independent solutions for voice conversion. It first introduces the use of vocal tract length normalization (VTLN) for voice conversion. The presented variants of VTLN allow for easily changing speaker characteristics by means of a few trainable parameters. Furthermore, it is shown how VTLN can be expressed in the time domain, strongly reducing the computational cost while keeping a high speech quality. The second text-independent voice conversion paradigm is residual prediction. In particular, two proposed techniques, residual smoothing and the application of unit selection, result in essential improvements in both speech quality and voice similarity. In order to apply the well-studied linear transformation paradigm to text-independent voice conversion, two text-independent speech alignment techniques are introduced. One is based on automatic segmentation and mapping of artificial phonetic classes, and the other is a completely data-driven approach with unit selection. The latter achieves a performance very similar to the conventional text-dependent approach in terms of speech quality and similarity. It is also successfully applied to cross-language voice conversion. The investigations of this thesis are based on several corpora of three different languages, i.e., English, Spanish, and German. Results are also presented from the multilingual voice conversion evaluation in the framework of the international speech-to-speech translation project TC-Star.
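To illustrate how a single trainable parameter can change speaker characteristics in VTLN, here is a minimal sketch of one commonly used warping shape, a piecewise-linear frequency warp. It is not taken from the thesis: the function name, the breakpoint `omega0`, and the example warping factor are all assumptions chosen for illustration.

```python
import numpy as np

def piecewise_linear_warp(omega, alpha, omega0=0.875 * np.pi):
    """Warp normalised frequencies in [0, pi] with warping factor alpha.

    Below omega0 the frequency axis is scaled by alpha; above it, a second
    linear segment joins (omega0, alpha * omega0) to (pi, pi) so that the
    band edge is preserved. Assumes alpha * omega0 < pi.
    """
    omega = np.asarray(omega, dtype=float)
    low = alpha * omega
    slope = (np.pi - alpha * omega0) / (np.pi - omega0)
    high = alpha * omega0 + slope * (omega - omega0)
    return np.where(omega <= omega0, low, high)

# Example: stretch the lower part of the spectrum by 10 percent.
freqs = np.linspace(0.0, np.pi, 101)
warped = piecewise_linear_warp(freqs, alpha=1.1)
```

Values of `alpha` above one stretch the lower part of the frequency axis and values below one compress it, which is the kind of spectral change that separates speakers with different vocal tract lengths.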
Sliding Mode Control
The main objective of this monograph is to present a broad range of well-worked-out, recent application studies as well as theoretical contributions in the field of sliding mode control system analysis and design. The contributions presented here include new theoretical developments as well as successful applications of variable structure controllers, primarily in the field of power electronics, electric drives and motion steering systems. They enrich the current state of the art, and motivate and encourage new ideas and solutions in the sliding mode control area.
An Indirect Speech Enhancement Framework Through Intermediate Noisy Speech Targets
Noise presents a severe challenge in speech communication and processing systems. Speech enhancement aims at removing the interference and restoring speech quality. It is an essential step in the speech processing pipeline of many modern electronic devices, such as mobile phones and smart speakers. Traditionally, speech engineers have relied on signal processing techniques, such as spectral subtraction or Wiener filtering. Since the advent of deep learning, data-driven methods have offered an alternative solution to speech enhancement. Researchers and engineers have proposed various neural network architectures to map noisy speech features into clean ones. In this thesis, we refer to this class of mapping-based data-driven techniques collectively as the direct method in speech enhancement. The output speech from direct mapping methods usually contains noise residue and unpleasant distortion if the speech power is low relative to the noise power or the background noise is very complex. The former adverse condition refers to a low signal-to-noise ratio (SNR). The latter condition implies difficult noise types. To overcome this difficulty, researchers have proposed improving the SNR of the speech signal incrementally during enhancement, an approach known as SNR-progressive speech enhancement. This design breaks down the problem of direct mapping into manageable sub-tasks. Inspired by this previous work, we propose to adopt a multi-stage indirect approach to speech enhancement in challenging noise conditions. Unlike SNR-progressive speech enhancement, we gradually transform noisy speech from difficult background noise to speech in simple noise types. The thesis's focus will include the characterization of background noise, speech transformation techniques, and integration of an indirect speech enhancement system.
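Since the abstract defines the low-SNR condition as speech power being low relative to noise power, a short sketch may help show how noisy speech at a chosen SNR is typically constructed, for example to build the intermediate targets of an SNR-progressive scheme. The helper names and the synthetic signals below are illustrative assumptions, not material from the thesis.

```python
import numpy as np

def mix_at_snr(speech, noise, target_snr_db):
    """Scale the noise so that speech + noise has the requested SNR in dB."""
    speech_power = np.mean(speech**2)
    noise_power = np.mean(noise**2)
    # SNR_dB = 10 * log10(speech_power / (scale**2 * noise_power))
    scale = np.sqrt(speech_power / (noise_power * 10 ** (target_snr_db / 10)))
    return speech + scale * noise

def measure_snr_db(speech, noisy):
    """SNR in dB of a noisy signal, given the clean speech it contains."""
    residual = noisy - speech
    return 10 * np.log10(np.mean(speech**2) / np.mean(residual**2))

# Synthetic stand-ins for real recordings (assumption for illustration only).
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
noise = rng.standard_normal(16000)

# Intermediate targets with progressively higher SNR.
targets = [mix_at_snr(speech, noise, snr) for snr in (0.0, 5.0, 10.0)]
```

In an SNR-progressive setup each stage would be trained to map its input to the next, cleaner target; the indirect approach in the abstract instead changes the noise type between stages.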