132 research outputs found

    Non-Parallel Articulatory-to-Acoustic Conversion Using Multiview-based Time Warping

    This work was supported in part by the Spanish State Research Agency (SRA) grant number PID2019-108040RB-C22/SRA/10.13039/501100011033, and the FEDER/Junta de Andalucía, Consejería de Transformación Económica, Industria, Conocimiento y Universidades, project no. B-SEJ-570-UGR20. In this paper, we propose a novel algorithm called multiview temporal alignment by dependence maximisation in the latent space (TRANSIENCE) for the alignment of time series consisting of sequences of feature vectors with different lengths and dimensionalities. The proposed algorithm, which is based on the theory of multiview learning, can be seen as an extension of the well-known dynamic time warping (DTW) algorithm that allows the sequences to have different dimensionalities. Our algorithm attempts to find an optimal temporal alignment between pairs of nonaligned sequences by first projecting their feature vectors into a common latent space where both views are maximally similar. To do this, powerful, nonlinear deep neural network (DNN) models are employed. Then, the resulting sequences of embedding vectors are aligned using DTW. Finally, the alignment paths obtained in the previous step are applied to the original sequences to align them. In the paper, we explore several variants of the algorithm that mainly differ in the way the DNNs are trained. We evaluated the proposed algorithm on an articulatory-to-acoustic (A2A) synthesis task involving the generation of audible speech from motion data captured from the lips and tongue of healthy speakers using a technique known as permanent magnet articulography (PMA). In this task, our algorithm is applied during the training stage to align pairs of nonaligned speech and PMA recordings that are later used to train DNNs able to synthesize speech from PMA data. Our results show that the quality of speech generated in the nonaligned scenario is comparable to that obtained in the parallel scenario.
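The alignment procedure described in this abstract (encode both views into a common latent space, run DTW on the embeddings, then transfer the resulting path back to the original sequences) can be sketched as follows. This is a minimal illustration of the mechanics only: the trained DNN encoders are replaced by hypothetical random linear projections, and the feature dimensions (9-dim articulatory, 13-dim acoustic) are made-up examples.

```python
import numpy as np

def dtw_path(a, b):
    """Plain DTW between two sequences of same-dimension vectors.
    Returns the optimal alignment path as a list of (i, j) index pairs."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1],
                                 cost[i - 1, j - 1])
    # Backtrack from the end of both sequences to the start.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j],
                              cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Hypothetical stand-ins for the trained DNN encoders: random linear maps
# into a shared 4-dimensional latent space (illustrative dimensions only).
rng = np.random.default_rng(0)
pma = rng.normal(size=(30, 9))       # e.g. 9-dim articulatory frames
speech = rng.normal(size=(50, 13))   # e.g. 13-dim acoustic frames
W_pma = rng.normal(size=(9, 4))
W_speech = rng.normal(size=(13, 4))

path = dtw_path(pma @ W_pma, speech @ W_speech)  # align in latent space
aligned_pma = pma[[i for i, _ in path]]          # apply path to originals
aligned_speech = speech[[j for _, j in path]]
```

In TRANSIENCE the two projections are deep networks trained so that the views become maximally similar in the latent space; the DTW and path-transfer steps are as above.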

    Green Communication via Power-optimized HARQ Protocols

    Recently, efficient use of energy has become an essential research topic for green communication. This paper studies the effect of optimal power controllers on the performance of delay-sensitive communication setups utilizing hybrid automatic repeat request (HARQ). The results are obtained for the repetition time diversity (RTD) and incremental redundancy (INR) HARQ protocols. In all cases, the optimal power allocation, minimizing the outage-limited average transmission power, is obtained under both continuous and bursting communication models. Also, we investigate the system throughput under different conditions. The results indicate that power efficiency increases substantially if adaptive power allocation is utilized. For instance, assume a Rayleigh-fading channel, a maximum of two (re)transmission rounds with rates {1, 1/2} nats-per-channel-use, and an outage probability constraint of 10^{-3}. Then, compared to uniform power allocation, optimal power allocation in RTD reduces the average power by 9 and 11 dB in the bursting and continuous communication models, respectively. In INR, these values are 8 and 9 dB, respectively. Comment: Accepted for publication in IEEE Transactions on Vehicular Technology.
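For context, the outage probability that the power optimization targets has a closed form in the single-transmission Rayleigh-fading case. The sketch below shows only this textbook baseline, not the paper's multi-round RTD/INR allocation; the rate and outage numbers echo the example in the abstract.

```python
import math

def outage_prob(snr, rate_nats):
    """Outage probability of one transmission over Rayleigh fading.
    With capacity log(1 + snr*|h|^2), |h|^2 ~ Exp(1), and the rate in
    nats per channel use: P_out = 1 - exp(-(e^R - 1) / snr)."""
    return 1.0 - math.exp(-(math.exp(rate_nats) - 1.0) / snr)

def snr_for_outage(target, rate_nats):
    """Average SNR needed to meet an outage constraint at a given rate."""
    return (math.exp(rate_nats) - 1.0) / -math.log(1.0 - target)

# Rate 1 nat per channel use, outage constraint 10^-3 (as in the example).
snr = snr_for_outage(1e-3, 1.0)
print(f"required average SNR: {10 * math.log10(snr):.1f} dB")  # ≈ 32.3 dB
```

Multi-round HARQ with per-round power optimization lowers this requirement, which is where the 8–11 dB gains quoted above come from.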

    Towards speech recognition using palato-lingual contact patterns for voice restoration.

    The loss of speech following a laryngectomy presents substantial challenges, and a number of devices have been developed to assist these patients, ranging from the electrolarynx to tracheoesophageal speech. However, all of these devices and techniques have concentrated on producing sound from the patient's vocal tract. Research into a new type of artificial larynx is presented. This new device utilizes the measurement of dynamic tongue-palate contact patterns to infer intended speech. The dynamic tongue measurement is achieved with the use of an existing palatometer and pseudopalate. These signals are converted to 2-D space-time plots, and feature extraction methods (such as Principal Component Analysis, Fourier Descriptors and Generic Fourier Descriptors) are used to extract suitable features for use as input to classification systems. Two types of classifier (Multi-layer Perceptrons and Support Vector Machines) are investigated and a voting system is formed. The final system can correctly identify fifty common English words 94.14% of the time with a rejection rate of 17.74%. Voice morphing is investigated as a technique to match the artificially synthesized voice to the laryngectomy patient's original voice. It is successfully implemented, thus creating a transfer function that can change one person's voice to sound like another's. Once the voting system has correctly identified the word said by the patient, the word is synthesized in the patient's pre-laryngectomy voice. The final artificial larynx system solves a number of the problems inherent in previous artificial larynx designs (such as poor voice quality and invasiveness). This new artificial larynx uses current technology in a new way to produce a viable solution for alaryngeal patients.
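The abstract does not spell out the exact voting rule, but a unanimity vote with a rejection option, which produces accuracy/rejection trade-offs like the 94.14%/17.74% figures quoted, can be sketched as follows. The threshold value and the (word, confidence) interface are assumptions for illustration.

```python
def vote(predictions, threshold=0.6):
    """Combine classifier outputs with a rejection option.
    `predictions` is a list of (word, confidence) pairs, one per
    classifier (e.g. an MLP and an SVM). The word is accepted only if
    all classifiers agree and the mean confidence clears the threshold;
    otherwise the utterance is rejected (None)."""
    words = {word for word, _ in predictions}
    if len(words) != 1:
        return None  # classifiers disagree -> reject
    mean_conf = sum(conf for _, conf in predictions) / len(predictions)
    return predictions[0][0] if mean_conf >= threshold else None

print(vote([("hello", 0.9), ("hello", 0.8)]))   # accepted: hello
print(vote([("hello", 0.9), ("water", 0.95)]))  # rejected: disagreement
print(vote([("hello", 0.5), ("hello", 0.4)]))   # rejected: low confidence
```

Raising the threshold trades a higher rejection rate for higher accuracy on the words that are accepted.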

    An Introduction to Variational Autoencoders

    Variational autoencoders provide a principled framework for learning deep latent-variable models and corresponding inference models. In this work, we provide an introduction to variational autoencoders and some important extensions
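The two ingredients that distinguish a VAE from a plain autoencoder, the reparameterization trick and the analytic KL term of the ELBO, are compact enough to sketch directly; a full model would add encoder/decoder networks and automatic differentiation, which are omitted here.

```python
import numpy as np

def gaussian_kl(mu, log_var):
    """Analytic KL(N(mu, sigma^2) || N(0, I)) per example, a term of the
    negative ELBO: 0.5 * sum(sigma^2 + mu^2 - 1 - log sigma^2)."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

def reparameterize(mu, log_var, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I), so gradients can
    flow through mu and log_var (the reparameterization trick)."""
    return mu + np.exp(0.5 * log_var) * rng.normal(size=mu.shape)

rng = np.random.default_rng(0)
mu = np.zeros((4, 2))        # encoder means for a batch of 4
log_var = np.zeros((4, 2))   # encoder log-variances
z = reparameterize(mu, log_var, rng)

# When the posterior equals the prior (mu=0, sigma=1) the KL vanishes.
print(gaussian_kl(mu, log_var))  # [0. 0. 0. 0.]
```

In training, the KL term regularizes the encoder toward the prior while a reconstruction term (e.g. log-likelihood of the decoder output) pulls in the other direction.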

    DMRN+18: Digital Music Research Network One-day Workshop 2023

    DMRN+18: Digital Music Research Network One-day Workshop 2023. Queen Mary University of London, Tuesday 19th December 2023. Keynote speaker: Stefan Bilbao. The Digital Music Research Network (DMRN) aims to promote research in the area of digital music by bringing together researchers from UK and overseas universities, as well as industry, for its annual workshop. The workshop will include invited and contributed talks and posters, and will be an ideal opportunity for networking with other people working in the area. Keynote: Stefan Bilbao. Title: Physics-based Audio: Sound Synthesis and Virtual Acoustics. Abstract: Any acoustically produced sound must be the result of physical laws that describe the dynamics of a given system---always at least partly mechanical, and sometimes with an electronic element as well. One approach to the synthesis of natural acoustic timbres, thus, is through simulation, often referred to in this context as physical modelling, or physics-based audio. In this talk, the principles of physics-based audio and the various approaches to simulation are described, followed by a set of examples covering: various musical instrument types; the important related problem of the emulation of room acoustics, or "virtual acoustics"; the embedding of instruments in a 3D virtual space; electromechanical effects; and new modular instrument designs based on physical laws but without a counterpart in the real world. Some more technical details follow, including the strengths, weaknesses and limitations of such methods, and pointers to data-centred black-box approaches to sound generation and effects processing. The talk concludes with some musical examples and recent work on moving such algorithms to a real-time setting. Bio: Stefan Bilbao is a full Professor at the Reid School of Music, University of Edinburgh, where he holds the Personal Chair of Acoustics and Audio Signal Processing. He currently works on computational acoustics, for applications in sound synthesis and virtual acoustics. Special topics of interest include: finite difference time domain methods, distributed nonlinear systems such as strings and plates, architectural acoustics, spatial audio in simulation, multichannel sound synthesis, and hardware and software realizations. More information: https://www.acoustics.ed.ac.uk/group-members/dr-stefan-bilbao/ DMRN+18 is sponsored by the UKRI Centre for Doctoral Training in Artificial Intelligence and Music (AIM), a leading PhD research programme aimed at the Music/Audio Technology and Creative Industries, based at Queen Mary University of London.

    Text-Independent Voice Conversion

    This thesis deals with text-independent solutions for voice conversion. It first introduces the use of vocal tract length normalization (VTLN) for voice conversion. The presented variants of VTLN allow for easily changing speaker characteristics by means of a few trainable parameters. Furthermore, it is shown how VTLN can be expressed in the time domain, strongly reducing the computational costs while keeping a high speech quality. The second text-independent voice conversion paradigm is residual prediction. In particular, two proposed techniques, residual smoothing and the application of unit selection, substantially improve both speech quality and voice similarity. In order to apply the well-studied linear transformation paradigm to text-independent voice conversion, two text-independent speech alignment techniques are introduced. One is based on automatic segmentation and mapping of artificial phonetic classes, and the other is a completely data-driven approach with unit selection. The latter achieves a performance very similar to the conventional text-dependent approach in terms of speech quality and similarity. It is also successfully applied to cross-language voice conversion. The investigations of this thesis are based on several corpora of three different languages, i.e., English, Spanish, and German. Results are also presented from the multilingual voice conversion evaluation in the framework of the international speech-to-speech translation project TC-Star.
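The idea of changing speaker characteristics "by means of a few trainable parameters" can be made concrete with the standard bilinear (all-pass) frequency warp; the abstract does not specify which warping family the thesis uses, so this is only the most common VTLN choice, shown for illustration.

```python
import math

def bilinear_warp(omega, alpha):
    """Bilinear (all-pass) frequency warping, a standard VTLN choice.
    omega is a frequency in [0, pi]; alpha in (-1, 1) is the single
    trainable warping factor (alpha = 0 is the identity)."""
    return omega + 2.0 * math.atan2(alpha * math.sin(omega),
                                    1.0 - alpha * math.cos(omega))

# The band edges stay fixed while interior frequencies shift with alpha.
print(bilinear_warp(0.0, 0.3))            # 0.0
print(bilinear_warp(math.pi / 2, 0.3))    # > pi/2: spectrum stretched
print(bilinear_warp(math.pi / 2, -0.3))   # < pi/2: spectrum compressed
```

Applying such a warp to a source speaker's spectral envelope, with alpha trained to match the target speaker, is the core of VTLN-based conversion.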

    Sliding Mode Control

    The main objective of this monograph is to present a broad range of well-worked-out recent application studies, as well as theoretical contributions, in the field of sliding mode control system analysis and design. The contributions presented here include new theoretical developments as well as successful applications of variable structure controllers, primarily in the fields of power electronics, electric drives and motion steering systems. They enrich the current state of the art, and motivate and encourage new ideas and solutions in the sliding mode control area.

    An Indirect Speech Enhancement Framework Through Intermediate Noisy Speech Targets

    Noise presents a severe challenge in speech communication and processing systems. Speech enhancement aims at removing the interference and restoring speech quality. It is an essential step in the speech processing pipeline of many modern electronic devices, such as mobile phones and smart speakers. Traditionally, speech engineers have relied on signal processing techniques, such as spectral subtraction or Wiener filtering. Since the advent of deep learning, data-driven methods have offered an alternative solution to speech enhancement. Researchers and engineers have proposed various neural network architectures to map noisy speech features into clean ones. In this thesis, we refer to this class of mapping-based data-driven techniques collectively as the direct method in speech enhancement. The output speech from direct mapping methods usually contains residual noise and unpleasant distortion if the speech power is low relative to the noise power or the background noise is very complex. The former adverse condition is low signal-to-noise ratio (SNR); the latter implies difficult noise types. Researchers have proposed improving the SNR of the speech signal incrementally during enhancement to overcome this difficulty, an approach known as SNR-progressive speech enhancement. This design breaks the problem of direct mapping down into manageable sub-tasks. Inspired by this previous work, we propose a multi-stage indirect approach to speech enhancement in challenging noise conditions. Unlike SNR-progressive speech enhancement, we gradually transform speech in difficult background noise into speech in simple noise types. The thesis covers the characterization of background noise, speech transformation techniques, and the integration of an indirect speech enhancement system.
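Spectral subtraction, mentioned above as a traditional baseline, is simple enough to sketch for a single FFT frame. This is the textbook method, not the thesis's indirect framework; the floor value and the toy signal are illustrative assumptions.

```python
import numpy as np

def spectral_subtraction(noisy_spec, noise_mag, floor=0.01):
    """Classic magnitude spectral subtraction on one FFT frame.
    noisy_spec: complex spectrum of a noisy-speech frame
    noise_mag:  estimated noise magnitude spectrum (in practice,
                averaged over noise-only frames)
    The noisy phase is reused; magnitudes are floored to avoid
    negative values (a common source of 'musical noise')."""
    mag = np.abs(noisy_spec) - noise_mag
    mag = np.maximum(mag, floor * np.abs(noisy_spec))  # spectral floor
    return mag * np.exp(1j * np.angle(noisy_spec))

rng = np.random.default_rng(0)
n = np.arange(256)
clean = np.fft.rfft(np.sin(2 * np.pi * 8 * n / 256))  # one spectral peak
noise = np.fft.rfft(0.1 * rng.normal(size=256))
enhanced = spectral_subtraction(clean + noise, np.abs(noise))

# With a perfect noise-magnitude estimate the spectral error shrinks.
err_before = np.linalg.norm(np.abs(clean + noise) - np.abs(clean))
err_after = np.linalg.norm(np.abs(enhanced) - np.abs(clean))
print(err_after < err_before)  # True
```

In real conditions the noise estimate is imperfect and nonstationary, which is exactly the regime where the data-driven and multi-stage methods discussed in the thesis are motivated.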