4,141 research outputs found

Data-driven Extraction of Intonation Contour Classes

Author: Reichel Uwe D.
Publication date: 01/01/2007

In this paper we introduce the first steps towards a new data-driven method for the extraction of intonation events that does not require any prerequisite prosodic labelling. Provided with data segmented on the syllable constituent level, it derives local and global contour classes by stylisation and subsequent clustering of the stylisation parameter vectors. Local contour classes correspond to pitch movements connected to one or several syllables and determine the local f0 shape. Global classes are connected to intonation phrases and determine the f0 register. Local classes are initially derived for syllabic segments, which are then concatenated incrementally by means of statistical language modelling of co-occurrence patterns. Due to its generality, the method is in principle language-independent and potentially capable of dealing with aspects of prosody other than intonation.

CiteSeerX

Open Access LMU

Exploring efficient neural architectures for linguistic-acoustic mapping in text-to-speech

Author: Bonafonte Cávez Antonio
Pascual de la Puente Santiago
Serra Joan
Publication venue: MDPI AG
Publication date: 01/01/2019

Conversion from text to speech relies on the accurate mapping from linguistic to acoustic symbol sequences, for which current practice employs recurrent statistical models such as recurrent neural networks. Despite the good performance of such models (in terms of low distortion in the generated speech), their recursive structure with intermediate affine transformations tends to make them slow to train and to sample from. In this work, we explore two different mechanisms that enhance the operational efficiency of recurrent neural networks, and study their performance–speed trade-off. The first mechanism is based on the quasi-recurrent neural network, where expensive affine transformations are removed from temporal connections and placed only on feed-forward computational directions. The second mechanism includes a module based on the transformer decoder network, designed without recurrent connections but emulating them with attention and positioning codes. Our results show that the proposed decoder networks are competitive in terms of distortion when compared to a recurrent baseline, whilst being significantly faster in terms of CPU and GPU inference time. The best performing model is the one based on the quasi-recurrent mechanism, reaching the same level of naturalness as the recurrent neural network based model with a speedup of 11.2 on CPU and 3.3 on GPU. Peer reviewed. Postprint (published version).

Multidisciplinary Digital Publishing Institute

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

UPCommons. Portal del coneixement obert de la UPC

Robust Speaker Identification against Computer Aided Voice Impersonation

Author: Haider Zargham
Publication date: 01/12/2011

University of Surrey

Surrey Research Insight

Internal, Literature review. Automated Planetary Image Analysis and Associated Literature

Author: Paul Tar

CiteSeerX

Overcoming the limitations of statistical parametric speech synthesis

Author: Merritt Thomas
Publication venue: The University of Edinburgh
Publication date: 07/07/2017

At the time of beginning this thesis, statistical parametric speech synthesis (SPSS) using hidden Markov models (HMMs) was the dominant synthesis paradigm within the research community. SPSS systems are effective at generalising across the linguistic contexts present in training data to account for inevitable unseen linguistic contexts at synthesis-time, making these systems flexible and their performance stable. However, HMM synthesis suffers from a ‘ceiling effect’ in the naturalness achieved, meaning that, despite great progress, the speech output is rarely confused for natural speech. There are many hypotheses in the literature for the causes of reduced synthesis quality, and the subsequent required improvements, for HMM speech synthesis. However, until this thesis, these hypothesised causes were rarely tested. This thesis makes two types of contributions to the field of speech synthesis; each of these appears in a separate part of the thesis. Part I introduces a methodology for testing hypothesised causes of limited quality within HMM speech synthesis systems. This investigation aims to identify what causes these systems to fall short of natural speech. Part II uses the findings from Part I of the thesis to make informed improvements to speech synthesis. The usual approach taken to improve synthesis systems is to attribute reduced synthesis quality to a hypothesised cause. A new system is then constructed with the aim of removing that hypothesised cause. However, this is typically done without prior testing to verify the hypothesised cause of reduced quality. As such, even if improvements in synthesis quality are observed, there is no knowledge of whether a real underlying issue has been fixed or if a more minor issue has been fixed. In contrast, I perform a wide range of perceptual tests in Part I of the thesis to discover what the real underlying causes of reduced quality in HMM synthesis are and the level to which they contribute.
Using the knowledge gained in Part I of the thesis, Part II then looks to make improvements to synthesis quality. Two well-motivated improvements to standard HMM synthesis are investigated. The first of these improvements follows on from averaging across differing linguistic contexts being identified as a major contributing factor to reduced synthesis quality. This is a practice typically performed during decision tree regression in HMM synthesis. Therefore, a system which removes averaging across differing linguistic contexts and instead performs averaging only across matching linguistic contexts (called rich-context synthesis) is investigated. The second of the motivated improvements follows the finding that the parametrisation (i.e., vocoding) of speech, standard practice in SPSS, introduces a noticeable drop in quality before any modelling is even performed. Therefore, the hybrid synthesis paradigm is investigated. These systems aim to remove the effect of vocoding by using SPSS to inform the selection of units in a unit selection system. Both of the motivated improvements applied in Part II are found to make significant gains in synthesis quality, demonstrating the benefit of performing the style of perceptual testing conducted in the thesis.

Edinburgh Research Archive

Modelling and detection of faults in axial-flux permanent magnet machines

Author: Ogidi Oladapo Omotade
Publication venue: Department of Electrical Engineering
Publication date: 01/01/2016

The development of various topologies and configurations of the axial-flux permanent magnet machine has spurred its use for electromechanical energy conversion in several applications. As it becomes increasingly deployed, effective condition monitoring built on reliable and accurate fault detection techniques is needed to ensure its engineering integrity. Unlike the induction machine, which has been rigorously investigated for faults, the axial-flux permanent magnet machine has not. Thus, in this thesis, the axial-flux permanent magnet machine is investigated under faulty conditions. Common faults associated with it, namely static eccentricity and interturn short circuit, are modelled, and detection techniques are established. The modelling forms a basis for developing a platform for precise fault replication on a developed experimental test-rig, and for predicting and analysing fault signatures using both finite element analysis and experimental analysis. In the detection, motor current signature analysis, vibration analysis and electrical impedance spectroscopy are applied. Attention is paid to fault-feature extraction and fault discrimination. Using frequency and time-frequency techniques, features are tracked in the line current under steady-state and transient conditions respectively. The results obtained provide rich information on the pattern of fault harmonics. Parametric spectral estimation is also explored as an alternative to the Fourier transform in the steady-state analysis of faulty conditions. It is found to be as effective as the Fourier transform and more amenable to short signal-measurement durations. Vibration analysis is applied in the detection of eccentricities; its efficacy in fault detection hinges on proper determination of vibratory frequencies and quantification of corresponding tones. This is achieved using analytical formulations and signal processing techniques.
Furthermore, the developed fault model is used to assess the influence of cogging torque minimization techniques and rotor topologies in the axial-flux permanent magnet machine on the current signal in the presence of static eccentricity. The double-sided topology is found to be tolerant to the presence of static eccentricity, unlike the single-sided topology, due to the opposing effect of the resulting asymmetrical properties of the airgap. The cogging torque minimization techniques do not impair the established fault detection technique in the single-sided topology. By applying electrical broadband impedance spectroscopy, interturn faults are diagnosed; a high-frequency winding model is developed to analyse the impedance-frequency response obtained.

Cape Town University OpenUCT

Are we using enough listeners? No! An empirically-supported critique of Interspeech 2014 TTS evaluations

Author: Henter Gustav Eje
Valentini-Botinhao Cassia
Wester Mirjam
Publication date: 10/09/2015

Tallying the numbers of listeners that took part in subjective evaluations of synthetic speech at Interspeech 2014 showed that in more than 60% of papers conclusions are based on listening tests with fewer than 20 listeners. Our analysis of Blizzard 2013 data shows that for a MOS test measuring naturalness a stable level of significance is only reached when more than 30 listeners are used. In this paper, we set out a list of guidelines, i.e., a checklist for carrying out meaningful subjective evaluations. We further illustrate the importance of sentence coverage and number of listeners by presenting changes to rank order and number of significant pairs by re-analysing data from the Blizzard Challenge 2013. Index Terms: Subjective evaluation, text-to-speech, MOS test

CiteSeerX

Edinburgh Research Explorer

Time-domain speech enhancement using generative adversarial networks

Author: Bonafonte Cávez Antonio
Pascual de la Puente Santiago
Serra Joan
Publication venue: Elsevier BV
Publication date: 01/11/2019

Speech enhancement improves recorded voice utterances to eliminate noise that might be impeding their intelligibility or compromising their quality. Typical speech enhancement systems are based on regression approaches that subtract noise or predict clean signals. Most of them do not operate directly on waveforms. In this work, we propose a generative approach to regenerate corrupted signals into a clean version by using generative adversarial networks on the raw signal. We also explore several variations of the proposed system, obtaining insights into proper architectural choices for an adversarially trained, convolutional autoencoder applied to speech. We conduct both objective and subjective evaluations to assess the performance of the proposed method. The former helps us choose among variations and better tune hyperparameters, while the latter is used in a listening experiment with 42 subjects, confirming the effectiveness of the approach in the real world. We also demonstrate the applicability of the approach for more generalized speech enhancement, where we have to regenerate voices from whispered signals. Peer reviewed. Postprint (author's final draft).

UPCommons. Portal del coneixement obert de la UPC