4,141 research outputs found

    Data-driven Extraction of Intonation Contour Classes

    Get PDF
    In this paper we introduce the first steps towards a new datadriven method for extraction of intonation events that does not require any prerequisite prosodic labelling. Provided with data segmented on the syllable constituent level it derives local and global contour classes by stylisation and subsequent clustering of the stylisation parameter vectors. Local contour classes correspond to pitch movements connected to one or several syllables and determine the local f0 shape. Global classes are connected to intonation phrases and determine the f0 register. Local classes initially are derived for syllabic segments, which are then concatenated incrementally by means of statistical language modelling of co-occurrence patterns. Due to its generality the method is in principal language independent and potentially capable to deal also with other aspects of prosody than intonation. 1

    Exploring efficient neural architectures for linguistic-acoustic mapping in text-to-speech

    Get PDF
    Conversion from text to speech relies on the accurate mapping from linguistic to acoustic symbol sequences, for which current practice employs recurrent statistical models such as recurrent neural networks. Despite the good performance of such models (in terms of low distortion in the generated speech), their recursive structure with intermediate affine transformations tends to make them slow to train and to sample from. In this work, we explore two different mechanisms that enhance the operational efficiency of recurrent neural networks, and study their performance–speed trade-off. The first mechanism is based on the quasi-recurrent neural network, where expensive affine transformations are removed from temporal connections and placed only on feed-forward computational directions. The second mechanism includes a module based on the transformer decoder network, designed without recurrent connections but emulating them with attention and positioning codes. Our results show that the proposed decoder networks are competitive in terms of distortion when compared to a recurrent baseline, whilst being significantly faster in terms of CPU and GPU inference time. The best performing model is the one based on the quasi-recurrent mechanism, reaching the same level of naturalness as the recurrent neural network based model with a speedup of 11.2 on CPU and 3.3 on GPU.Peer ReviewedPostprint (published version

    Overcoming the limitations of statistical parametric speech synthesis

    Get PDF
    At the time of beginning this thesis, statistical parametric speech synthesis (SPSS) using hidden Markov models (HMMs) was the dominant synthesis paradigm within the research community. SPSS systems are effective at generalising across the linguistic contexts present in training data to account for inevitable unseen linguistic contexts at synthesis-time, making these systems flexible and their performance stable. However HMM synthesis suffers from a ‘ceiling effect’ in the naturalness achieved, meaning that, despite great progress, the speech output is rarely confused for natural speech. There are many hypotheses for the causes of reduced synthesis quality, and subsequent required improvements, for HMM speech synthesis in literature. However, until this thesis, these hypothesised causes were rarely tested. This thesis makes two types of contributions to the field of speech synthesis; each of these appears in a separate part of the thesis. Part I introduces a methodology for testing hypothesised causes of limited quality within HMM speech synthesis systems. This investigation aims to identify what causes these systems to fall short of natural speech. Part II uses the findings from Part I of the thesis to make informed improvements to speech synthesis. The usual approach taken to improve synthesis systems is to attribute reduced synthesis quality to a hypothesised cause. A new system is then constructed with the aim of removing that hypothesised cause. However this is typically done without prior testing to verify the hypothesised cause of reduced quality. As such, even if improvements in synthesis quality are observed, there is no knowledge of whether a real underlying issue has been fixed or if a more minor issue has been fixed. In contrast, I perform a wide range of perceptual tests in Part I of the thesis to discover what the real underlying causes of reduced quality in HMM synthesis are and the level to which they contribute. Using the knowledge gained in Part I of the thesis, Part II then looks to make improvements to synthesis quality. Two well-motivated improvements to standard HMM synthesis are investigated. The first of these improvements follows on from averaging across differing linguistic contexts being identified as a major contributing factor to reduced synthesis quality. This is a practice typically performed during decision tree regression in HMM synthesis. Therefore a system which removes averaging across differing linguistic contexts and instead performs averaging only across matching linguistic contexts (called rich-context synthesis) is investigated. The second of the motivated improvements follows the finding that the parametrisation (i.e., vocoding) of speech, standard practice in SPSS, introduces a noticeable drop in quality before any modelling is even performed. Therefore the hybrid synthesis paradigm is investigated. These systems aim to remove the effect of vocoding by using SPSS to inform the selection of units in a unit selection system. Both of the motivated improvements applied in Part II are found to make significant gains in synthesis quality, demonstrating the benefit of performing the style of perceptual testing conducted in the thesis

    Modelling and detection of faults in axial-flux permanent magnet machines

    Get PDF
    The development of various topologies and configurations of axial-flux permanent magnet machine has spurred its use for electromechanical energy conversion in several applications. As it becomes increasingly deployed, effective condition monitoring built on reliable and accurate fault detection techniques is needed to ensure its engineering integrity. Unlike induction machine which has been rigorously investigated for faults, axial-flux permanent magnet machine has not. Thus in this thesis, axial-flux permanent magnet machine is investigated under faulty conditions. Common faults associated with it namely; static eccentricity and interturn short circuit are modelled, and detection techniques are established. The modelling forms a basis for; developing a platform for precise fault replication on a developed experimental test-rig, predicting and analysing fault signatures using both finite element analysis and experimental analysis. In the detection, the motor current signature analysis, vibration analysis and electrical impedance spectroscopy are applied. Attention is paid to fault-feature extraction and fault discrimination. Using both frequency and time-frequency techniques, features are tracked in the line current under steady-state and transient conditions respectively. Results obtained provide rich information on the pattern of fault harmonics. Parametric spectral estimation is also explored as an alternative to the Fourier transform in the steady-state analysis of faulty conditions. It is found to be as effective as the Fourier transform and more amenable to short signal-measurement duration. Vibration analysis is applied in the detection of eccentricities; its efficacy in fault detection is hinged on proper determination of vibratory frequencies and quantification of corresponding tones. This is achieved using analytical formulations and signal processing techniques. Furthermore, the developed fault model is used to assess the influence of cogging torque minimization techniques and rotor topologies in axial-flux permanent magnet machine on current signal in the presence of static eccentricity. The double-sided topology is found to be tolerant to the presence of static eccentricity unlike the single-sided topology due to the opposing effect of the resulting asymmetrical properties of the airgap. The cogging torque minimization techniques do not impair on the established fault detection technique in the single-sided topology. By applying electrical broadband impedance spectroscopy, interturn faults are diagnosed; a high frequency winding model is developed to analyse the impedance-frequency response obtained

    Are we using enough listeners? No! An empirically-supported critique of Interspeech 2014 TTS evaluations

    Get PDF
    Tallying the numbers of listeners that took part in subjective evaluations of synthetic speech at Interspeech 2014 showed that in more than 60 % of papers conclusions are based on listen-ing tests with less than 20 listeners. Our analysis of Blizzard 2013 data shows that for a MOS test measuring naturalness a stable level of significance is only reached when more than 30 listeners are used. In this paper, we set out a list of guidelines, i.e., a checklist for carrying out meaningful subjective evalua-tions. We further illustrate the importance of sentence coverage and number of listeners by presenting changes to rank order and number of significant pairs by re-analysing data from the Bliz-zard Challenge 2013. Index Terms: Subjective evaluation, text-to-speech, MOS tes

    Time-domain speech enhancement using generative adversarial networks

    Get PDF
    Speech enhancement improves recorded voice utterances to eliminate noise that might be impeding their intelligibility or compromising their quality. Typical speech enhancement systems are based on regression approaches that subtract noise or predict clean signals. Most of them do not operate directly on waveforms. In this work, we propose a generative approach to regenerate corrupted signals into a clean version by using generative adversarial networks on the raw signal. We also explore several variations of the proposed system, obtaining insights into proper architectural choices for an adversarially trained, convolutional autoencoder applied to speech. We conduct both objective and subjective evaluations to assess the performance of the proposed method. The former helps us choose among variations and better tune hyperparameters, while the latter is used in a listening experiment with 42 subjects, confirming the effectiveness of the approach in the real world. We also demonstrate the applicability of the approach for more generalized speech enhancement, where we have to regenerate voices from whispered signals.Peer ReviewedPostprint (author's final draft
    corecore