2,302 research outputs found

    Median-based generation of synthetic speech durations using a non-parametric approach

    Get PDF
    This paper proposes a new approach to duration modelling for statistical parametric speech synthesis in which a recurrent statistical model is trained to output a phone transition probability at each timestep (acoustic frame). Unlike conventional approaches to duration modelling -- which assume that duration distributions have a particular form (e.g., a Gaussian) and use the mean of that distribution for synthesis -- our approach can in principle model any distribution supported on the non-negative integers. Generation from this model can be performed in many ways; here we consider output generation based on the median predicted duration. The median is more typical (more probable) than the conventional mean duration, is robust to training-data irregularities, and enables incremental generation. Furthermore, a frame-level approach to duration prediction is consistent with a longer-term goal of modelling durations and acoustic features together. Results indicate that the proposed method is competitive with baseline approaches in approximating the median duration of held-out natural speech.Comment: 7 pages, 1 figure -- Accepted for presentation at IEEE Workshop on Spoken Language Technology (SLT 2016

    Prosody generation for text-to-speech synthesis

    Get PDF
    The absence of convincing intonation makes current parametric speech synthesis systems sound dull and lifeless, even when trained on expressive speech data. Typically, these systems use regression techniques to predict the fundamental frequency (F0) frame-by-frame. This approach leads to overlysmooth pitch contours and fails to construct an appropriate prosodic structure across the full utterance. In order to capture and reproduce larger-scale pitch patterns, we propose a template-based approach for automatic F0 generation, where per-syllable pitch-contour templates (from a small, automatically learned set) are predicted by a recurrent neural network (RNN). The use of syllable templates mitigates the over-smoothing problem and is able to reproduce pitch patterns observed in the data. The use of an RNN, paired with connectionist temporal classification (CTC), enables the prediction of structure in the pitch contour spanning the entire utterance. This novel F0 prediction system is used alongside separate LSTMs for predicting phone durations and the other acoustic features, to construct a complete text-to-speech system. Later, we investigate the benefits of including long-range dependencies in duration prediction at frame-level using uni-directional recurrent neural networks. Since prosody is a supra-segmental property, we consider an alternate approach to intonation generation which exploits long-term dependencies of F0 by effective modelling of linguistic features using recurrent neural networks. For this purpose, we propose a hierarchical encoder-decoder and multi-resolution parallel encoder where the encoder takes word and higher level linguistic features at the input and upsamples them to phone-level through a series of hidden layers and is integrated into a Hybrid system which is then submitted to Blizzard challenge workshop. We then highlight some of the issues in current approaches and a plan for future directions of investigation is outlined along with on-going work

    A Hierarchical Encoder-Decoder Model for Statistical Parametric Speech Synthesis

    Get PDF

    Evaluating pause particles and their functions in natural and synthesized speech in laboratory and lecture settings

    Get PDF
    Pause-internal phonetic particles (PINTs) comprise a variety of phenomena including: phonetic-acoustic silence, inhalation and exhalation breath noises, filler particles “uh” and “um” in English, tongue clicks, and many others. These particles are omni-present in spontaneous speech, however, they are under-researched in both natural speech and synthetic speech. The present work explores the influence of PINTs in small-context recall experiments, develops a bespoke speech synthesis system that incorporates the PINTs pattern of a single speaker, and evaluates the influence of PINTs on recall for larger material lengths, namely university lectures. The benefit of PINTs on recall has been documented in natural speech in small-context laboratory settings, however, this area of research has been under-explored for synthetic speech. We devised two experiments to evaluate if PINTs have the same recall benefit for synthetic material that is found with natural material. In the first experiment, we evaluated the recollection of consecutive missing digits for a randomized 7-digit number. Results indicated that an inserted silence improved recall accuracy for digits immediately following. In the second experiment, we evaluated sentence recollection. Results indicated that sentences preceded by an inhalation breath noise were better recalled than those with no inhalation. Together, these results reveal that in single-sentence laboratory settings PINTs can improve recall for synthesized speech. The speech synthesis systems used in the small-context recall experiments did not provide much freedom in terms of controlling PINT type or location. Therefore, we endeavoured to develop bespoke speech synthesis systems. Two neural text-to-speech (TTS) systems were created: one that used PINTs annotation labels in the training data, and another that did not include any PINTs labeling in the training material. The first system allowed fine-tuned control for inserting PINTs material into the rendered material. The second system produced PINTs probabilistally. To the best of our knowledge, these are the first TTS systems to render tongue clicks. Equipped with greater control of synthesized PINTs, we returned to evaluating the recall benefit of PINTs. This time we evaluated the influence of PINTs on the recollection of key information in lectures, an ecologically valid task that focused on larger material lengths. Results indicated that key information that followed PINTs material was less likely to be recalled. We were unable to replicate the benefits of PINTs found in the small-context laboratory settings. This body of work showcases that PINTs improve recall for TTS in small-context environments just like previous work had indicated for natural speech. Additionally, we’ve provided a technological contribution via a neural TTS system that exerts finer control over PINT type and placement. Lastly, we’ve shown the importance of using material rendered by speech synthesis systems in perceptual studies.This research was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) within the project “Pause-internal phonetic particles in speech communication” (project number: 418659027; project IDs: MO 597/10-1 and TR 468/3-1). Associate member of SFB1102 “Information Density and Linguistic Encoding” (project number: 232722074)

    Contributions to Collective Dynamical Clustering-Modeling of Discrete Time Series

    Get PDF
    The analysis of sequential data is important in business, science, and engineering, for tasks such as signal processing, user behavior mining, and commercial transactions analysis. In this dissertation, we build upon the Collective Dynamical Modeling and Clustering (CDMC) framework for discrete time series modeling, by making contributions to clustering initialization, dynamical modeling, and scaling. We first propose a modified Dynamic Time Warping (DTW) approach for clustering initialization within CDMC. The proposed approach provides DTW metrics that penalize deviations of the warping path from the path of constant slope. This reduces over-warping, while retaining the efficiency advantages of global constraint approaches, and without relying on domain dependent constraints. Second, we investigate the use of semi-Markov chains as dynamical models of temporal sequences in which state changes occur infrequently. Semi-Markov chains allow explicitly specifying the distribution of state visit durations. This makes them superior to traditional Markov chains, which implicitly assume an exponential state duration distribution. Third, we consider convergence properties of the CDMC framework. We establish convergence by viewing CDMC from an Expectation Maximization (EM) perspective. We investigate the effect on the time to convergence of our efficient DTW-based initialization technique and selected dynamical models. We also explore the convergence implications of various stopping criteria. Fourth, we consider scaling up CDMC to process big data, using Storm, an open source distributed real-time computation system that supports batch and distributed data processing. We performed experimental evaluation on human sleep data and on user web navigation data. Our results demonstrate the superiority of the strategies introduced in this dissertation over state-of-the-art techniques in terms of modeling quality and efficiency

    Cross-Lingual Neural Network Speech Synthesis Based on Multiple Embeddings

    Get PDF
    The paper presents a novel architecture and method for speech synthesis in multiple languages, in voices of multiple speakers and in multiple speaking styles, even in cases when speech from a particular speaker in the target language was not present in the training data. The method is based on the application of neural network embedding to combinations of speaker and style IDs, but also to phones in particular phonetic contexts, without any prior linguistic knowledge on their phonetic properties. This enables the network not only to efficiently capture similarities and differences between speakers and speaking styles, but to establish appropriate relationships between phones belonging to different languages, and ultimately to produce synthetic speech in the voice of a certain speaker in a language that he/she has never spoken. The validity of the proposed approach has been confirmed through experiments with models trained on speech corpora of American English and Mexican Spanish. It has also been shown that the proposed approach supports the use of neural vocoders, i.e. that they are able to produce synthesized speech of good quality even in languages that they were not trained on

    Glottal Spectral Separation for Speech Synthesis

    Get PDF

    Concatenative speech synthesis: a Framework for Reducing Perceived Distortion when using the TD-PSOLA Algorithm

    Get PDF
    This thesis presents the design and evaluation of an approach to concatenative speech synthesis using the Titne-Domain Pitch-Synchronous OverLap-Add (I'D-PSOLA) signal processing algorithm. Concatenative synthesis systems make use of pre-recorded speech segments stored in a speech corpus. At synthesis time, the `best' segments available to synthesise the new utterances are chosen from the corpus using a process known as unit selection. During the synthesis process, the pitch and duration of these segments may be modified to generate the desired prosody. The TD-PSOLA algorithm provides an efficient and essentially successful solution to perform these modifications, although some perceptible distortion, in the form of `buzzyness', may be introduced into the speech signal. Despite the popularity of the TD-PSOLA algorithm, little formal research has been undertaken to address this recognised problem of distortion. The approach in the thesis has been developed towards reducing the perceived distortion that is introduced when TD-PSOLA is applied to speech. To investigate the occurrence of this distortion, a psychoacoustic evaluation of the effect of pitch modification using the TD-PSOLA algorithm is presented. Subjective experiments in the form of a set of listening tests were undertaken using word-level stimuli that had been manipulated using TD-PSOLA. The data collected from these experiments were analysed for patterns of co- occurrence or correlations to investigate where this distortion may occur. From this, parameters were identified which may have contributed to increased distortion. These parameters were concerned with the relationship between the spectral content of individual phonemes, the extent of pitch manipulation, and aspects of the original recordings. Based on these results, a framework was designed for use in conjunction with TD-PSOLA to minimise the possible causes of distortion. The framework consisted of a novel speech corpus design, a signal processing distortion measure, and a selection process for especially problematic phonemes. Rather than phonetically balanced, the corpus is balanced to the needs of the signal processing algorithm, containing more of the adversely affected phonemes. The aim is to reduce the potential extent of pitch modification of such segments, and hence produce synthetic speech with less perceptible distortion. The signal processingdistortion measure was developed to allow the prediction of perceptible distortion in pitch-modified speech. Different weightings were estimated for individual phonemes,trained using the experimental data collected during the listening tests.The potential benefit of such a measure for existing unit selection processes in a corpus-based system using TD-PSOLA is illustrated. Finally, the special-case selection process was developed for highly problematic voiced fricative phonemes to minimise the occurrence of perceived distortion in these segments. The success of the framework, in terms of generating synthetic speech with reduced distortion, was evaluated. A listening test showed that the TD-PSOLA balanced speech corpus may be capable of generating pitch-modified synthetic sentences with significantly less distortion than those generated using a typical phonetically balanced corpus. The voiced fricative selection process was also shown to produce pitch-modified versions of these phonemes with less perceived distortion than a standard selection process. The listening test then indicated that the signal processing distortion measure was able to predict the resulting amount of distortion at the sentence-level after the application of TD-PSOLA, suggesting that it may be beneficial to include such a measure in existing unit selection processes. The framework was found to be capable of producing speech with reduced perceptible distortion in certain situations, although the effects seen at the sentence-level were less than those seen in the previous investigative experiments that made use of word-level stimuli. This suggeststhat the effect of the TD-PSOLA algorithm cannot always be easily anticipated due to the highly dynamic nature of speech, and that the reduction of perceptible distortion in TD-PSOLA-modified speech remains a challenge to the speech community
    corecore