1,785 research outputs found

    Learning word vector representations based on acoustic counts


    A Multi-Level Representation of f0 using the Continuous Wavelet Transform and the Discrete Cosine Transform

    We propose a representation of f0 using the Continuous Wavelet Transform (CWT) and the Discrete Cosine Transform (DCT). The CWT decomposes the signal into various scales of selected frequencies, while the DCT compactly represents complex contours as a weighted sum of cosine functions. The proposed approach has the advantage of combining signal decomposition and higher-level representations, thus modeling low frequencies at higher levels and high frequencies at lower levels. Objective results indicate that this representation improves f0 prediction over traditional short-term approaches. Subjective results show that improvements are seen over the typical MSD-HMM and are comparable to the recently proposed CWT-HMM, while using fewer parameters. These results are discussed and future lines of research are proposed. Index Terms: prosody, HMM-based synthesis, f0 modeling, continuous wavelet transform, discrete cosine transform
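The DCT half of this idea can be sketched as follows: a smooth f0 contour is summarised by its first few DCT coefficients, which act as weights on low-frequency cosine basis functions. This is a minimal illustration with an invented contour, using SciPy's `dct`/`idct` as the transform implementation; it is not the paper's exact parameterisation.

```python
import numpy as np
from scipy.fft import dct, idct

# Invented f0 contour (Hz): a smooth rise-fall over 100 frames.
frames = np.arange(100)
f0 = 120 + 30 * np.sin(np.pi * frames / 99)

# DCT-II with orthonormal scaling: the contour becomes a weighted
# sum of cosine functions, as in the abstract.
coeffs = dct(f0, type=2, norm='ortho')

# Keep only the first K coefficients: a compact low-frequency summary.
K = 8
compact = np.zeros_like(coeffs)
compact[:K] = coeffs[:K]
f0_approx = idct(compact, type=2, norm='ortho')

# A smooth contour is reconstructed closely from 8 of 100 numbers.
rmse = np.sqrt(np.mean((f0 - f0_approx) ** 2))
```

Truncating the DCT in this way is what makes the representation compact: high-order (high-frequency) cosine terms are simply dropped.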

    Speaker-Independent Classification of Phonetic Segments from Raw Ultrasound in Child Speech

    Ultrasound tongue imaging (UTI) provides a convenient way to visualize the vocal tract during speech production. UTI is increasingly being used for speech therapy, making it important to develop automatic methods to assist various time-consuming manual tasks currently performed by speech therapists. A key challenge is to generalize the automatic processing of ultrasound tongue images to previously unseen speakers. In this work, we investigate the classification of phonetic segments (tongue shapes) from raw ultrasound recordings under several training scenarios: speaker-dependent, multi-speaker, speaker-independent, and speaker-adapted. We observe that models underperform when applied to data from speakers not seen at training time. However, when provided with minimal additional speaker information, such as the mean ultrasound frame, the models generalize better to unseen speakers. Comment: 5 pages, 4 figures, published in ICASSP 2019 (IEEE International Conference on Acoustics, Speech and Signal Processing, 2019)
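The mean-ultrasound-frame idea can be sketched as follows. This is a minimal illustration with synthetic data: the frame dimensions, the random frames, and the channel-concatenation strategy are assumptions for illustration, not the paper's exact setup.

```python
import numpy as np

# Synthetic stand-in for one speaker's recording: a stack of 200
# ultrasound frames of size 64x128 (dimensions are made up).
rng = np.random.default_rng(0)
frames = rng.random((200, 64, 128), dtype=np.float32)

# The mean frame summarises speaker-specific anatomy and probe
# placement, independent of what is being said in any one frame.
mean_frame = frames.mean(axis=0)

# One simple way to expose it to a classifier: attach the mean frame
# to every input frame as a second channel -> shape (N, 2, H, W).
inputs = np.stack(
    [frames, np.broadcast_to(mean_frame, frames.shape)], axis=1
)
```

Because the mean frame is constant per speaker, the network can use it as a reference against which each individual frame is interpreted.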

    Silent versus modal multi-speaker speech recognition from ultrasound and video

    We investigate multi-speaker speech recognition from ultrasound images of the tongue and video images of the lips. We train our systems on imaging data from modal speech, and evaluate on matched test sets of two speaking modes: silent and modal speech. We observe that silent speech recognition from imaging data underperforms compared to modal speech recognition, likely due to a speaking-mode mismatch between training and testing. We improve silent speech recognition performance using techniques that address the domain mismatch, such as fMLLR and unsupervised model adaptation. We also analyse the properties of silent and modal speech in terms of utterance duration and the size of the articulatory space. To estimate the articulatory space, we compute the convex hull of tongue splines, extracted from ultrasound tongue images. Overall, we observe that the duration of silent speech is longer than that of modal speech, and that silent speech covers a smaller articulatory space than modal speech. Although these two properties are statistically significant across speaking modes, they do not directly correlate with word error rates from speech recognition. Comment: 5 pages, 5 figures, submitted to Interspeech 202
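The convex-hull measurement can be sketched with SciPy's `ConvexHull`. The points below are synthetic stand-ins for real tongue-spline coordinates pooled over an utterance; in 2-D, the hull's `volume` attribute is the enclosed area.

```python
import numpy as np
from scipy.spatial import ConvexHull

# Synthetic stand-in for tongue-spline (x, y) points, in mm, pooled
# over all frames of an utterance (real points would come from
# splines fitted to ultrasound tongue images).
rng = np.random.default_rng(1)
points = rng.normal(loc=[0.0, 0.0], scale=[10.0, 5.0], size=(500, 2))

# Convex hull of all spline points. For 2-D input, `hull.volume` is
# the enclosed area: a proxy for the size of the articulatory space.
hull = ConvexHull(points)
area = hull.volume
```

Comparing this area between silent and modal utterances of the same speaker gives a single scalar summary of how much of the articulatory space each speaking mode covers.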

    Predicting pairwise preferences between TTS audio stimuli using parallel ratings data and anti-symmetric twin neural networks

    Automatically predicting the outcome of subjective listening tests is a challenging task. Ratings may vary from person to person even if preferences are consistent across listeners. While previous work has focused on predicting listeners' ratings (mean opinion scores) of individual stimuli, we focus on the simpler task of predicting subjective preference given two speech stimuli for the same text. We propose a model based on anti-symmetric twin neural networks, trained on pairs of waveforms and their corresponding preference scores. We explore both attention and recurrent neural nets to account for the fact that stimuli in a pair are not time aligned. To obtain a large training set we convert listeners' ratings from MUSHRA tests to values that reflect how often one stimulus in the pair was rated higher than the other. Specifically, we evaluate performance on data obtained from twelve MUSHRA evaluations conducted over five years, containing different TTS systems, built from data of different speakers. Our results compare favourably to a state-of-the-art model trained to predict MOS scores.
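Both the rating conversion and the anti-symmetry property can be sketched in a few lines. The ratings below are made up, and the linear map `g` is a stand-in for the trained twin embedding network; only the structure of the computation is taken from the abstract.

```python
import numpy as np

# Made-up MUSHRA ratings: rows are listeners; the two columns are two
# TTS systems rated on the same text (0-100 scale).
ratings = np.array([
    [70, 55],
    [62, 66],
    [80, 60],
    [75, 50],
])

# Convert parallel ratings into a pairwise preference target: the
# fraction of listeners who rated system A above system B, with ties
# counting as half.
a, b = ratings[:, 0], ratings[:, 1]
preference = np.mean(np.where(a > b, 1.0, np.where(a < b, 0.0, 0.5)))

# Anti-symmetric twin scoring: both stimuli pass through the SAME
# embedding g, and the model predicts sigmoid(g(a) - g(b)). Swapping
# the inputs flips the sign of the logit, so P(a > b) = 1 - P(b > a)
# holds by construction. Here g is a stand-in linear map.
def g(x, w):
    return float(np.dot(w, x))

def p_prefer(x_a, x_b, w):
    logit = g(x_a, w) - g(x_b, w)
    return 1.0 / (1.0 + np.exp(-logit))
```

The twin (shared-weight) structure is what guarantees anti-symmetry: any function of the difference of a shared embedding flips sign when the pair is swapped, so the model cannot give inconsistent answers for (a, b) and (b, a).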

    A Study of Flow Characteristics Near a Channel Confluence Using CCHE 2D/3D Models

    Source: ICHE Conference Archive - https://mdi-de.baw.de/icheArchiv

    Suprasegmental representations for the modeling of fundamental frequency in statistical parametric speech synthesis

    Statistical parametric speech synthesis (SPSS) has seen improvements over recent years, especially in terms of intelligibility. Synthetic speech is often clear and understandable, but it can also be bland and monotonous. Proper generation of natural speech prosody is still a largely unsolved problem. This is relevant especially in the context of expressive audiobook speech synthesis, where speech is expected to be fluid and captivating. In general, prosody can be seen as a layer that is superimposed on the segmental (phone) sequence. Listeners can perceive the same melody or rhythm in different utterances, and the same segmental sequence can be uttered with a different prosodic layer to convey a different message. For this reason, prosody is commonly accepted to be inherently suprasegmental. It is governed by longer units within the utterance (e.g. syllables, words, phrases) and beyond the utterance (e.g. discourse). However, common techniques for the modeling of speech prosody - and speech in general - operate mainly on very short intervals, either at the state or frame level, in both hidden Markov model (HMM) and deep neural network (DNN) based speech synthesis. This thesis presents contributions supporting the claim that stronger representations of suprasegmental variation are essential for the natural generation of fundamental frequency for statistical parametric speech synthesis. We conceptualize the problem by dividing it into three sub-problems: (1) representations of acoustic signals, (2) representations of linguistic contexts, and (3) the mapping of one representation to another. The contributions of this thesis provide novel methods and insights relating to these three sub-problems. In terms of sub-problem 1, we propose a multi-level representation of f0 using the continuous wavelet transform and the discrete cosine transform, as well as a wavelet-based decomposition strategy that is linguistically and perceptually motivated. 
In terms of sub-problem 2, we investigate additional linguistic features, such as text-derived word embeddings and a syllable-level bag-of-phones representation, and we propose a novel method for learning word vector representations based on acoustic counts. Finally, considering sub-problem 3, insights are given regarding hierarchical models such as parallel and cascaded deep neural networks.

    Mirroring the mind: towards an analysis of psychological space in Vladimir Nabokov's Lolita and of its visual manifestations in Vladimir Nabokov's Lolita: A Screenplay and Stephen Schiff's Lolita: The Book of the Film

    Master's thesis, English Studies, Universidade de Lisboa, Faculdade de Letras, 2010. It is the aim of this dissertation to examine Vladimir Nabokov's Lolita (1955) and the two filmed screenplays: Lolita: A Screenplay (1974), written by Vladimir Nabokov, which was the foundation for Stanley Kubrick's film in 1962; and Lolita: The Book of the Film (1997), which was used to film Adrian Lyne's adaptation in 1997. Film is mainly a visual medium, and the screenplay, as part of it, shares its characteristics, being defined by a strong visual language. The adaptation of the novel to a visual domain is relevant since it accentuates the narrative as a predominantly psychological space, in which character is governed by perspective. The first chapter of this dissertation focuses on Nabokov's novel from a literary point of view, exploring the double and the mirror techniques that are characteristic of the representation of a fragmented identity. The second and third chapters focus on the filmed screenplays, the first by Vladimir Nabokov, published in 1974, and the second by Stephen Schiff, written in 1997. The practicability of these screenplays will not be discussed, as this dissertation's purpose is mainly to explore the similarities and differences with respect to Nabokov's novel and, above all, how effectively each screenplay achieves a visual narration of the character's subjectivity. The last two chapters are therefore mainly concerned with the screenwriters' choices and options in adapting a literary text in order to achieve a strong visual narrative.

    Parallel and cascaded deep neural networks for text-to-speech synthesis
