    A weighted superposition of functional contours model for modelling contextual prominence of elementary prosodic contours

    The way speech prosody encodes linguistic, paralinguistic and non-linguistic information via multiparametric representations of the speech signal is still an open issue. The Superposition of Functional Contours (SFC) model decomposes prosody into elementary multiparametric functional contours through the iterative training of neural-network contour generators using analysis-by-synthesis. Each generator is responsible for computing multiparametric contours that encode one given piece of linguistic, paralinguistic or non-linguistic information on a variable scope of rhythmic units. The contributions of all generators' outputs are then overlapped and added to produce the prosody of the utterance. We propose the weighted SFC (WSFC) model, an extension of the contour generators that allows them to model the prominence of the elementary contours based on contextual information. The WSFC jointly learns the patterns of the elementary multiparametric functional contours and their weights, which depend on the contours' contexts. The experimental results show that the proposed WSFC model can successfully capture contour prominence and thus improve SFC modelling performance. The WSFC is also shown to be effective at modelling the impact of attitudes on the prominence of functional contours cuing syntactic relations in French, and that of emphasis on the prominence of tone contours in Chinese.
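    As a rough illustration of the overlap-and-add step described above, the following Python/NumPy sketch sums weighted elementary contours into a single utterance-level prosody contour; the shapes, the scalar prominence weights and the function name are illustrative assumptions, not the authors' implementation.

        import numpy as np

        def weighted_superposition(contours, weights):
            """Overlap-and-add weighted elementary contours into one prosody contour.

            contours : list of (start_unit, contour) pairs; each contour is an array
                       of shape (n_units, n_params), e.g. pitch and duration
                       coefficients per rhythmic unit of the contour's scope.
            weights  : one scalar prominence weight per contour, e.g. predicted
                       from that contour's context.
            """
            n_params = contours[0][1].shape[1]
            length = max(start + c.shape[0] for start, c in contours)
            prosody = np.zeros((length, n_params))
            for (start, contour), w in zip(contours, weights):
                prosody[start:start + contour.shape[0]] += w * contour
            return prosody

        # Toy usage: a clause-level contour over six rhythmic units plus an
        # emphasis contour over units 2-4, the latter scaled by its context weight.
        clause = (0, np.random.randn(6, 2))
        emphasis = (2, np.random.randn(3, 2))
        utterance_prosody = weighted_superposition([clause, emphasis], [1.0, 0.7])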

    Non-Native Differences in Prosodic-Construction Use

    Many language learners never acquire truly native-sounding prosody. Previous work has suggested that this involves skill deficits in the dialog-related uses of prosody, and may be attributable to weaknesses with specific prosodic constructions. Using semi-automated methods, we identified 32 of the most common prosodic constructions in English dialog. Examining 90 minutes of dialog in which six advanced native-Spanish learners conversed in English, we found differences, notably regarding swift turn-taking, alignment, and empathy, but overall their uses of prosodic constructions were largely similar to those of native speakers.

    Toward invariant functional representations of variable surface fundamental frequency contours: Synthesizing speech melody via model-based stochastic learning

    Variability has been one of the major challenges for both theoretical understanding and computer synthesis of speech prosody. In this paper we show that economical representation of variability is the key to effective modeling of prosody. Specifically, we report the development of PENTAtrainer, a trainable yet deterministic prosody synthesizer based on an articulatory-functional view of speech. We show with testing results on Thai, Mandarin and English that it is possible to achieve high-accuracy predictive synthesis of fundamental frequency contours with very small sets of parameters obtained through stochastic learning from real speech data. The first key component of this system is syllable-synchronized sequential target approximation, implemented as the qTA model, which is designed to simulate, for each tonal unit, a wide range of contextual variability with a single invariant target. The second key component is the automatic learning of function-specific targets through stochastic global optimization, guided by a layered pseudo-hierarchical functional annotation scheme, which requires manual labeling of only the temporal domains of the functional units. The results in terms of synthesis accuracy demonstrate that effective modeling of contextual variability is also the key to effective modeling of function-related variability. Additionally, we show that, being both theory-based and trainable (hence data-driven), computational systems like PENTAtrainer can serve as an effective modeling tool in basic research, raising the level of falsifiability in theory testing and strengthening the link between basic and applied research in speech science.
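    To make the target-approximation idea concrete, here is a small Python sketch of a qTA-style syllable: f0 approaches a linear pitch target x(t) = m*t + b as a critically damped third-order response, and the state at the syllable's offset is carried into the next syllable. Parameter names, units and the sampling rate are illustrative assumptions; this is not the PENTAtrainer code.

        import numpy as np

        def qta_syllable(m, b, lam, duration, state, fs=200):
            """Generate one syllable's f0 contour under sequential target approximation.

            m, b     : slope and height of the pitch target x(t) = m*t + b
            lam      : rate of approximation towards the target
            duration : syllable duration in seconds
            state    : (f0, velocity, acceleration) at syllable onset, taken from
                       the end of the previous syllable for continuity
            fs       : f0 samples per second
            """
            t = np.arange(0.0, duration, 1.0 / fs)
            y0, v0, a0 = state
            # Transient coefficients fixed by the onset f0, velocity and acceleration.
            c1 = y0 - b
            c2 = v0 + c1 * lam - m
            c3 = (a0 + 2.0 * c2 * lam - c1 * lam ** 2) / 2.0
            poly = c1 + c2 * t + c3 * t ** 2
            f0 = (m * t + b) + poly * np.exp(-lam * t)
            # State at the syllable offset, passed on as the next syllable's onset state.
            T, e = duration, np.exp(-lam * duration)
            pT, dpT = c1 + c2 * T + c3 * T ** 2, c2 + 2.0 * c3 * T
            yT = m * T + b + pT * e
            vT = m + (dpT - lam * pT) * e
            aT = (2.0 * c3 - 2.0 * lam * dpT + lam ** 2 * pT) * e
            return f0, (yT, vT, aT)

        # Toy usage: a high level target followed by a falling target (f0 in semitones).
        f0_1, state = qta_syllable(m=0.0, b=12.0, lam=20.0, duration=0.20,
                                   state=(10.0, 0.0, 0.0))
        f0_2, state = qta_syllable(m=-40.0, b=12.0, lam=20.0, duration=0.25, state=state)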

    Film Dialogue Translation And The Intonation Unit : Towards Equivalent Effect In English And Chinese

    This thesis proposes a new approach to film dialogue translation (FDT) with special reference to the translation process and quality of English-to-Chinese dubbing. In response to the persistent translation failures that led to widespread criticism of dubbed films and TV plays in China for their artificial 'translation talk', this study provides a pragmatic methodology derived from integrating the theories and analytical systems of information flow, in the tradition of the functionalist approach to speech and writing, with relevant theoretical and empirical findings from translation studies (TS) and other related branches of linguistics. It has developed and validated a translation model (FITNIATS) which makes the intonation unit (IU) the central unit of film dialogue translation. Arguing that any translation which treats dubbing as a simple script-to-script process, without transferring the prosodic properties of the spoken words into the commensurate functions of the TL, is incomplete, the thesis demonstrates that, in order to reduce confusion and loss of meaning and rhythm, the SL dialogue should be rendered in IUs with the stressed syllables well-timed in the TL to keep the corresponding information foci in sync with the visual message. It shows that adhering to the sentence-to-sentence formula as the translation metastrategy, with the information structure of the original film dialogue permuted, can result in serious stylistic as well as communicative problems. Five key theoretical issues in TS are addressed in the context of FDT, viz., the relations between micro-structure and macro-structure translation perspectives, foreignizing vs. domesticating translation, the unit of translation, the levels of translation equivalence, and the criteria for evaluating translation quality. If equivalent effect is to be achieved in all relevant dimensions, it is argued that 'FITness criteria' need to be met in film translation assessment, and four such criteria are proposed. This study demonstrates that prosody and word order, as sensitive indices of the information flow which occurs in film dialogue through the creation and perception of meaning, can provide a basis for minimizing cross-linguistic discrepancies and compensating for loss of the FIT functions, especially where conflicts arise between the syntactic and/or medium constraints and the adequate transfer of culture-specific content and style. The implications of the model for subtitling are also made explicit.

    Intonation in a text-to-speech conversion system


    Fundamental frequency modelling: an articulatory perspective with target approximation and deep learning

    Current statistical parametric speech synthesis (SPSS) approaches typically aim at state/frame-level acoustic modelling, which leads to a problem of frame-by-frame independence. Moreover, whichever learning technique is used, whether a hidden Markov model (HMM), a deep neural network (DNN) or a recurrent neural network (RNN), the fundamental idea is to set up a direct mapping from linguistic to acoustic features. Although progress is frequently reported, this idea is questionable in terms of biological plausibility. This thesis aims at addressing the above issues by integrating dynamic mechanisms of human speech production as a core component of F0 generation and thus developing a more human-like F0 modelling paradigm. By introducing an articulatory F0 generation model, target approximation (TA), between text and speech that controls syllable-synchronised F0 generation, contextual F0 variations are processed in two separate yet integrated stages: linguistic to motor, and motor to acoustic. With the goal of demonstrating that human speech movement can be considered a dynamic process of target approximation and that the TA model is a valid F0 generation model for the motor-to-acoustic stage, a TA-based pitch control experiment is conducted first to simulate the subtle human behaviour of online compensation for pitch-shifted auditory feedback. Then, the TA parameters are collectively controlled by linguistic features via a deep or recurrent neural network (DNN/RNN) at the linguistic-to-motor stage. We trained the systems on a Mandarin Chinese dataset consisting of both statements and questions. The TA-based systems generally outperformed the baseline systems in both objective and subjective evaluations. Furthermore, the set of required linguistic features was reduced, first to syllable level only (with the DNN) and then with all positional information removed (with the RNN). Fewer linguistic features as input, together with a limited number of TA parameters as output, led to less training data and lower model complexity, which in turn led to more efficient training and faster synthesis.
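    A minimal sketch of the linguistic-to-motor stage described above, using a small feed-forward network in PyTorch; the feature dimension, layer sizes and the three-parameter output are assumptions made for illustration rather than the thesis's architecture.

        import torch
        from torch import nn

        class LinguisticToTA(nn.Module):
            """Map syllable-level linguistic features to per-syllable TA parameters."""

            def __init__(self, n_features=30, hidden=64):
                super().__init__()
                self.net = nn.Sequential(
                    nn.Linear(n_features, hidden),
                    nn.ReLU(),
                    nn.Linear(hidden, hidden),
                    nn.ReLU(),
                    nn.Linear(hidden, 3),   # target slope, target height, approximation rate
                )

            def forward(self, syllable_features):
                # syllable_features: (n_syllables, n_features), one row per syllable
                return self.net(syllable_features)

        # Toy usage: predict TA parameters for eight syllables; a motor-to-acoustic
        # stage (e.g. a qTA-style generator) would then synthesise f0 syllable by
        # syllable, carrying the articulatory state across syllable boundaries.
        model = LinguisticToTA()
        ta_params = model(torch.randn(8, 30))   # shape: (8, 3)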