
    Explaining the PENTA model: a reply to Arvaniti and Ladd

    This paper presents an overview of the Parallel Encoding and Target Approximation (PENTA) model of speech prosody, in response to an extensive critique by Arvaniti & Ladd (2009). PENTA is a framework for conceptually and computationally linking communicative meanings to fine-grained prosodic details, based on an articulatory-functional view of speech. Target Approximation simulates the articulatory realisation of underlying pitch targets, the prosodic primitives in the framework. Parallel Encoding provides an operational scheme that enables simultaneous encoding of multiple communicative functions. We also outline how PENTA can be computationally tested with a set of software tools. With the help of one of the tools, we offer a PENTA-based hypothetical account of the Greek intonational patterns reported by Arvaniti & Ladd, showing how it is possible to predict the prosodic shapes of an utterance based on the lexical and postlexical meanings it conveys.

    Toward invariant functional representations of variable surface fundamental frequency contours: Synthesizing speech melody via model-based stochastic learning

    Variability has been one of the major challenges for both theoretical understanding and computer synthesis of speech prosody. In this paper we show that economical representation of variability is the key to effective modeling of prosody. Specifically, we report the development of PENTAtrainer, a trainable yet deterministic prosody synthesizer based on an articulatory–functional view of speech. We show with testing results on Thai, Mandarin and English that it is possible to achieve high-accuracy predictive synthesis of fundamental frequency contours with very small sets of parameters obtained through stochastic learning from real speech data. The first key component of this system is syllable-synchronized sequential target approximation, implemented as the qTA model, which is designed to simulate, for each tonal unit, a wide range of contextual variability with a single invariant target. The second key component is the automatic learning of function-specific targets through stochastic global optimization, guided by a layered pseudo-hierarchical functional annotation scheme, which requires the manual labeling of only the temporal domains of the functional units. The results in terms of synthesis accuracy demonstrate that effective modeling of the contextual variability is also the key to effective modeling of function-related variability. Additionally, we show that, being both theory-based and trainable (hence data-driven), computational systems like PENTAtrainer can serve as an effective modeling tool in basic research, with which the level of falsifiability in theory testing can be raised, and a closer link between basic and applied research in speech science can be developed.

    The dynamics of Japanese prosody

    This dissertation explores aspects of the prosody of Tokyo Japanese (henceforth Japanese) through acoustic analysis and analysis-by-synthesis. It 1) revisits existing issues in Japanese prosody with minimal use of abstract notions and 2) tests whether the Parallel Encoding and Target Approximation framework (Xu, 2005) is suitable for Japanese, a pitch accent language. The first part of the dissertation considers the nature of lexical pitch accent by examining factors that affect the surface F0 realisation of an accent peak (Chapter 2) and establishing the articulatory domain that hosts a tonal target in Japanese (Chapter 3). Next, pitch accent interactions with other communicative functions are considered, specifically in terms of focus (Chapter 4) and sentence type (Chapter 5). Hypotheses based on the acoustic analyses in the previous chapters are then verified through analysis-by-synthesis with the articulatory synthesisers AMtrainer, PENTAtrainer1, and PENTAtrainer2 (Chapter 6).
    Chapter 2 provides conclusive evidence that Japanese is a two-tone language rather than one with three underlying tones in its phonology, an issue previously unresolved in the literature. Proponents of the two-tone hypothesis gather evidence from perception: when stimuli are played in isolation, native listeners can only distinguish two tone levels (High and Low). On the other hand, production evidence robustly reveals three distinct surface F0 levels. Using a series of linear regression analyses, I show that the third tone level can be interpreted as a result of pre-low raising, a common articulatory phenomenon. The F0 of an accent peak is inversely correlated with the F0 of the following low target, the peak being enhanced in preparation for the upcoming L. Interpreting this together with native listeners' inability to hear three tones in isolation, as repeatedly reported in previous studies, I establish that Japanese has only H and L in its tonal inventory.
    Chapter 3 establishes the syllable as the tone-bearing unit in Japanese tonal articulation. Japanese is often described as a mora-timed language, and it has previously been unclear whether articulatory tonal targets are hosted in a mora or a syllable. When comparing accented words of various syllable structures, I found that the F0 accent peak of CVCV words occurs consistently earlier than that of CVn/CVV words. CVCV words are longer in total duration, so their earlier F0 peak is a result of a shorter tone-bearing unit (i.e. two consecutive short morae/syllables). CVn/CVV words, on the other hand, have a later F0 peak because they host an articulatory target as one long syllable rather than as two short morae. I further verified the syllable hypothesis using two articulatory synthesisers, PENTAtrainer1 and PENTAtrainer2. Treating the syllable as the tone-bearing unit requires fewer predictors but provides better learning accuracy.
    Chapter 4 explores focus prosody in declarative sentences. Using a newly collected corpus of 6251 sentences that controls for accent condition, focus condition, sentence type, and sentence length, I challenge the widely held idea that post-focus compression of F0 range is accent-independent. It is currently generally accepted that, regardless of the accent condition of the focused word, the excursion size of the 'initial rise' that marks the beginning of the first word after focus is reduced. However, confining the notion of post-focus compression to the initial rise (usually extending across only two morae) sets Japanese apart from other languages like English or Mandarin, where such compression is robust across the entire post-focus domain. I show that when F0 range is measured across a wider domain, compression is absent. Where post-focus compression is absent, the F0 trajectory appears to be a result of articulatory carryover effects. This is interpreted as a result of weak articulatory strength in the post-focus domain, explaining the difference in F0 trajectories between long and short utterances.
    Chapter 5 builds on the previous chapter by additionally considering focus prosody in yes/no questions. I investigate what marks a yes/no question, and how focus prosody differs in declarative and interrogative utterances. Acoustic analyses show that questions are marked by a final rise, but the exact shape of such a rise depends on the accent condition of the sentence-final word. When compared to declarative sentences, the key differences in yes/no questions include: a higher F0 level; the absence of post-focus compression even in contexts where it is otherwise observed in statements; and on-focus F0 raising as the only robust focus marker. These findings indicate that interrogative focus prosody is not an amalgamation of focus markers and question markers, and have implications for the representation of Japanese intonation.
    Chapter 6 verifies the observations established thus far through analysis-by-synthesis. I demonstrate comparative modeling as a means to adjudicate between competing theories using PENTAtrainer2, PENTAtrainer1 and AMtrainer. In terms of local fitting accuracy, AMtrainer yielded synthesis accuracy comparable to that of the PENTAtrainers. Finally, I further demonstrate the compatibility of PENTA with Japanese prosody, showing highly accurate F0 predictive analysis (when trained with Chapter 2 production data) and highly satisfactory speaker-dependent synthesis accuracy (when trained with Chapter 4 and 5 sentential data). Naturalness judgment ratings show that the natural stimuli sound as natural as the synthetic stimuli, though questions generally sound less natural than statements. Reasons for this discrepancy are discussed with reference to the design of the stimuli.

    Pre-Low Raising in Japanese Pitch Accent

    Japanese has been observed to have 2 versions of the H tone, the higher of which is associated with an accented mora. However, the distinction between these 2 versions surfaces only in context and not in isolation, leading to a long-standing debate over whether there is 1 H tone or 2. This article reports evidence that the higher version may result from a pre-low raising mechanism rather than being inherently higher. The evidence is based on an analysis of F0 of words that varied in length, accent condition and syllable structure, produced by native speakers of Japanese at 2 speech rates. The data indicate a clear separation between effects that are due to mora-level preplanning and those that are mechanical. These results are discussed in terms of mechanisms of laryngeal control during tone production, and highlight the importance of articulation as a link between phonology and surface acoustics.

    How Movie Dubbing Can Help Native Chinese Speakers’ English Pronunciation

    The purpose of this study was to determine whether the use of English movie scripts and movie dubbing activities can help native Chinese speakers improve their awareness of prosodic features in English, specifically sentence stress. The literature review explores Chinese and English prosody, movie dubbing and ideal pronunciation standards. A qualitative research paradigm was used to explore the hypothesis that hearing and mimicking the natural speech patterns of native speakers can help native Chinese speakers improve their awareness of sentence stress in English. After three cycles of language instruction and language discrimination activities, seven students were chosen for a case study. Data collected from their responses to activities and questionnaires were analyzed. The results indicate that these students' actual ability to hear sentence stress is greater than their theoretical awareness of sentence stress rules. The author concludes with recommendations for adapting movie dubbing activities and suggestions for future research.

    Multiple prosodic meanings are conveyed through separate pitch ranges: Evidence from perception of focus and surprise in Mandarin Chinese

    F0 variation is a crucial feature in speech prosody, which can convey linguistic information such as focus and paralinguistic meanings such as surprise. How can multiple layers of information be represented with F0 in speech: are they divided into discrete layers of pitch or overlapped without clear divisions? We investigated this question by assessing pitch perception of focus and surprise in Mandarin Chinese. Seventeen native Mandarin listeners rated the strength of focus and surprise conveyed by the same set of synthetically manipulated sentences. An fMRI experiment was conducted to assess neural correlates of the listeners' perceptual response to the stimuli. The results showed that, behaviourally, the perceptual threshold for focus was 3 semitones and that for surprise was 5 semitones above the baseline. Moreover, the pitch range of 5-12 semitones above the baseline signalled both focus and surprise, suggesting a considerable overlap between the two types of prosodic information within this range. The neuroimaging data positively correlated with the variations in behavioural data. Also, a ceiling effect was found, as no significant differences in behaviour or neural activity were shown after a certain pitch level was reached for the perception of focus and surprise respectively. Together, the results suggest that different layers of prosodic information are represented in F0 through different pitch ranges: paralinguistic information is represented at a pitch range beyond that used by linguistic information. Meanwhile, the representation of paralinguistic information is achieved without obscuring linguistic prosody, thus allowing F0 to represent the two layers of information in parallel.