    Explaining the PENTA model: a reply to Arvaniti and Ladd

    This paper presents an overview of the Parallel Encoding and Target Approximation (PENTA) model of speech prosody, in response to an extensive critique by Arvaniti & Ladd (2009). PENTA is a framework for conceptually and computationally linking communicative meanings to fine-grained prosodic details, based on an articulatory-functional view of speech. Target Approximation simulates the articulatory realisation of underlying pitch targets – the prosodic primitives in the framework. Parallel Encoding provides an operational scheme that enables simultaneous encoding of multiple communicative functions. We also outline how PENTA can be computationally tested with a set of software tools. With the help of one of the tools, we offer a PENTA-based hypothetical account of the Greek intonational patterns reported by Arvaniti & Ladd, showing how it is possible to predict the prosodic shapes of an utterance based on the lexical and postlexical meanings it conveys.
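
    To make the Target Approximation component concrete, the following is a minimal Python sketch of syllable-synchronised sequential target approximation in the style of the quantitative Target Approximation (qTA) model: each syllable carries a linear pitch target (slope and height) plus an approach strength, f0 approaches the target as a critically damped third-order linear system, and the final f0, velocity and acceleration are handed over as the initial state of the next syllable. The function name, target values and sampling rate are illustrative assumptions, not code taken from the PENTA software tools.

        import numpy as np

        def approach_target(m, b, lam, dur, state, fs=200):
            """Approximate one syllable's linear target T(t) = m*t + b with strength lam,
            starting from state = (f0, velocity, acceleration) in semitones."""
            t = np.arange(0.0, dur, 1.0 / fs)
            f0_0, v0, a0 = state
            # Transient coefficients set by the initial conditions (critical damping)
            c1 = f0_0 - b
            c2 = v0 + c1 * lam - m
            c3 = (a0 + 2.0 * c2 * lam - c1 * lam ** 2) / 2.0
            poly = c1 + c2 * t + c3 * t ** 2
            dpoly = c2 + 2.0 * c3 * t
            e = np.exp(-lam * t)
            f0 = poly * e + m * t + b
            v = (dpoly - lam * poly) * e + m
            a = (2.0 * c3 - 2.0 * lam * dpoly + lam ** 2 * poly) * e
            # Hand the articulatory state at the syllable edge to the next syllable
            return f0, (f0[-1], v[-1], a[-1])

        # Three hypothetical syllable targets: (slope st/s, height st, strength 1/s, duration s)
        targets = [(0.0, 2.0, 30.0, 0.20), (40.0, -4.0, 25.0, 0.18), (0.0, 0.0, 35.0, 0.22)]
        state = (0.0, 0.0, 0.0)                     # start at the speaker's mid pitch, at rest
        segments = []
        for m, b, lam, dur in targets:
            f0, state = approach_target(m, b, lam, dur, state)
            segments.append(f0)
        contour = np.concatenate(segments)          # one continuous f0 contour in semitones
        print(len(contour), round(float(contour[-1]), 2))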

    Toward invariant functional representations of variable surface fundamental frequency contours: Synthesizing speech melody via model-based stochastic learning

    Variability has been one of the major challenges for both theoretical understanding and computer synthesis of speech prosody. In this paper we show that economical representation of variability is the key to effective modeling of prosody. Specifically, we report the development of PENTAtrainer—a trainable yet deterministic prosody synthesizer based on an articulatory–functional view of speech. We show with testing results on Thai, Mandarin and English that it is possible to achieve high-accuracy predictive synthesis of fundamental frequency contours with very small sets of parameters obtained through stochastic learning from real speech data. The first key component of this system is syllable-synchronized sequential target approximation—implemented as the qTA model, which is designed to simulate, for each tonal unit, a wide range of contextual variability with a single invariant target. The second key component is the automatic learning of function-specific targets through stochastic global optimization, guided by a layered pseudo-hierarchical functional annotation scheme, which requires the manual labeling of only the temporal domains of the functional units. The results in terms of synthesis accuracy demonstrate that effective modeling of the contextual variability is also the key to effective modeling of function-related variability. Additionally, we show that, being both theory-based and trainable (hence data-driven), computational systems like PENTAtrainer can serve as an effective modeling tool in basic research, with which the level of falsifiability in theory testing can be raised and a closer link between basic and applied research in speech science can be developed.
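
    The second key component, stochastic learning of function-specific targets, can be illustrated with the sketch below: per-syllable target parameters are searched so that the synthesised contour matches an observed one with minimal RMSE. To keep the example short, the synthesis model is a first-order stand-in for qTA (a static target height approached exponentially, with f0 handed over at syllable boundaries), and the optimiser is a generic simulated-annealing-style random search; the data, parameter ranges and cooling schedule are illustrative assumptions, not the annotation scheme or optimiser of PENTAtrainer itself.

        import numpy as np

        rng = np.random.default_rng(0)
        FS = 100                               # f0 samples per second
        N = int(0.2 * FS)                      # 0.2 s per syllable, 3 syllables

        def synthesize(params, f0_start=0.0):
            """params: one (height_st, strength_per_s) pair per syllable."""
            out, f0 = [], f0_start
            t = np.arange(N) / FS
            for height, strength in params:
                seg = height + (f0 - height) * np.exp(-strength * t)
                out.append(seg)
                f0 = seg[-1]                   # hand f0 over to the next syllable
            return np.concatenate(out)

        # A hypothetical "observed" contour, generated from hidden targets plus noise.
        true_params = np.array([[2.0, 20.0], [-3.0, 25.0], [0.0, 30.0]])
        observed = synthesize(true_params) + rng.normal(0.0, 0.1, 3 * N)

        def rmse(params):
            return float(np.sqrt(np.mean((synthesize(params) - observed) ** 2)))

        # Simulated-annealing-style search over target heights and strengths.
        current = np.column_stack([rng.uniform(-5, 5, 3), rng.uniform(5, 40, 3)])
        current_err = rmse(current)
        best, best_err = current.copy(), current_err
        temp = 1.0
        for _ in range(5000):
            cand = current + rng.normal(0.0, [0.3, 1.5], size=current.shape)
            cand[:, 1] = np.clip(cand[:, 1], 1.0, 60.0)      # keep strength positive
            err = rmse(cand)
            if err < current_err or rng.random() < np.exp((current_err - err) / temp):
                current, current_err = cand, err
                if err < best_err:
                    best, best_err = cand.copy(), err
            temp *= 0.999                                     # cool down

        print("learned (height, strength) per syllable:")
        print(np.round(best, 2), " RMSE:", round(best_err, 3))

    In the full system, targets are tied to functional categories across the whole corpus rather than fitted syllable by syllable as in this toy example.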

    Focus perception in Japanese: Effects of lexical accent and focus location.

    This study explored the contexts in which native Japanese listeners have difficulty identifying prosodic focus. Using a four-alternative forced-choice (4AFC) identification task, we compared native Japanese listeners' focus identification accuracy in different lexical accent × focus location conditions using resynthesised speech stimuli, which varied only in fundamental frequency. Experiment 1 compared the identification accuracy in lexical accent × focus location conditions using both natural and resynthesised stimuli. The results showed that focus identification rates were similar with the two stimulus types, thus establishing the reliability of the resynthesised stimuli. Experiment 2 explored these conditions further using only resynthesised stimuli. Narrow foci bearing the lexical pitch accent were always identified more accurately than unaccented ones, whereas the identification rate for final focus was the lowest among all focus locations. From these results, we argue that the difficulty of focus perception in Japanese is attributable to (i) the blocking of post-focus compression (PFC) by unaccented words, and (ii) the similarity in F0 contours between lexical pitch accent and narrow focus, including in particular the similarity between downstep and PFC. Focus perception is therefore contingent on other concurrent communicative functions, which may sometimes take precedence in a +PFC language.
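
    The scoring behind such an identification experiment reduces to tabulating accuracy per lexical accent × focus location cell. The short sketch below shows one way to do this in Python; the trial records and condition labels are hypothetical placeholders, not the study's data.

        from collections import defaultdict

        # (lexical_accent, focus_location, listener_response, correct_answer)
        trials = [
            ("accented",   "initial", "initial", "initial"),
            ("accented",   "final",   "medial",  "final"),
            ("unaccented", "medial",  "neutral", "medial"),
            ("unaccented", "final",   "final",   "final"),
            # ... one row per trial in a real data set
        ]

        hits, counts = defaultdict(int), defaultdict(int)
        for accent, location, response, answer in trials:
            counts[(accent, location)] += 1
            hits[(accent, location)] += int(response == answer)

        for cond in sorted(counts):
            acc = hits[cond] / counts[cond]
            print(f"{cond[0]:>10} x {cond[1]:<7}: {acc:.0%} ({counts[cond]} trials)")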

    THE ROLE OF “FOCUS OF ATTENTION” ON THE LEARNING OF NON-NATIVE SPEECH SOUNDS: ENGLISH SPEAKERS LEARNING OF MANDARIN CHINESE TONES

    Focus of attention (FOA) has been demonstrated to affect motor learning and performance of many motor skills. FOA refers to the performer’s focus while performing the task. The purpose of this dissertation was to assess the role of FOA in the speech domain. The research asked whether external or internal FOA would individually or differentially facilitate the learning of Mandarin Chinese tones by native English speakers. As a secondary question and experimental control, this study also examined whether the four tones were produced with the same accuracy. Forty-two females between the ages of 18 and 24 were randomly assigned to one of three groups: external FOA (EFOA), internal FOA (IFOA) and control (C). During the acquisition phase, the groups were instructed to focus on either the sound produced (EFOA) or the vibration in the voice box (IFOA), or were given no FOA-related instructions (control). Participants were required to repeat the Mandarin words after an auditory model. To assess learning, the participants repeated the practiced words in a retention test, and repeated similar but unpracticed words during a transfer test. The data were collected in two sessions. The dependent variables were the root mean squared error (acoustic measure) and the percentage of correctly perceived tones (perceptual measure). There was a significant difference among the four Mandarin Chinese tones for the three groups (Tones 1 and 4 were produced with significantly higher accuracy than Tones 2 and 3) before the acquisition phase. There was, however, no significant difference among the three FOA groups on the dependent variables. The results contradict the FOA effects in the literature derived from limb motor learning and oral-nonspeech learning experiments. This study represents the first attempt to test FOA in the speech domain. As such, it is premature to draw firm conclusions about the role of FOA in speech motor learning based on these results. The discussion focuses on factors that might have led to the current results. Because FOA represents a potential factor that might affect speech motor learning, future research is warranted to study the effect of FOA in the speech domain.
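
    The acoustic dependent variable mentioned above, root mean squared error between a produced tone contour and its model, can be computed along the following lines. The semitone conversion, reference frequency and time-normalisation step are illustrative assumptions rather than the dissertation's exact procedure.

        import numpy as np

        def to_semitones(f0_hz, ref_hz=100.0):
            """Convert f0 in Hz to semitones relative to a reference frequency."""
            return 12.0 * np.log2(np.asarray(f0_hz, dtype=float) / ref_hz)

        def contour_rmse(produced_hz, model_hz, n_points=30):
            """Time-normalize both contours to n_points, then compute RMSE in semitones."""
            grid = np.linspace(0.0, 1.0, n_points)
            p = np.interp(grid, np.linspace(0.0, 1.0, len(produced_hz)), to_semitones(produced_hz))
            m = np.interp(grid, np.linspace(0.0, 1.0, len(model_hz)), to_semitones(model_hz))
            return float(np.sqrt(np.mean((p - m) ** 2)))

        # Hypothetical contours: a learner's attempt at a falling tone vs. the auditory model.
        model = np.linspace(220.0, 150.0, 40)                   # smooth fall, in Hz
        produced = np.linspace(215.0, 170.0, 35) + np.random.default_rng(1).normal(0, 3, 35)
        print("RMSE (semitones):", round(contour_rmse(produced, model), 2))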

    Improving the Speech Intelligibility By Cochlear Implant Users

    In this thesis, we focus on improving the intelligibility of speech for cochlear implant (CI) users. As an auditory prosthetic device, a CI can restore hearing sensations for most patients with profound hearing loss in both ears in a quiet background. However, CI users still have serious problems in understanding speech in noisy and reverberant environments. Also, bandwidth limitation, missing temporal fine structures, and reduced spectral resolution due to a limited number of electrodes are other factors that raise the difficulty of hearing in noisy conditions for CI users, regardless of the type of noise. To mitigate these difficulties for CI listeners, we investigate several contributing factors, such as the effects of low harmonics on tone identification in natural and vocoded speech, the contribution of matched envelope dynamic range to binaural benefits, and the contribution of low-frequency harmonics to tone identification in quiet and in a six-talker babble background. These results revealed several promising methods for improving speech intelligibility for CI patients. In addition, we investigate the benefits of voice conversion in improving speech intelligibility for CI users, which was motivated by an earlier study showing that familiarity with a talker’s voice can improve understanding of the conversation. Research has shown that when adults are familiar with someone’s voice, they can more accurately – and even more quickly – process and understand what the person is saying. This theory, identified as the “familiar talker advantage”, was our motivation to examine its effect on CI patients using a voice conversion technique. In the present research, we propose a new method based on multi-channel voice conversion to improve the intelligibility of the transformed speech for CI patients.
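
    The "vocoded speech" mentioned above refers to the kind of channel-vocoder processing commonly used to simulate CI hearing. The sketch below implements a generic noise-excited vocoder (band splitting, envelope extraction, envelope-modulated noise carriers); the band edges, filter orders and cutoff values are illustrative and this is not the specific processing chain used in the thesis.

        import numpy as np
        from scipy.signal import butter, sosfiltfilt

        def noise_vocode(x, fs, band_edges_hz=(100, 400, 1000, 2400, 6000), env_cutoff_hz=50.0):
            rng = np.random.default_rng(0)
            out = np.zeros_like(x, dtype=float)
            env_sos = butter(2, env_cutoff_hz, btype="low", fs=fs, output="sos")
            for lo, hi in zip(band_edges_hz[:-1], band_edges_hz[1:]):
                band_sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
                band = sosfiltfilt(band_sos, x)
                envelope = sosfiltfilt(env_sos, np.abs(band))        # rectify + low-pass
                envelope = np.clip(envelope, 0.0, None)
                carrier = sosfiltfilt(band_sos, rng.standard_normal(len(x)))
                carrier /= np.sqrt(np.mean(carrier ** 2)) + 1e-12    # unit-RMS noise band
                out += envelope * carrier
            return out

        # Hypothetical input: one second of a synthetic vowel-like buzz at 16 kHz.
        fs = 16000
        t = np.arange(fs) / fs
        x = np.sum([np.sin(2 * np.pi * 150 * k * t) / k for k in range(1, 8)], axis=0)
        y = noise_vocode(x, fs)
        print("output RMS:", round(float(np.sqrt(np.mean(y ** 2))), 3))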

    Relations between music and speech from the perspectives of dynamics, timbre and pitch

    Despite the vast amount of scholarly effort to compare music and speech from a wide range of perspectives, some of the most fundamental aspects of music and speech still remain unexplored. This PhD thesis tackles three aspects essential to the understanding of the relations between music and speech: dynamics, timbre and pitch. In terms of dynamics, previous research has used perception experiments where dynamics is represented by acoustic intensity, with little attention to the fact that dynamics is an important mechanism of motor movements in both music performance and speech production. Therefore, the first study of this thesis compared the dynamics of music and speech using production experiments with a focus on motor movements: finger force in affective piano performance was used as an index of music dynamics and articulatory effort in affective Mandarin speech was used as an index of speech dynamics. The results showed both similarities and differences between the two domains. With regard to timbre, there has been a long-held observation that the timbre of musical instruments mimics the human voice, particularly in terms of conveying emotions. However, little research has been done to empirically investigate the emotional connotations of the timbre of isolated sounds of musical instruments in relation to affective human speech. Hence, the second study explored this issue using behavioral and ERP methods. The results largely supported previous observations, although some fundamental differences also existed. In terms of pitch, some studies have mentioned that music could have close relations with speech with regard to pitch prominence and expectation patterns. Nevertheless, the functional differences of pitch in music and speech could also imply that speech does not necessarily follow the same pitch patterns as music in conveying prominence and expectation. So far, there is little empirical evidence to either support or refute the aforementioned observations. Hence the third study examined this issue. The results showed that the differences outweighed the similarities between music and speech in terms of pitch prominence and expectation. In conclusion, from three perspectives essential to music and speech, this thesis has shed new light on the overlapping yet distinct relations between the two domains.

    The dynamics of Japanese prosody

    This dissertation explores aspects of Tokyo Japanese (Japanese henceforth) prosody through acoustic analysis and analysis-by-synthesis. It 1) revisits existing issues in Japanese prosody with the minimal use of abstract notions and 2) tests whether the Parallel Encoding and Target Approximation (Xu, 2005) framework is suitable for Japanese, a pitch accent language. The first part of the dissertation considers the nature of lexical pitch accent by examining factors that affect the surface F0 realisation of an accent peak (Chapter 2) and establishing the articulatory domain that hosts a tonal target in Japanese (Chapter 3). Next, pitch accent interactions with other communicative functions are considered, specifically in terms of focus (Chapter 4) and sentence type (Chapter 5). Hypotheses derived from the acoustic analyses in the previous Chapters are then verified through analysis-by-synthesis with the articulatory synthesisers AMtrainer, PENTAtrainer1, and PENTAtrainer2 (Chapter 6). Chapter 2 provides conclusive evidence that Japanese is a two-tone language, as opposed to one bearing three underlying tones in its phonology, an issue previously unresolved in the existing literature. Proponents of the two-tone hypothesis gather evidence from perception: when stimuli are played in isolation, native listeners can only distinguish two tone levels (High and Low). On the other hand, production evidence robustly reveals three distinct surface F0 levels. Using a series of linear regression analyses, I show that the third tone level can be interpreted as a result of pre-low raising, a common articulatory phenomenon: the F0 of an accent peak is inversely correlated with the F0 of the following low target, the peak being enhanced in preparation for the upcoming L. Interpreting this together with native listeners’ inability to hear three tones when stimuli are played in isolation, as repeatedly reported in previous studies, I establish that Japanese has only H and L in its tonal inventory. Chapter 3 establishes the syllable as the tone-bearing unit in Japanese tonal articulation. Although Japanese is often described as a mora-timed language, it has previously been unclear whether articulatory tonal targets are hosted in a mora or a syllable. When comparing accented words of various syllable structures, I found that the F0 accent peak of CVCV words occurs consistently earlier than that of CVn/CVV words. CVCV words are longer in total duration, so their earlier F0 peak must be the result of a shorter tone-bearing unit (i.e. two consecutive short morae/syllables). CVn/CVV words, on the other hand, have a later F0 peak due to hosting an articulatory target as a long syllable, rather than as two short morae. I further verified the syllable hypothesis using two articulatory synthesisers, PENTAtrainer1 and PENTAtrainer2: the syllable as the tone-bearing unit requires fewer predictors but provides better learning accuracy. Chapter 4 explores focus prosody in declarative sentences. Using a newly collected corpus of 6251 sentences that controls for accent condition, focus condition, sentence type, and sentence length, I challenge the widely held idea that post-focus compression of F0 range is accent-independent. It is currently generally accepted that, regardless of the accent condition of the focused word, the excursion size of the ‘initial rise’ that marks the beginning of the first word after focus is reduced. However, confining the notion of post-focus compression to the initial rise (which usually extends across only two morae) sets Japanese apart from other languages like English or Mandarin, where such compression is robust across the entire post-focus domain. I show that when F0 range is measured across a wider domain, compression is absent. Where post-focus compression is absent, the F0 trajectory appears to be a result of articulatory carryover effects. This is interpreted as a result of weak articulatory strength in the post-focus domain, which explains the difference in F0 trajectories between long and short utterances. Chapter 5 builds on the previous Chapter by also considering focus prosody in yes/no questions. I investigate what marks a yes/no question, and how focus prosody differs in declarative and interrogative utterances. Acoustic analyses show that questions are marked by a final rise, but the exact shape of such a rise depends on the accent condition of the sentence-final word. Compared to declarative sentences, the key differences in yes/no questions include a higher F0 level; the absence of post-focus compression even in contexts where it is otherwise observed in statements; and on-focus F0 raising as the only robust focus marker. These findings indicate that interrogative focus prosody is not an amalgamation of focus markers and question markers, and they bear implications for the representation of Japanese intonation. Chapter 6 verifies the observations established thus far through analysis-by-synthesis. I demonstrate comparative modeling as a means to adjudicate between competing theories using PENTAtrainer2, PENTAtrainer1 and AMtrainer. In terms of local fitting accuracy, AMtrainer yielded synthesis accuracy comparable to the PENTAtrainers. Finally, I further demonstrate the compatibility of PENTA with Japanese prosody by showing highly accurate F0 prediction (when trained with the Chapter 2 production data) and highly satisfactory speaker-dependent synthesis accuracy (when trained with the Chapter 4 and 5 sentential data). Naturalness judgment ratings show that the natural stimuli sound as natural as the synthetic stimuli, though questions generally sound less natural than statements. Reasons for this discrepancy are discussed with reference to the design of the stimuli.
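
    The pre-low raising argument in Chapter 2 rests on a simple statistical relationship: across tokens, the F0 of the accent peak should be inversely related to the F0 of the following low target. A regression of the kind sketched below (with synthetic placeholder values, not the dissertation's measurements) is one way to test it; a reliably negative slope is consistent with the third surface F0 level arising from articulatory enhancement rather than from a third underlying tone.

        import numpy as np
        from scipy.stats import linregress

        rng = np.random.default_rng(2)
        following_L = rng.uniform(140.0, 200.0, 60)                    # Hz of the post-peak L, per token
        peak_H = 320.0 - 0.5 * following_L + rng.normal(0.0, 6.0, 60)  # lower L -> higher peak (pre-low raising)

        fit = linregress(following_L, peak_H)
        print(f"slope = {fit.slope:.2f}, r = {fit.rvalue:.2f}, p = {fit.pvalue:.3g}")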