271 research outputs found

    The Use of F0 Reliability Function for Prosodic Command Analysis on F0 Contour Generation Model

    This paper describes a method of utilizing an "F0 Reliability Field" (FRF), which we proposed in our previous work, for estimating prosodic commands in an F0 contour generation model. The FRF is a time-frequency representation of F0 likelihood, and its advantage is that it is not necessary to consider the F0 errors that occur during automatic F0 determination. The FRF is therefore thought to be a more useful feature for automatic prosody analysis than the F0 contour, and our previous paper reported the validity of FRF for detecting prosodic boundaries in Japanese continuous speech. In this paper, we examine its validity for prosodic command estimation with the superpositional model. Experimental results show that the accuracy of command estimation with FRF is good and close to the accuracy of command estimation with an ideal F0 contour that has no F0 errors

    Automatic Prosodic Segmentation by F0 Clustering Using Superpositional Modeling.

    In this paper, we propose an automatic method for detecting accent phrase boundaries in Japanese continuous speech by using F0 information. In the training phase, hand labeled accent patterns are parameterized according to a superpositional model proposed by Fujisaki, and assigned to some clusters by a clustering method, in which accent templates are calculated as centroid of each cluster. In the segmentation phase, automatic N-best extraction of boundaries is performed by One-Stage DP matching between the reference templates and the target F0 contour. About 90% of accent phrase boundaries were correctly detected in speaker independent experiments with the ATR Japanese continuous speech database
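
As background for the superpositional model named in the abstract above, the following is a minimal sketch of the commonly cited form of the Fujisaki model, in which log F0 is a baseline plus phrase and accent components. It is not the authors' implementation: the function names, the command tuples, and the default constants (`alpha=3.0`, `beta=20.0`, `gamma=0.9`) are illustrative assumptions, not values taken from the paper.

```python
import math

def phrase_response(t, alpha=3.0):
    """Impulse response of the phrase control mechanism (zero before onset)."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0

def accent_response(t, beta=20.0, gamma=0.9):
    """Step response of the accent control mechanism, capped at gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)

def fujisaki_f0(t, fb, phrases, accents, alpha=3.0, beta=20.0, gamma=0.9):
    """F0 in Hz at time t (seconds).

    fb      : baseline frequency in Hz
    phrases : list of (magnitude, onset_time) phrase commands (assumed layout)
    accents : list of (amplitude, onset_time, offset_time) accent commands
    """
    log_f0 = math.log(fb)
    # Phrase components: one impulse response per phrase command.
    log_f0 += sum(ap * phrase_response(t - t0, alpha) for ap, t0 in phrases)
    # Accent components: difference of two step responses per accent command.
    log_f0 += sum(
        aa * (accent_response(t - t1, beta, gamma)
              - accent_response(t - t2, beta, gamma))
        for aa, t1, t2 in accents
    )
    return math.exp(log_f0)

if __name__ == "__main__":
    phrases = [(0.5, 0.0)]        # hypothetical phrase command at t = 0 s
    accents = [(0.4, 0.2, 0.5)]   # hypothetical accent command from 0.2 s to 0.5 s
    for t in (0.0, 0.2, 0.4, 0.8):
        print(f"t={t:.1f}s  F0={fujisaki_f0(t, 100.0, phrases, accents):.1f} Hz")
```

Parameterizing a hand-labeled accent pattern, as the abstract describes, would amount to finding the command values that make such a curve fit the observed contour; the fitting step itself is not shown here.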

    How tone, intonation and emotion shape the development of infants' fundamental frequency perception

    Fundamental frequency (ƒ0), perceived as pitch, is the first and arguably most salient auditory component humans are exposed to since the beginning of life. It carries multiple linguistic (e.g., word meaning) and paralinguistic (e.g., speakers’ emotion) functions in speech and communication. The mappings between these functions and ƒ0 features vary within a language and differ cross-linguistically. For instance, a rising pitch can be perceived as a question in English but a lexical tone in Mandarin. Such variations mean that infants must learn the specific mappings based on their respective linguistic and social environments. To date, canonical theoretical frameworks and most empirical studies do not view or consider the multi-functionality of ƒ0, but typically focus on individual functions. More importantly, despite the eventual mastery of ƒ0 in communication, it is unclear how infants learn to decompose and recognize these overlapping functions carried by ƒ0. In this paper, we review the symbioses and synergies of the lexical, intonational, and emotional functions that can be carried by ƒ0 and are being acquired throughout infancy. On the basis of our review, we put forward the Learnability Hypothesis that infants decompose and acquire multiple ƒ0 functions through native/environmental experiences. Under this hypothesis, we propose representative cases such as the synergy scenario, where infants use visual cues to disambiguate and decompose the different ƒ0 functions. Further, viable ways to test the scenarios derived from this hypothesis are suggested across auditory and visual modalities. Discovering how infants learn to master the diverse functions carried by ƒ0 can increase our understanding of linguistic systems, auditory processing and communication functions

    Does speech prosody matter in health communication? Evidence from native and non-native English speaking medical students in a simulated clinical interaction

    The impact of the UK’s multilingual and multicultural society today can be seen in its healthcare services and has contributed towards shaping communication skills training as a core part of the UK undergraduate medical curriculum. NHS complaints statistics involving perceived staff attitudes have remained high, despite extensive communication skills training. Furthermore, foreign doctors have received a higher proportion of complaints than UK doctors. Finally, how linguistic and social factors shape the conveyance and perception of attitudes related to professionalism in medical communication remains poorly understood. The ultimate aim of this study was to ascertain if speech prosody contributes to the perception of professionalism in medical communication. Research questions on the role of speech prosody in conveying professional attitudes in medical communication, the prosodic differences between native and non-native English speaking medical students in a simulated clinical interaction, and the influence of prosodic features on listeners’ perceptions of professional attitudes were addressed. A set of acoustic parameters representing the speech prosody of native and non-native medical students in the simulated clinical setting was analysed. A perceptual experiment was then carried out to investigate the factors affecting perceived professionalism in extracts of the analysed simulated clinical interaction. The examined acoustic parameters were found to be sensitive to the English language background and the task within the simulated consultation. Interestingly, the attitudinal information associated with some of these acoustic parameters was perceived by listeners and was reflected by higher professional scale scores in the perceptual experiment, even after adjusting for the English language background. The factors of training level and consultation task also emerged as affecting professional scale scores.
Initial findings have confirmed that speech prosody plays a role in terms of contributing towards the perception of professionalism in medical communication. Incorporating how messages are delivered to patients into current models of communication skills training may have positive outcomes

    Modelling prosodic and dialogue information for automatic speech recognition


    A Study of Accomodation of Prosodic and Temporal Features in Spoken Dialogues in View of Speech Technology Applications

    Inter-speaker accommodation is a well-known property of human speech and human interaction in general. Broadly it refers to the behavioural patterns of two (or more) interactants and the effect of the (verbal and non-verbal) behaviour of each on that of the other(s). Implementation of this behavior in spoken dialogue systems is desirable as an improvement on the naturalness of human-machine interaction. However, traditional qualitative descriptions of accommodation phenomena do not provide sufficient information for such an implementation. Therefore, a quantitative description of inter-speaker accommodation is required. This thesis proposes a methodology of monitoring accommodation during a human or human-computer dialogue, which utilizes a moving average filter over sequential frames for each speaker. These frames are time-aligned across the speakers, hence the name Time Aligned Moving Average (TAMA). Analysis of spontaneous human dialogue recordings by means of the TAMA methodology reveals ubiquitous accommodation of prosodic features (pitch, intensity and speech rate) across interlocutors, and allows for statistical (time series) modeling of the behaviour, in a way which is meaningful for implementation in spoken dialogue system (SDS) environments. In addition, a novel dialogue representation is proposed that provides an additional point of view to that of TAMA in monitoring accommodation of temporal features (inter-speaker pause length and overlap frequency). This representation is a percentage turn distribution of individual speaker contributions in a dialogue frame which circumvents strict attribution of speaker-turns, by considering both interlocutors as synchronously active. Both TAMA and turn distribution metrics indicate that correlation of average pause length and overlap frequency between speakers can be attributed to accommodation (a debated issue), and point to possible improvements in SDS “turn-taking” behaviour.
Although the findings of the prosodic and temporal analyses can directly inform SDS implementations, further work is required in order to describe inter-speaker accommodation sufficiently, as well as to develop an adequate testing platform for evaluating the magnitude of perceived improvement in human-machine interaction. Therefore, this thesis constitutes a first step towards a convincingly useful implementation of accommodation in spoken dialogue systems
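
The time-aligned moving average described in the abstract above can be illustrated with a short sketch. This is only one plausible reading of the idea, not the thesis's actual procedure: the function name `tama`, the `(time, value)` sample layout, the frame length, and the step size are all assumptions made for illustration.

```python
from statistics import mean

def tama(samples, frame_len, step, t_end):
    """Average a speaker's feature values over overlapping frames.

    samples   : list of (time_s, value) pairs, e.g. pitch measurements
    frame_len : frame duration in seconds (assumed value, not from the thesis)
    step      : hop between frame starts in seconds (assumed value)
    t_end     : end of the dialogue in seconds
    Returns one mean per frame, or None for frames with no samples.
    """
    out = []
    i = 0
    while i * step < t_end:
        start = i * step
        vals = [v for t, v in samples if start <= t < start + frame_len]
        out.append(mean(vals) if vals else None)
        i += 1
    return out

if __name__ == "__main__":
    # Hypothetical pitch samples (Hz) for two speakers in the same dialogue.
    speaker_a = [(0.5, 100.0), (1.5, 110.0), (2.5, 120.0), (3.5, 130.0)]
    speaker_b = [(0.7, 200.0), (2.2, 210.0), (3.1, 220.0)]
    # Both series use the same frame grid, so frame k covers the same
    # time span for each speaker and the two series can be compared directly.
    a = tama(speaker_a, frame_len=2.0, step=1.0, t_end=4.0)
    b = tama(speaker_b, frame_len=2.0, step=1.0, t_end=4.0)
    for k, (va, vb) in enumerate(zip(a, b)):
        print(k, va, vb)
```

Because both speakers share one frame grid, the two output series line up frame by frame, which is what makes later time series comparison between interlocutors possible.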

    Tone Sandhi Phenomena In Taiwan Southern Min

    This dissertation investigates various aspects of the tone sandhi phenomena in Taiwan Southern Min (TSM). Previous studies have reported complete tonal neutralization between the two sandhi 33 variants derived respectively from citation 55 and 24 variants, leading to the claim that tone sandhi in this language is categorical. The fact that tone sandhi in TSM is assumed to possess a mixture of properties of lexical and postlexical rules gives rise to the debate over the status of this phonological rule. The findings of the dissertation show incomplete neutralization between the two sandhi 33 variants with an indication of an ongoing sound change towards a near- or complete tonal merger, possibly led by female speakers. In addition, citation form is proposed to be more underlyingly represented on account of the fact that subjects, especially old speakers, have stronger association with citation variants than with sandhi variants in the priming experiment. The spontaneous corpus study suggests that the Tone Circle is merely a phonological idealization in light of the systematic subphonemic difference in f0 between citation X and sandhi X that are supposed to correspond even with some control of conceivable confounding factors. By comparing direct- and indirect-reference models, I argue that tone sandhi in TSM should be analyzed as a head-left Concatenation rule within a DM-based theoretical framework

    Toward invariant functional representations of variable surface fundamental frequency contours: Synthesizing speech melody via model-based stochastic learning

    Variability has been one of the major challenges for both theoretical understanding and computer synthesis of speech prosody. In this paper we show that economical representation of variability is the key to effective modeling of prosody. Specifically, we report the development of PENTAtrainer—A trainable yet deterministic prosody synthesizer based on an articulatory–functional view of speech. We show with testing results on Thai, Mandarin and English that it is possible to achieve high-accuracy predictive synthesis of fundamental frequency contours with very small sets of parameters obtained through stochastic learning from real speech data. The first key component of this system is syllable-synchronized sequential target approximation—implemented as the qTA model, which is designed to simulate, for each tonal unit, a wide range of contextual variability with a single invariant target. The second key component is the automatic learning of function-specific targets through stochastic global optimization, guided by a layered pseudo-hierarchical functional annotation scheme, which requires the manual labeling of only the temporal domains of the functional units. The results in terms of synthesis accuracy demonstrate that effective modeling of the contextual variability is the key also to effective modeling of function-related variability. Additionally, we show that, being both theory-based and trainable (hence data-driven), computational systems like PENTAtrainer can serve as an effective modeling tool in basic research, with which the level of falsifiability in theory testing can be raised, and also a closer link between basic and applied research in speech science can be developed