
    Prosodic phrase segmentation by pitch pattern clustering

    This paper proposes a novel method for detecting the optimal sequence of prosodic phrases in continuous speech based on a data-driven approach. The pitch pattern of the input speech is divided into prosodic segments that minimize the overall distortion against pitch pattern templates of accent phrases, using the One Pass search algorithm. The pitch pattern templates are designed by clustering a large number of training samples of accent phrases. On the ATR continuous speech database uttered by 10 speakers, the rate of correct segmentation was at most 91.7% when training and test data came from speakers of the same sex, and 88.6% for the opposite sex.
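The template-matching search described above can be illustrated as a small dynamic program. This is a hedged sketch: the function name, the squared-error distortion, and the assumption that segment lengths equal template lengths are illustrative choices, not the paper's exact formulation.

```python
def segment_by_templates(contour, templates):
    """Dynamic-programming search for the segmentation of a pitch contour
    that minimises total distortion against a set of fixed-length templates.
    Assumes the contour can be tiled by the available template lengths."""
    n = len(contour)
    INF = float("inf")
    best = [INF] * (n + 1)   # best[i] = minimal distortion of contour[:i]
    best[0] = 0.0
    back = [None] * (n + 1)  # back[i] = (segment start, template index)

    for end in range(1, n + 1):
        for t_idx, tpl in enumerate(templates):
            start = end - len(tpl)
            if start < 0 or best[start] == INF:
                continue
            # Squared-error distortion between the segment and the template.
            dist = sum((c - t) ** 2 for c, t in zip(contour[start:end], tpl))
            if best[start] + dist < best[end]:
                best[end] = best[start] + dist
                back[end] = (start, t_idx)

    # Recover segment boundaries by backtracking.
    boundaries, i = [], n
    while i > 0:
        start, _ = back[i]
        boundaries.append(start)
        i = start
    return sorted(boundaries), best[n]
```

For a contour `[1, 1, 2, 2, 1, 1, 2, 2]` and a single template `[1, 1, 2, 2]`, the recovered boundaries are `[0, 4]` with zero distortion.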

    Automatisation of intonation modelling and its linguistic anchoring

    This paper presents a fully machine-driven approach to intonation description and its linguistic interpretation. For this purpose, a new intonation model for bottom-up F0 contour analysis and synthesis is introduced: the CoPaSul model, designed in the tradition of parametric, contour-based, and superpositional approaches. Intonation is represented by a superposition of global and local contour classes derived from F0 parameterisation. These classes were linguistically anchored with respect to information status by aligning them with a text that had been coarsely analysed for this purpose by means of NLP techniques. To test the adequacy of this data-driven interpretation, a perception experiment was carried out, which confirmed 80% of the findings.

    Infants segment words from songs - an EEG study

    Children’s songs are omnipresent and highly attractive stimuli in infants’ input. Previous work suggests that infants process linguistic–phonetic information from simplified sung melodies. The present study investigated whether infants learn words from ecologically valid children’s songs. Testing 40 Dutch-learning 10-month-olds in a familiarization-then-test electroencephalography (EEG) paradigm, this study asked whether infants can segment repeated target words embedded in songs during familiarization and subsequently recognize those words in continuous speech in the test phase. To replicate previous speech work and compare segmentation across modalities, infants participated in both song and speech sessions. Results showed a positive event-related potential (ERP) familiarity effect to the final compared to the first target occurrences during both song and speech familiarization. No evidence was found for word recognition in the test phase following either song or speech. Comparisons across the stimuli of the present and a comparable previous study suggested that acoustic prominence and speech rate may have contributed to the polarity of the ERP familiarity effect and its absence in the test phase. Overall, the present study provides evidence that 10-month-old infants can segment words embedded in songs, and it raises questions about the acoustic and other factors that enable or hinder infant word segmentation from songs and speech.

    Prosody and Kinesics Based Co-analysis Towards Continuous Gesture Recognition

    The aim of this study is to develop a multimodal co-analysis framework for continuous gesture recognition by exploiting the prosodic and kinesic manifestations of natural communication. Using this framework, a co-analysis pattern between correlating components is obtained. The co-analysis pattern is clustered using K-means to determine how well the pattern distinguishes the gestures. Features that differentiate the proposed approach from other models are its lower susceptibility to idiosyncrasies, its scalability, and its simplicity. The experiment was performed on the Multimodal Annotated Gesture Corpus (MAGEC), which we created for research on understanding non-verbal communication, particularly gestures.
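The K-means step mentioned above can be illustrated with a minimal, dependency-free sketch. Initialising centres from the first k points is a simplification for the demo; the actual features, corpus, and k used in the paper are not reproduced here.

```python
def kmeans(points, k, iters=20):
    """Plain K-means: returns a cluster label for each point.
    Centres are initialised from the first k points (a demo shortcut,
    not a production initialisation strategy)."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centre by squared Euclidean distance.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update step: move each centre to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels
```

On two well-separated point clouds, e.g. three points near the origin and three near (10, 10), the labels split cleanly into two clusters.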

    Prosody-Based Automatic Segmentation of Speech into Sentences and Topics

    A crucial step in processing speech audio data for information extraction, topic detection, or browsing/playback is to segment the input into sentence and topic units. Speech segmentation is challenging, since the cues typically present for segmenting text (headers, paragraphs, punctuation) are absent in spoken language. We investigate the use of prosody (information gleaned from the timing and melody of speech) for these tasks. Using decision tree and hidden Markov modeling techniques, we combine prosodic cues with word-based approaches, and evaluate performance on two speech corpora, Broadcast News and Switchboard. Results show that the prosodic model alone performs on par with, or better than, word-based statistical language models -- for both true and automatically recognized words in news speech. The prosodic model achieves comparable performance with significantly less training data, and requires no hand-labeling of prosodic events. Across tasks and corpora, we obtain a significant improvement over word-only models using a probabilistic combination of prosodic and lexical information. Inspection reveals that the prosodic models capture language-independent boundary indicators described in the literature. Finally, cue usage is task and corpus dependent. For example, pause and pitch features are highly informative for segmenting news speech, whereas pause, duration and word-based cues dominate for natural conversation. Comment: 30 pages, 9 figures. To appear in Speech Communication 32(1-2), Special Issue on Accessing Information in Spoken Audio, September 200
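One common way to combine prosodic and lexical boundary models probabilistically is log-linear interpolation of their posteriors. The sketch below is a generic illustration of that idea, not the paper's exact decision-tree/HMM combination; the weight `lam` is a hypothetical parameter that would normally be tuned on held-out data.

```python
import math

def combine(p_prosody, p_lexical, lam=0.5):
    """Log-linear interpolation of per-position boundary posteriors.
    lam weights the prosodic model; 1 - lam weights the lexical model."""
    def mix(p, q):
        # Combine the 'boundary' and 'no boundary' hypotheses, then renormalise.
        yes = math.exp(lam * math.log(p) + (1 - lam) * math.log(q))
        no = math.exp(lam * math.log(1 - p) + (1 - lam) * math.log(1 - q))
        return yes / (yes + no)
    return [mix(p, q) for p, q in zip(p_prosody, p_lexical)]
```

When both models agree, the combined posterior matches them; with `lam=1.0` the prosodic model alone decides, which reflects the paper's finding that it can carry the task by itself.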

    Accent Phrase Segmentation by Finding N-Best Sequences of Pitch Pattern Templates

    This paper describes a prosodic method for segmenting continuous speech into accent phrases. Optimum sequences are obtained on the basis of a least-squared-error criterion, using dynamic time warping between the F0 contours of the input speech and reference accent patterns called 'pitch pattern templates'. However, the optimum sequence does not always agree well with phrase boundaries labeled by hand, while the second or third candidate sequence often does. Therefore, we extend our system to find multiple candidates using an N-best algorithm. Evaluation tests were carried out using the ATR continuous speech database of 10 speakers. The results showed that about 97% of phrase boundaries were correctly detected when the 30 best candidates were considered, an accuracy 7.5% higher than the conventional method without the N-best search algorithm.
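The N-best idea can be sketched by keeping the N lowest-distortion hypotheses at each frame instead of only the best one. This is a hedged toy version: the function name, the squared-error distortion (in place of DTW), and the template set are illustrative assumptions, not the paper's implementation.

```python
import heapq

def nbest_segmentations(contour, templates, n_best=3):
    """Keep the n_best lowest-distortion segmentations ending at each
    contour position; return the hypotheses covering the full contour."""
    n = len(contour)
    # hyps[i] = list of (cost, boundary_list) for segmentations of contour[:i]
    hyps = [[] for _ in range(n + 1)]
    hyps[0] = [(0.0, [])]
    for end in range(1, n + 1):
        cands = []
        for tpl in templates:
            start = end - len(tpl)
            if start < 0:
                continue
            dist = sum((c - t) ** 2 for c, t in zip(contour[start:end], tpl))
            # Extend every surviving hypothesis at 'start' by this segment.
            for cost, bounds in hyps[start]:
                cands.append((cost + dist, bounds + [start]))
        hyps[end] = heapq.nsmallest(n_best, cands, key=lambda x: x[0])
    return hyps[n]
```

For a contour `[1, 2, 1, 2]` with templates `[1, 2]` and `[1, 2, 1, 2]`, two zero-distortion hypotheses survive: one with a boundary after each short template, one treating the whole contour as a single phrase.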

    Robust Estimation of Tone Break Indices from Speech Signal using Multi-Scale Analysis and their Applications

    The aim of this study is to develop a robust algorithm to automatically detect Tone and Break Indices (ToBI) from the speech signal and to explore their applications. iLAST was introduced to analyze the acoustic and prosodic features used to detect the ToBI indices. Both expert and data-driven rules were used to improve robustness. The integration of multi-scale signal analysis with rule-based classification has helped to robustly identify tones that can be used in applications such as identifying the vowel triangle, recognizing emotions from speech, etc. Empirical analyses using a labeled dataset were performed to illustrate the utility of the proposed approach. Further analyses were conducted to identify the inefficiencies of the proposed approach and to address those issues through co-analyses of the prosodic features that contribute most to robust ToBI detection. It was demonstrated that the proposed approach performs robustly and can be used to develop a wide variety of applications.

    Prosodic constraints on statistical strategies in segmenting fluent speech

    Learning a spoken language is, in part, an input-driven process. However, the relevant units of speech, such as words or morphemes, are not clearly marked in the speech input. This thesis explores possible strategies for segmenting fluent speech. Two main strategies are considered. The first involves computing the distributional properties of the input stream. Previous research has established that adults and infants can use the transition probabilities (TPs) between syllables to segment speech. Specifically, researchers have found a preference for syllabic sequences with relatively high average transition probabilities between the constituent syllables. The second strategy relies on the prosodic organization of speech. In particular, larger phrasal constituents of speech are invariably aligned with the boundaries of words. Thus, any sensitivity to the edges of such phrases will serve to place additional constraints on possible words. The main goal of this thesis is to understand how different strategies conspire to provide a rich set of cues for segmenting speech. In particular, we explore how prosodic boundaries influence distributional strategies in segmenting fluent speech. The primary methodology employed is behavioral studies with Italian-speaking adults. In the initial experimental chapters, a novel paradigm is described for studying distributional strategies in segmenting artificial, fluent speech streams. This paradigm uses artificial speech containing syllabic noise, defined as the presence of syllables that are not part of the target nonce words but occur at random at comparable frequencies. It is shown that the presence of syllabic noise does not affect segmentation. This suggests that statistical computations are robust. We find that, although the noise syllables do not affect TP computations, the placement of nonce words with respect to each other does.
In particular, 'words' with a clumped distribution are better segmented than 'words' with even spacing. This suggests that even the process of statistical segmentation itself is constrained. The syllabic-noise paradigm is used to create speech streams as sequences of frames: syllabic sequences of fixed length. 'Words' can be placed at arbitrary positions with respect to these frames; the remaining positions are occupied by noise syllables. By adding the pitch and length characteristics of Intonational Phrases (IPs, which are large phrasal constituents) from the native language, the frames can be turned into prosodic 'phrases'. Thus, nonce words can be placed at different positions with respect to such 'phrases'. It is found that 'words' that straddle such 'phrases' are not preferred over non-words, while 'phrase'-internal 'words' are. Removing the prosodic aspects from the frames abolishes this effect. These initial experiments suggest that prosody carves speech streams into smaller constituents. Presumably, participants infer the edges of these 'phrases' to be edges of words, as in natural speech. It is well known that edge positions are salient. This suggests that 'words' at the edges of the 'phrases' should be better recognized than 'words' in the middle. The subsequent experiments show such an edge effect of prosody. The previous results are ambiguous as to whether prosody blocks the computation of TPs across phrasal boundaries or acts at a later stage to suppress the outcome of TP computations. It is seen that prosody does not block TP computations: under certain conditions, one can find evidence that participants compute TPs for both 'phrase'-medial and 'phrase'-straddling 'words'. These results suggest that prosody acts as a filter against statistically cohesive 'words' that straddle prosodic boundaries. Based on these results, the prosodic filtering model is proposed. Next, we examine the generality of the prosodic filtering effect.
It will be shown that a foreign prosody causes a similar perception of 'phrasal' edges; the edge effect and the filtering effect are both observed even with foreign IPs. Phonologists have proposed that IPs are universally marked by similar acoustic cues. Thus, the results with foreign prosody suggest that these universal cues play a role in the perception of phrases in fluent speech. Such cues include final lengthening and final pitch decline; further experiments show that, at least in the experimental paradigm used in this thesis, pitch decline plays the primary role in the perception of 'phrases'. Finally, we consider the possible bases for the perception of prosodic edges in otherwise fluent speech. It is suggested that this capacity is not purely linguistic but arises from acoustic perception: we will see that time-reversed IPs, which maintain pitch breaks at 'phrasal' boundaries, can still induce the filtering effect. In an annex, the question of how time-reversed (backward) speech is perceived by neonates is addressed. In a brain-imaging (OT) study with neonates, we find evidence that forward speech is processed differently from backward speech, replicating previous results. In conclusion, the task of finding word boundaries in fluent speech is highly constrained. These constraints can be understood as the natural limitations that ensue when multiple cognitive systems interact in solving particular tasks.
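The forward transition probabilities (TPs) the thesis builds on can be computed directly from bigram and unigram counts over the syllable stream. A minimal sketch follows; the syllable inventory in the example is invented for illustration, not taken from the thesis materials.

```python
from collections import Counter

def transition_probabilities(syllables):
    """Forward TP(a -> b) = count(a b) / count(a), computed over a
    syllable stream; low-TP transitions are candidate word boundaries."""
    pair_counts = Counter(zip(syllables, syllables[1:]))
    unit_counts = Counter(syllables[:-1])  # each unit counted as a bigram start
    return {(a, b): c / unit_counts[a] for (a, b), c in pair_counts.items()}
```

For a stream such as "tu pi ro tu pi ro go la bu tu pi ro", within-'word' TPs (e.g. tu→pi) come out higher than TPs spanning 'word' boundaries (e.g. ro→tu), and segmentation hypotheses are placed at those dips.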