
    Prosody-Based Automatic Segmentation of Speech into Sentences and Topics

    A crucial step in processing speech audio data for information extraction, topic detection, or browsing/playback is to segment the input into sentence and topic units. Speech segmentation is challenging, since the cues typically present for segmenting text (headers, paragraphs, punctuation) are absent in spoken language. We investigate the use of prosody (information gleaned from the timing and melody of speech) for these tasks. Using decision tree and hidden Markov modeling techniques, we combine prosodic cues with word-based approaches, and evaluate performance on two speech corpora, Broadcast News and Switchboard. Results show that the prosodic model alone performs on par with, or better than, word-based statistical language models, for both true and automatically recognized words in news speech. The prosodic model achieves comparable performance with significantly less training data, and requires no hand-labeling of prosodic events. Across tasks and corpora, we obtain a significant improvement over word-only models using a probabilistic combination of prosodic and lexical information. Inspection reveals that the prosodic models capture language-independent boundary indicators described in the literature. Finally, cue usage is task and corpus dependent. For example, pause and pitch features are highly informative for segmenting news speech, whereas pause, duration and word-based cues dominate for natural conversation.
    Comment: 30 pages, 9 figures. To appear in Speech Communication 32(1-2), Special Issue on Accessing Information in Spoken Audio, September 200
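The abstract mentions a probabilistic combination of prosodic and lexical information without spelling it out. As a rough illustration only, the sketch below linearly interpolates two per-word boundary posteriors. This is an assumed, simplified scheme and not necessarily the authors' method; the function names, the `weight` parameter, the `threshold`, and the example probabilities are all hypothetical.

```python
def combine_posteriors(p_prosodic, p_lexical, weight=0.5):
    """Linearly interpolate two boundary posteriors (assumed scheme).

    weight is the share given to the prosodic model; 1 - weight goes
    to the lexical (word-based) model.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * p_prosodic + (1.0 - weight) * p_lexical


def label_boundaries(prosodic, lexical, weight=0.5, threshold=0.5):
    """Mark a boundary after each word whose combined posterior
    reaches the (hypothetical) decision threshold."""
    return [
        combine_posteriors(p, q, weight) >= threshold
        for p, q in zip(prosodic, lexical)
    ]


# Hypothetical posteriors for three word positions.
prosodic_scores = [0.9, 0.2, 0.6]
lexical_scores = [0.7, 0.1, 0.3]
print(label_boundaries(prosodic_scores, lexical_scores))
```

In practice the interpolation weight would be tuned on held-out data; the fixed value here only keeps the sketch short.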

    Listeners use intonational phrase boundaries to project turn ends in spoken interaction

    In conversation, turn transitions between speakers often occur smoothly, usually within a time window of a few hundred milliseconds. It has been argued, on the basis of a button-press experiment [De Ruiter, J. P., Mitterer, H., & Enfield, N. J. (2006). Projecting the end of a speaker's turn: A cognitive cornerstone of conversation. Language, 82(3):515–535], that participants in conversation rely mainly on lexico-syntactic information when timing and producing their turns, and that they do not need to make use of intonational cues to achieve smooth transitions and avoid overlaps. In contrast to this view, but in line with previous observational studies, our results from a dialogue task and a button-press task involving questions and answers indicate that the identification of the end of intonational phrases is necessary for smooth turn-taking. In both tasks, participants never responded to questions (i.e., gave an answer or pressed a button to indicate a turn end) at turn-internal points of syntactic completion in the absence of an intonational phrase boundary. Moreover, in the button-press task, they often pressed the button at the same point of syntactic completion when the final word of an intonational phrase was cross-spliced at that location. Furthermore, truncated stimuli ending in a syntactic completion point but lacking an intonational phrase boundary led to significantly delayed button presses. In light of these results, we argue that earlier claims that intonation is not necessary for correct turn-end projection are misguided, and that research on turn-taking should continue to consider intonation as a source of turn-end cues along with other linguistic and communicative phenomena.

    An exploration of the rhythm of Malay

    In recent years there has been a surge of interest in speech rhythm. However, we still lack a clear understanding of the nature of rhythm and of rhythmic differences across languages. Various metrics have been proposed as means of measuring rhythm at the phonetic level and making typological comparisons between languages (Ramus et al., 1999; Grabe & Low, 2002; Dellwo, 2006), but debate continues over the extent to which these metrics capture the rhythmic basis of speech (Arvaniti, 2009; Fletcher, in press). Furthermore, cross-linguistic studies of rhythm have covered a relatively small number of languages, and research on previously unclassified languages is necessary to fully develop the typology of rhythm. This study examines the rhythmic features of Malay, for which, to date, relatively little work has been carried out on aspects of rhythm and timing. The material for the analysis comprised 10 sentences produced by 20 speakers of standard Malay (10 males and 10 females). The recordings were first analysed using the rhythm metrics proposed by Ramus et al. (1999) and Grabe & Low (2002). These metrics (∆C, %V, rPVI, nPVI) are based on durational measurements of vocalic and consonantal intervals. The results indicated that Malay clustered with other so-called syllable-timed languages like French and Spanish on the basis of all metrics. However, underlying the overall findings for these metrics there was a large degree of variability in values across speakers and sentences, with some speakers having values in the range typical of stress-timed languages like English.
    Further analysis has been carried out in light of Fletcher’s (in press) argument that measurements based on duration do not wholly reflect speech rhythm, as many other factors can influence the values of consonantal and vocalic intervals, and Arvaniti’s (2009) suggestion that other features of speech should also be considered in the description of rhythm to discover what contributes to listeners’ perception of regularity. Spectrographic analysis of the Malay recordings brought to light two parameters that displayed consistency and regularity for all speakers and sentences: the duration of individual vowels and the duration of intervals between intensity minima. This poster presents the results of these investigations and points to connections between the features that seem to be consistently regulated in the timing of Malay connected speech and aspects of Malay phonology. The results are discussed in light of the current debate on descriptions of rhythm.
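The four duration-based metrics named in this abstract can be sketched as follows, using the definitions commonly given for them. Treat this as a minimal sketch under stated assumptions: the interval durations are invented for illustration, the function names are hypothetical, and the use of the population standard deviation for ∆C is an assumption rather than something the abstract specifies.

```python
from statistics import mean, pstdev


def percent_v(vocalic, consonantal):
    """%V: share of total duration taken up by vocalic intervals."""
    total = sum(vocalic) + sum(consonantal)
    return 100.0 * sum(vocalic) / total


def delta_c(consonantal):
    """Delta-C: standard deviation of consonantal interval durations
    (population form assumed here)."""
    return pstdev(consonantal)


def rpvi(intervals):
    """Raw PVI: mean absolute difference between successive intervals."""
    diffs = [abs(a - b) for a, b in zip(intervals, intervals[1:])]
    return mean(diffs)


def npvi(intervals):
    """Normalised PVI: each successive difference is divided by the
    mean of the pair, then averaged and scaled by 100."""
    terms = [
        abs(a - b) / ((a + b) / 2.0)
        for a, b in zip(intervals, intervals[1:])
    ]
    return 100.0 * mean(terms)


# Hypothetical interval durations in seconds for one utterance.
vocalic = [0.08, 0.12, 0.10, 0.06]
consonantal = [0.07, 0.09, 0.05, 0.11]

print(percent_v(vocalic, consonantal))
print(delta_c(consonantal))
print(rpvi(consonantal))   # rPVI is usually applied to consonantal intervals
print(npvi(vocalic))       # nPVI is usually applied to vocalic intervals
```

The normalisation in nPVI is intended to reduce the effect of speech rate, which is why it is typically preferred for vocalic intervals.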

    Creak as a feature of lexical stress in Estonian

    Peer reviewed

    Suprasegmental transcription

    No abstract

    Dialectal phonology constrains the phonetics of prominence

    Accentual prominence has well-documented effects on various phonetic properties, including timing, vowel quality, amplitude, and pitch. These cues can exist in trading relationships and can differ in magnitude in different languages. Less is understood about how phonetic cues to accentuation surface under different phonological constraints, such as those posed by segmental phonology, aspects of the prosodic hierarchy, and intonational phonology. Dialectal comparisons offer a valuable window on these issues, because dialects of a language share basic aspects of structure and function, but can differ in key segmental and suprasegmental constraints which may affect the cues that realise accentual prominence. We compared the realisation of trochaic words (e.g. cheesy, picky) in accented/unaccented and phrase-final/non-final positions in two dialects of British English: Standard Southern British English (SSBE) and Standard Scottish English as spoken in Glasgow. We found generally shallower prominence gradients for Glasgow than SSBE with respect to intensity and duration, and very little evidence of accentual lengthening of vowels in Glasgow, compared to robust effects in SSBE. In contrast, phrase-finality had similar effects across the two dialects. The differences observed illustrate how the expression of accentual prominence reflects and reveals the different segmental and intonational systems that operate within dialects of the same language.

    A cross-linguistic analysis of the temporal dynamics of turn-taking cues using machine learning as a descriptive tool

    In dialogue, speakers produce and perceive acoustic/prosodic turn-taking cues, which are fundamental for negotiating turn exchanges with their interlocutors. However, little is known about the temporal dynamics and cross-linguistic validity of these cues. In this work, we explore a set of acoustic/prosodic cues preceding three turn-transition types (hold, switch and backchannel) in three different languages (Slovak, American English and Argentine Spanish). For this, we use and refine a set of machine learning techniques that enable a finer-grained temporal analysis of such cues, as well as a comparison of their relative explanatory power. Our results suggest that the three languages, despite belonging to distinct linguistic families, share the general usage of a handful of acoustic/prosodic features to signal turn transitions. We conclude that exploiting features such as speech rate, final-word lengthening, the pitch track over the final 200 ms, the intensity track over the final 1000 ms, and noise-to-harmonics ratio (a voice-quality feature) might prove useful for further improving the accuracy of the turn-taking modules found in modern spoken dialogue systems.
    Affiliation: Brusco, Pablo. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Departamento de Computación; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Ciudad Universitaria. Instituto de Investigación en Ciencias de la Computación. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Instituto de Investigación en Ciencias de la Computación; Argentina
    Affiliation: Vidal, Jazmín. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Departamento de Computación; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Ciudad Universitaria. Instituto de Investigación en Ciencias de la Computación. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Instituto de Investigación en Ciencias de la Computación; Argentina
    Affiliation: Beňuš, Štefan. University in Nitra; Slovakia. Slovak Academy of Sciences; Slovakia
    Affiliation: Gravano, Agustin. Consejo Nacional de Investigaciones Científicas y Técnicas. Oficina de Coordinación Administrativa Ciudad Universitaria. Instituto de Investigación en Ciencias de la Computación. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Instituto de Investigación en Ciencias de la Computación; Argentina. Universidad de Buenos Aires. Facultad de Ciencias Exactas y Naturales. Departamento de Computación; Argentina

    Pitch and the projection of more talk

    This study investigates prototypically ‘turn-final’ pitch features (fall-to-low) at points of possible turn-completion where the same speaker continues. It is shown that points of possible turn-completion accompanied by fall-to-low and followed by same-speaker continuation only rarely engender incoming talk. It is also shown that such points are frequently accompanied by non-pitch talk-projecting phonetic features, and that the presence of these features may constrain the nature of any incoming talk. The results of the study should serve as a caution to researchers against over-emphasising intonation when describing and analysing talk-in-interaction. Data are from audio recordings of American English telephone calls.

    Hesitations in Spoken Dialogue Systems

    Betz S. Hesitations in Spoken Dialogue Systems. Bielefeld: Universität Bielefeld; 2020.