78 research outputs found

    Sentence boundary detection in Chinese broadcast news using conditional random fields and prosodic features

    In this paper, we explore the use of prosodic features in sentence boundary detection in Chinese broadcast news. The prosodic features include speaker turn, music, pause duration, pitch, energy and speaking rate. Specifically, considering the Chinese tonal effects in pitch trajectory, we propose to use tone-normalized pitch features. Experiments using decision trees demonstrate that the tone-normalized pitch features show superior performance in sentence boundary detection in Chinese broadcast news. Furthermore, feature combination achieves a clear performance improvement through intuitive feature-interaction rules formed in the decision tree. Pause duration and a tone-normalized pitch feature account for most of the feature usage in the best-performing decision tree. Index Terms: sentence boundary detection, sentence segmentation, speech prosody, rich transcription

    Spoken content retrieval: A survey of techniques and technologies

    Speech media, that is, digital audio and video containing spoken content, has blossomed in recent years. Large collections are accruing on the Internet as well as in private and enterprise settings. This growth has motivated extensive research on techniques and technologies that facilitate reliable indexing and retrieval. Spoken content retrieval (SCR) requires the combination of audio and speech processing technologies with methods from information retrieval (IR). SCR research initially investigated planned speech structured in document-like units, but has subsequently shifted focus to more informal spoken content produced spontaneously, outside of the studio and in conversational settings. This survey provides an overview of the field of SCR encompassing component technologies, the relationship of SCR to text IR and automatic speech recognition, and user interaction issues. It is aimed at researchers with backgrounds in speech technology or IR who are seeking deeper insight on how these fields are integrated to support research and development, thus addressing the core challenges of SCR.

    A Comparison of Cue-Weighting in the Perception of Prosodic Phrase Boundaries in English and Chinese.

    Prosodic phrasing plays an important role in language comprehension and processing. The present study investigates the acoustic correlates used in the production and perception of prosodic phrase boundaries. Specifically, it examines how the perceptual weighting of the cues that mark prosodic phrase boundaries differs between two languages, English and Chinese, with a focus on the difference in the perceptual reliance on pitch information by speakers of languages with and without lexical tone. A production study examined the realization of pause duration, pre-boundary lengthening, and F0 change in syntactically ambiguous utterance pairs contrasting in the presence and absence of prosodic boundaries (e.g. coffee, cake vs. coffee cake) in English and Chinese. Results showed that speakers of both languages utilized durational (pause and pre-boundary lengthening) and pitch cues to signal phrase boundaries. Speakers of these languages differed, however, in the type of pitch information they employed for boundary categories: in English, F0 slope (representing dynamics of the pitch contour) was found to be an effective predictor, whereas in Chinese, pitch information was conveyed by a reset of the pitch declination. A perception study investigated the relative weighting assigned by native English and Chinese speakers to these temporal and spectral properties in prosodic boundary perception. Responses to an identification task showed that both English and Chinese listeners use pause, pre-boundary lengthening, and pitch in perceiving prosodic boundaries in their native language. However, the two groups of listeners weight these cues differently, with English listeners attending more to pause than the other two cues, while Chinese listeners weight pitch reset most heavily. These differences in perceptual weighting indicate an effect of language experience on the relative importance of perceptual cues.
Language experience modulates the listener’s attention to cues that are particularly relevant in the native language. Native speakers of a tone language attend to pitch information more than do native speakers of a non-tonal language because of the phonemic status of pitch in their native language.
PhD, Linguistics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/96107/1/zxt_1.pd

    Perceptual learning of context-sensitive phonetic detail

    [Abstract abbreviated due to inability of DSpace@Cambridge to display phonetic symbols. Please see the full abstract in the attached PDF file.] Although familiarity with a talker or accent is known to facilitate perception, it is not clear what underlies this phenomenon. Previous research has focused primarily on whether listeners can learn to associate novel phonetic characteristics with low-level units such as features or phonemes. However, this neglects the potential role of phonetic information at many other levels of representation. To address this shortcoming, this thesis investigated perceptual learning of systematic phonetic detail relating to higher levels of linguistic structure, including prosodic, grammatical and morphological contexts. Furthermore, in contrast to many previous studies, this research used relatively natural stimuli and tasks, thus maximising its relevance to perceptual learning in ordinary listening situations. This research shows that listeners can update their phonetic representations in response to incoming information and its relation to linguistic-structural context. In addition, certain patterns of systematic phonetic detail were more learnable than others. These findings are used to inform an account of how new information is integrated with prior experience in speech processing, within a framework that emphasises the importance of phonetic detail at multiple levels of representation. This work was funded by an AHRC grant.

    Proceedings of the VIIth GSCP International Conference

    The 7th International Conference of the Gruppo di Studi sulla Comunicazione Parlata, dedicated to the memory of Claire Blanche-Benveniste, chose as its main theme Speech and Corpora. The wide international origin of the 235 authors from 21 countries and 95 institutions led to papers on many different languages. The 89 papers of this volume reflect the themes of the conference: spoken corpora compilation and annotation, together with the related technological fields; the relation between prosody and pragmatics; speech pathologies; and various papers on phonetics, speech and linguistic analysis, pragmatics and sociolinguistics. Many papers are also dedicated to speech and second language studies. The online publication with FUP allows direct access to sound and video linked to papers (when downloaded).

    Speech Recognition

    Chapters in the first part of the book cover all the essential speech processing techniques for building robust, automatic speech recognition systems: the representation for speech signals and the methods for speech-features extraction, acoustic and language modeling, efficient algorithms for searching the hypothesis space, and multimodal approaches to speech recognition. The last part of the book is devoted to other speech processing applications that can use the information from automatic speech recognition for speaker identification and tracking, for prosody modeling in emotion-detection systems, and in other speech processing applications that are able to operate in real-world environments, like mobile communication services and smart homes.

    Film Dialogue Translation And The Intonation Unit: Towards Equivalent Effect In English And Chinese

    This thesis proposes a new approach to film dialogue translation (FDT) with special reference to the translation process and quality of English-to-Chinese dubbing. In response to the persistent translation failures that led to widespread criticism of dubbed films and TV plays in China for their artificial 'translation talk', this study provides a pragmatic methodology derived from the integration of the theories and analytical systems of information flow in the tradition of the functionalist approach to speech and writing with the relevant theoretical and empirical findings from TS and other related branches of linguistics. It has developed and validated a translation model (FITNIATS) which makes the intonation unit (IU) the central unit of film dialogue translation. Arguing that any translation which treats dubbing as a simple script-to-script process, without transferring the prosodic properties of the spoken words into the commensurate functions of TL, is incomplete, the thesis demonstrates that, in order to reduce confusion and loss of meaning/rhythm, the SL dialogue should be rendered in the IUs with the stressed syllables well-timed in TL to keep the corresponding information foci in sync with the visual message. It shows that adhering to the sentence-to-sentence formula as the translation metastrategy with the information structure of the original film dialogue permuted can result in serious stylistic as well as communicative problems. Five key theoretical issues in TS are addressed in the context of FDT, viz., the relations between micro-structure and macro-structure translation perspectives, foreignizing vs. domesticating translation, the unit of translation, the levels of translation equivalence and the criteria for evaluating translation quality. If equivalent effect is to be achieved in all relevant dimensions, it is argued that 'FITness criteria' need to be met in film translation assessment, and four such criteria are proposed.
This study demonstrates that prosody and word order, as sensitive indices of the information flow which occurs in film dialogue through the creation and perception of meaning, can provide a basis for minimizing cross-linguistic discrepancies and compensating for loss of the FIT functions, especially where conflicts arise between the syntactic and/or medium constraints and the adequate transfer of cultural-specific content and style. The implications of the model for subtitling are also made explicit.

    Toward summarization of communicative activities in spoken conversation

    This thesis is an inquiry into the nature and structure of face-to-face conversation, with a special focus on group meetings in the workplace. I argue that conversations are composed of episodes, each of which corresponds to an identifiable communicative activity such as giving instructions or telling a story. These activities are important because they are part of participants’ commonsense understanding of what happens in a conversation. They appear in natural summaries of conversations such as meeting minutes, and participants talk about them within the conversation itself. Episodic communicative activities therefore represent an essential component of practical, commonsense descriptions of conversations. The thesis objective is to provide a deeper understanding of how such activities may be recognized and differentiated from one another, and to develop a computational method for doing so automatically. The experiments are thus intended as initial steps toward future applications that will require analysis of such activities, such as an automatic minute-taker for workplace meetings, a browser for broadcast news archives, or an automatic decision mapper for planning interactions. My main theoretical contribution is to propose a novel analytical framework called participant relational analysis. The proposal argues that communicative activities are principally indicated through participant-relational features, i.e., expressions of relationships between participants and the dialogue. Participant-relational features, such as subjective language, verbal reference to the participants, and the distribution of speech activity amongst the participants, are therefore argued to be a principal means for analyzing the nature and structure of communicative activities. I then apply the proposed framework to two computational problems: automatic discourse segmentation and automatic discourse segment labeling. 
The first set of experiments tests whether participant-relational features can serve as a basis for automatically segmenting conversations into discourse segments, e.g., activity episodes. Results show that they are effective across different levels of segmentation and different corpora, and indeed sometimes more effective than the commonly used method of using semantic links between content words, i.e., lexical cohesion. They also show that feature performance is highly dependent on segment type, suggesting that human-annotated “topic segments” are in fact a multi-dimensional, heterogeneous collection of topic and activity-oriented units. Analysis of commonly used evaluation measures, performed in conjunction with the segmentation experiments, reveals that they fail to penalize substantially defective results due to inherent biases in the measures. I therefore preface the experiments with a comprehensive analysis of these biases and a proposal for a novel evaluation measure. A reevaluation of state-of-the-art segmentation algorithms using the novel measure produces substantially different results from previous studies. This raises serious questions about the effectiveness of some state-of-the-art algorithms and helps to identify the most appropriate ones to employ in the subsequent experiments. I also preface the experiments with an investigation of participant reference, an important type of participant-relational feature. I propose an annotation scheme with novel distinctions for vagueness, discourse function, and addressing-based referent inclusion, each of which is assessed for inter-coder reliability. The produced dataset includes annotations of 11,000 occasions of person-referring. The second set of experiments concerns the use of participant-relational features to automatically identify labels for discourse segments.
In contrast to assigning semantic topic labels, such as topical headlines, the proposed algorithm automatically labels segments according to activity type, e.g., presentation, discussion, and evaluation. The method is unsupervised and does not learn from annotated ground truth labels. Rather, it induces the labels through correlations between discourse segment boundaries and the occurrence of bracketing meta-discourse, i.e., occasions when the participants talk explicitly about what has just occurred or what is about to occur. Results show that bracketing meta-discourse is an effective basis for identifying some labels automatically, but that its use is limited if global correlations to segment features are not employed. This thesis addresses important pre-requisites to the automatic summarization of conversation. What I provide is a novel activity-oriented perspective on how summarization should be approached, and a novel participant-relational approach to conversational analysis. The experimental results show that analysis of participant-relational features is

    Automatic recognition of multiparty human interactions using dynamic Bayesian networks

    Relating statistical machine learning approaches to the automatic analysis of multiparty communicative events, such as meetings, is an ambitious research area. We have investigated automatic meeting segmentation both in terms of “Meeting Actions” and “Dialogue Acts”. Dialogue acts model the discourse structure at a fine-grained level, highlighting individual speaker intentions. Group meeting actions describe the same process at a coarse level, highlighting interactions between different meeting participants and showing overall group intentions. A framework based on probabilistic graphical models such as dynamic Bayesian networks (DBNs) has been investigated for both tasks. Our first set of experiments is concerned with the segmentation and structuring of meetings (recorded using multiple cameras and microphones) into sequences of group meeting actions such as monologue, discussion and presentation. We outline four families of multimodal features based on speaker turns, lexical transcription, prosody, and visual motion that are extracted from the raw audio and video recordings. We relate these low-level multimodal features to complex group behaviours, proposing a multistream modelling framework based on dynamic Bayesian networks. Later experiments are concerned with the automatic recognition of Dialogue Acts (DAs) in multiparty conversational speech. We present a joint generative approach based on a switching DBN for DA recognition in which segmentation and classification of DAs are carried out in parallel. This approach models a set of features, related to lexical content and prosody, and incorporates a weighted interpolated factored language model. In conjunction with this joint generative model, we have also investigated the use of a discriminative approach, based on conditional random fields, to perform a reclassification of the segmented DAs.
The DBN-based approach yielded significant improvements when applied both to the meeting action and the dialogue act recognition task. On both tasks, the DBN framework provided an effective factorisation of the state-space and a flexible infrastructure able to integrate a heterogeneous set of resources such as continuous and discrete multimodal features, and statistical language models. Although our experiments have principally targeted multiparty meetings, the features, models, and methodologies developed in this thesis can be employed for a wide range of applications. Moreover, both group meeting actions and DAs offer valuable insights about the current conversational context, providing cues and features for several related research areas such as speaker addressing and focus of attention modelling, automatic speech recognition and understanding, and topic and decision detection.

    Perception and Acquisition of Natural Authentic English Speech for Chinese Learners Using DIT's Speech Technologies

    Given that Chinese language learners are greatly influenced by their mother tongue, which is a tone language rather than an intonation language, learning and coping with authentic English speech seems more difficult for them than for learners from other language backgrounds. The focus of the current research is, on the basis of analysis of the nature of spoken English and spoken Chinese, to help Chinese learners derive benefit from ICT technologies developed by the Technological University Dublin (DIT). The thesis concentrates on investigating the application of speech technologies in bridging the gap between students’ internalised, idealised formulations and natural, authentic English speech. Part of the testing carried out by the present author demonstrates the acceptability of a slow-down algorithm in helping Chinese learners of English reproduce formulaic language. This algorithm is useful because it can slow down audio files to any desired speed between 100% and 40% without distortion, so as to allow language learners to pay attention to the real, rapid flow of ‘messy’ speech and follow the intonation patterns contained in it. The rationale for and the application of natural, dialogic native-to-native English speech to language learning is also explored. The Chinese language learners involved in this study are exposed to authentic, native speech patterns through access to real, informal dialogue in various contexts. In the course of this analysis, the influence of speed of delivery and pitch range on the categorisation of formulaic language is also investigated. The study investigates the potential of the speech tools available to the present author as an effective EFL learning facility, especially for speakers of tone languages, and their role in helping language learners achieve confluent interaction in an English L1 environment.