87 research outputs found

    The Dialog State Tracking Challenge Series: A Review

    Get PDF
    In a spoken dialog system, dialog state tracking refers to the task of correctly inferring the state of the conversation -- such as the user's goal -- given all of the dialog history up to that turn.  Dialog state tracking is crucial to the success of a dialog system, yet until recently there were no common resources, hampering progress.  The Dialog State Tracking Challenge series of 3 tasks introduced the first shared testbed and evaluation metrics for dialog state tracking, and has underpinned three key advances in dialog state tracking: the move from generative to discriminative models; the adoption of discriminative sequential techniques; and the incorporation of the speech recognition results directly into the dialog state tracker.  This paper reviews this research area, covering both the challenge tasks themselves and summarizing the work they have enabled

    A Multi-Task Approach to Incremental Dialogue State Tracking

    Get PDF
    Incrementality is a fundamental feature of language in real world use. To this point, however, the vast majority of work in automated dialogue processing has focused on language as turn based. In this paper we explore the challenge of incremental dialogue state tracking through the development and analysis of a multi-task approach to incremental dialogue state tracking. We present the design of our incremental dialogue state tracker in detail and provide evaluation against the well known Dialogue State Tracking Challenge 2 (DSTC2) dataset. In addition to a standard evaluation of the tracker, we also provide an analysis of the Incrementality phenomenon in our model’s performance by analyzing how early our models can produce correct predictions and how stable those predictions are. We find that the Multi-Task Learning-based model achieves state-of-the-art results for incremental processing

    A data-driven approach to spoken dialog segmentation

    Get PDF
    In This Paper, We Present A Statistical Model For Spoken Dialog Segmentation That Decides The Current Phase Of The Dialog By Means Of An Automatic Classification Process. We Have Applied Our Proposal To Three Practical Conversational Systems Acting In Different Domains. The Results Of The Evaluation Show That Is Possible To Attain High Accuracy Rates In Dialog Segmentation When Using Different Sources Of Information To Represent The User Input. Our Results Indicate How The Module Proposed Can Also Improve Dialog Management By Selecting Better System Answers. The Statistical Model Developed With Human-Machine Dialog Corpora Has Been Applied In One Of Our Experiments To Human-Human Conversations And Provides A Good Baseline As Well As Insights In The Model Limitation

    Prosody-Based Automatic Segmentation of Speech into Sentences and Topics

    Get PDF
    A crucial step in processing speech audio data for information extraction, topic detection, or browsing/playback is to segment the input into sentence and topic units. Speech segmentation is challenging, since the cues typically present for segmenting text (headers, paragraphs, punctuation) are absent in spoken language. We investigate the use of prosody (information gleaned from the timing and melody of speech) for these tasks. Using decision tree and hidden Markov modeling techniques, we combine prosodic cues with word-based approaches, and evaluate performance on two speech corpora, Broadcast News and Switchboard. Results show that the prosodic model alone performs on par with, or better than, word-based statistical language models -- for both true and automatically recognized words in news speech. The prosodic model achieves comparable performance with significantly less training data, and requires no hand-labeling of prosodic events. Across tasks and corpora, we obtain a significant improvement over word-only models using a probabilistic combination of prosodic and lexical information. Inspection reveals that the prosodic models capture language-independent boundary indicators described in the literature. Finally, cue usage is task and corpus dependent. For example, pause and pitch features are highly informative for segmenting news speech, whereas pause, duration and word-based cues dominate for natural conversation.Comment: 30 pages, 9 figures. To appear in Speech Communication 32(1-2), Special Issue on Accessing Information in Spoken Audio, September 200

    EM Algorithms for Weighted-Data Clustering with Application to Audio-Visual Scene Analysis

    Get PDF
    Data clustering has received a lot of attention and numerous methods, algorithms and software packages are available. Among these techniques, parametric finite-mixture models play a central role due to their interesting mathematical properties and to the existence of maximum-likelihood estimators based on expectation-maximization (EM). In this paper we propose a new mixture model that associates a weight with each observed point. We introduce the weighted-data Gaussian mixture and we derive two EM algorithms. The first one considers a fixed weight for each observation. The second one treats each weight as a random variable following a gamma distribution. We propose a model selection method based on a minimum message length criterion, provide a weight initialization strategy, and validate the proposed algorithms by comparing them with several state of the art parametric and non-parametric clustering techniques. We also demonstrate the effectiveness and robustness of the proposed clustering technique in the presence of heterogeneous data, namely audio-visual scene analysis.Comment: 14 pages, 4 figures, 4 table

    Automatic recognition of multiparty human interactions using dynamic Bayesian networks

    Get PDF
    Relating statistical machine learning approaches to the automatic analysis of multiparty communicative events, such as meetings, is an ambitious research area. We have investigated automatic meeting segmentation both in terms of “Meeting Actions” and “Dialogue Acts”. Dialogue acts model the discourse structure at a fine grained level highlighting individual speaker intentions. Group meeting actions describe the same process at a coarse level, highlighting interactions between different meeting participants and showing overall group intentions. A framework based on probabilistic graphical models such as dynamic Bayesian networks (DBNs) has been investigated for both tasks. Our first set of experiments is concerned with the segmentation and structuring of meetings (recorded using multiple cameras and microphones) into sequences of group meeting actions such as monologue, discussion and presentation. We outline four families of multimodal features based on speaker turns, lexical transcription, prosody, and visual motion that are extracted from the raw audio and video recordings. We relate these lowlevel multimodal features to complex group behaviours proposing a multistreammodelling framework based on dynamic Bayesian networks. Later experiments are concerned with the automatic recognition of Dialogue Acts (DAs) in multiparty conversational speech. We present a joint generative approach based on a switching DBN for DA recognition in which segmentation and classification of DAs are carried out in parallel. This approach models a set of features, related to lexical content and prosody, and incorporates a weighted interpolated factored language model. In conjunction with this joint generative model, we have also investigated the use of a discriminative approach, based on conditional random fields, to perform a reclassification of the segmented DAs. The DBN based approach yielded significant improvements when applied both to the meeting action and the dialogue act recognition task. On both tasks, the DBN framework provided an effective factorisation of the state-space and a flexible infrastructure able to integrate a heterogeneous set of resources such as continuous and discrete multimodal features, and statistical language models. Although our experiments have been principally targeted on multiparty meetings; features, models, and methodologies developed in this thesis can be employed for a wide range of applications. Moreover both group meeting actions and DAs offer valuable insights about the current conversational context providing valuable cues and features for several related research areas such as speaker addressing and focus of attention modelling, automatic speech recognition and understanding, topic and decision detection
    corecore