11,554 research outputs found

    Tracking of enriched dialog states for flexible conversational information access

    Full text link
    Dialog state tracking (DST) is a crucial component in a task-oriented dialog system for conversational information access. A common practice in current dialog systems is to define the dialog state by a set of slot-value pairs. Such representation of dialog states and the slot-filling based DST have been widely employed, but suffer from three drawbacks. (1) The dialog state can contain only a single value for a slot, and (2) can contain only users' affirmative preference over the values for a slot. (3) Current task-based dialog systems mainly focus on the searching task, while the enquiring task is also very common in practice. The above observations motivate us to enrich current representation of dialog states and collect a brand new dialog dataset about movies, based upon which we build a new DST, called enriched DST (EDST), for flexible accessing movie information. The EDST supports the searching task, the enquiring task and their mixed task. We show that the new EDST method not only achieves good results on Iqiyi dataset, but also outperforms other state-of-the-art DST methods on the traditional dialog datasets, WOZ2.0 and DSTC2.Comment: 5 pages, 2 figures, accepted by ICASSP201

    Deep Multimodal Speaker Naming

    Full text link
    Automatic speaker naming is the problem of localizing as well as identifying each speaking character in a TV/movie/live show video. This is a challenging problem mainly attributes to its multimodal nature, namely face cue alone is insufficient to achieve good performance. Previous multimodal approaches to this problem usually process the data of different modalities individually and merge them using handcrafted heuristics. Such approaches work well for simple scenes, but fail to achieve high performance for speakers with large appearance variations. In this paper, we propose a novel convolutional neural networks (CNN) based learning framework to automatically learn the fusion function of both face and audio cues. We show that without using face tracking, facial landmark localization or subtitle/transcript, our system with robust multimodal feature extraction is able to achieve state-of-the-art speaker naming performance evaluated on two diverse TV series. The dataset and implementation of our algorithm are publicly available online

    Stochastic Nonlinear Control via Finite-dimensional Spectral Dynamic Embedding

    Full text link
    Optimal control is notoriously difficult for stochastic nonlinear systems. Ren et al. introduced Spectral Dynamics Embedding for developing reinforcement learning methods for controlling an unknown system. It uses an infinite-dimensional feature to linearly represent the state-value function and exploits finite-dimensional truncation approximation for practical implementation. However, the finite-dimensional approximation properties in control have not been investigated even when the model is known. In this paper, we provide a tractable stochastic nonlinear control algorithm that exploits the nonlinear dynamics upon the finite-dimensional feature approximation, Spectral Dynamics Embedding Control (SDEC), with an in-depth theoretical analysis to characterize the approximation error induced by the finite-dimension truncation and statistical error induced by finite-sample approximation in both policy evaluation and policy optimization. We also empirically test the algorithm and compare the performance with Koopman-based methods and iLQR methods on the pendulum swingup problem
    • …
    corecore