Tracking of enriched dialog states for flexible conversational information access
Dialog state tracking (DST) is a crucial component in a task-oriented dialog
system for conversational information access. A common practice in current
dialog systems is to define the dialog state as a set of slot-value pairs. This
representation and the slot-filling based DST built on it are widely
employed, but they suffer from three drawbacks: (1) the dialog state can contain
only a single value for a slot; (2) it can express only the user's affirmative
preference over the values for a slot; and (3) current task-based dialog systems
mainly focus on the searching task, while the enquiring task is also very
common in practice. These observations motivate us to enrich the current
representation of dialog states and to collect a new dialog dataset about
movies, on which we build a new DST method, called enriched DST (EDST), for
flexible access to movie information. The EDST supports the searching task, the
enquiring task, and their mixture. We show that the new EDST method not only
achieves good results on the Iqiyi dataset, but also outperforms other
state-of-the-art DST methods on the traditional dialog datasets, WOZ2.0 and
DSTC2.
Comment: 5 pages, 2 figures, accepted by ICASSP201
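To make the contrast with plain slot-value pairs concrete, the following is a minimal Python sketch of a dialog state that allows several values per slot, records negative as well as affirmative preferences, and tracks slots the user is enquiring about. It is an illustration only: the class names, field names, and update logic are hypothetical and are not taken from the paper's EDST model.

```python
from dataclasses import dataclass, field


@dataclass
class SlotState:
    """Hypothetical per-slot record: values the user wants and values the user rejects."""
    liked: set = field(default_factory=set)
    disliked: set = field(default_factory=set)


@dataclass
class EnrichedDialogState:
    """Hypothetical enriched state covering searching and enquiring."""
    slots: dict = field(default_factory=dict)    # slot name -> SlotState
    requested: set = field(default_factory=set)  # slots the user is asking about

    def update(self, slot, value, affirm=True):
        """Record an affirmative or negative preference for one value of a slot."""
        state = self.slots.setdefault(slot, SlotState())
        if affirm:
            state.liked.add(value)
            state.disliked.discard(value)
        else:
            state.disliked.add(value)
            state.liked.discard(value)

    def enquire(self, slot):
        """Record that the user asked for the value of a slot."""
        self.requested.add(slot)


# Example turns: two liked genres, one rejected actor, one enquiry.
state = EnrichedDialogState()
state.update("genre", "comedy")
state.update("genre", "romance")
state.update("actor", "Actor A", affirm=False)
state.enquire("director")
```

A plain slot-value dictionary would overwrite "comedy" with "romance" and could not record the rejected actor; the sketch keeps both genres and stores the rejection separately.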
Deep Multimodal Speaker Naming
Automatic speaker naming is the problem of localizing as well as identifying
each speaking character in a TV/movie/live show video. This is a challenging
problem mainly because of its multimodal nature: the face cue alone is
insufficient to achieve good performance. Previous multimodal approaches to
this problem usually process the data of different modalities individually and
merge them using handcrafted heuristics. Such approaches work well for simple
scenes, but fail to achieve high performance for speakers with large appearance
variations. In this paper, we propose a novel convolutional neural network
(CNN) based learning framework to automatically learn the fusion function of
both face and audio cues. We show that without using face tracking, facial
landmark localization or subtitle/transcript, our system with robust multimodal
feature extraction is able to achieve state-of-the-art speaker naming
performance evaluated on two diverse TV series. The dataset and implementation
of our algorithm are publicly available online.
Stochastic Nonlinear Control via Finite-dimensional Spectral Dynamic Embedding
Optimal control is notoriously difficult for stochastic nonlinear systems.
Ren et al. introduced Spectral Dynamics Embedding for developing reinforcement
learning methods for controlling an unknown system. It uses an
infinite-dimensional feature to linearly represent the state-value function and
exploits finite-dimensional truncation approximation for practical
implementation. However, the finite-dimensional approximation properties in
control have not been investigated, even when the model is known. In this paper,
we present Spectral Dynamics Embedding Control (SDEC), a tractable stochastic
nonlinear control algorithm that exploits the nonlinear dynamics through the
finite-dimensional feature approximation, together with an in-depth theoretical
analysis that characterizes the approximation error induced by the
finite-dimensional truncation and the statistical error induced by
finite-sample approximation in both policy
evaluation and policy optimization. We also empirically test the algorithm and
compare its performance with Koopman-based and iLQR methods on the
pendulum swing-up problem.