5 research outputs found
Recommended from our members
Recurrent Neural Network Language Generation for Dialogue Systems
Language is the principal medium for ideas, while dialogue is the most natural and effective way for humans to interact with and access information from machines. Natural language generation (NLG) is a critical component of spoken dialogue and it has a significant impact on usability and perceived quality. Many commonly used NLG systems employ rules and heuristics, which tend to generate inflexible and stylised responses without the natural variation of human language. However, the frequent repetition of identical output forms can quickly make dialogue become tedious for most real-world users. Additionally, these rules and heuristics are not scalable and hence not trivially extensible to other domains or languages. A statistical approach to language generation can learn language decisions directly from data without relying on hand-coded rules or heuristics, which brings scalability and flexibility to NLG. Statistical models also provide an opportunity to learn in-domain human colloquialisms and cross-domain model adaptations.
A robust and quasi-supervised NLG model is proposed in this thesis. The model leverages a Recurrent Neural Network (RNN)-based surface realiser and a gating mechanism applied to input semantics. The model is motivated by the Long-Short Term Memory (LSTM) network. The RNN-based surface realiser and gating mechanism use a neural network to learn end-to-end language generation decisions from input dialogue act and sentence pairs; it also integrates sentence planning and surface realisation into a single optimisation problem. The single optimisation not only bypasses the costly intermediate linguistic annotations but also generates more natural and human-like responses. Furthermore, a domain adaptation study shows that the proposed model can be readily adapted and extended to new dialogue domains via a proposed recipe.
Continuing the success of end-to-end learning, the second part of the thesis speculates on building an end-to-end dialogue system by framing it as a conditional generation problem. The proposed model encapsulates a belief tracker with a minimal state representation and a generator that takes the dialogue context to produce responses. These features suggest comprehension and fast learning. The proposed model is capable of understanding requests and accomplishing tasks after training on only a few hundred human-human dialogues. A complementary Wizard-of-Oz data collection method is also introduced to facilitate the collection of human-human conversations from online workers. The results demonstrate that the proposed model can talk to human judges naturally, without any difficulty, for a sample application domain. In addition, the results also suggest that the introduction of a stochastic latent variable can help the system model intrinsic variation in communicative intention much better.Tsung-Hsien Wen's Ph.D. is supported by Toshiba Research Europe Ltd, Cambridge Research Laborator
Recognising and localising human actions
Human action recognition in challenging video data is becoming an increasingly important research area. Given the growing number of cameras and robots pointing their lenses at humans, the need for automatic recognition of human actions arises, promising Google-style video search and automatic video summarisation/description. Furthermore, for any autonomous robotic system to interact with humans, it must rst be able to understand and quickly react to human actions.
Although the best action classication methods aggregate features from the entire video clip in which the action unfolds, this global representation may include irrelevant scene context and movements which are shared amongst multiple action classes. For example, a waving action may be performed whilst
walking, however if the walking movement appears in distinct action classes, then it should not be included in training a waving movement classier. For this reason, we propose an action classication framework in which more discriminative action subvolumes are learned in a weakly supervised setting, owing to the diculty of manually labelling massive video datasets. The learned models are used to simultaneously classify video clips and to localise actions to a given space-time subvolume. Each subvolume is cast as a bag-of-features (BoF) instance in a multiple-instance-learning framework, which
in turn is used to learn its class membership. We demonstrate quantitatively that even with single xed-sized subvolumes, the classication performance of our proposed algorithm is superior to our BoF baseline on the majority of performance measures, and shows promise for space-time action localisation on the most challenging video datasets.
Exploiting spatio-temporal structure in the video should also improve results, just as deformable part models have proven highly successful in object recognition. However, whereas objects have clear boundaries which means we can easily dene a ground truth for initialisation, 3D space-time actions are inherently ambiguous and expensive to annotate in large datasets. Thus, it is desirable to adapt pictorial star models to action datasets without location annotation, and to features invariant to changes in pose such as bag-of-feature and Fisher vectors, rather than low-level HoG. Thus, we propose local deformable spatial bag-of-features (LDSBoF) in which local discriminative regions are split into axed grid of parts that are allowed to deform in both space and time at test-time. In our experimental evaluation we demonstrate that by using local, deformable space-time action parts, we are able to achieve very competitive classification performance, whilst being able to localise actions even in the most challenging video datasets.
A recent trend in action recognition is towards larger and more challenging datasets, an increasing number of action classes and larger visual vocabularies. For the global classication of human action video clips, the bag-of-visual-words pipeline is currently the best performing. However, the strategies chosen to sample features and construct a visual vocabulary are critical to performance, in fact often dominating performance. Thus, we provide a critical evaluation of various approaches to building a vocabulary and show that good practises do have a signicant impact. By subsampling and partitioning
features strategically, we are able to achieve state-of-the-art results on 5 major action recognition datasets using relatively small visual vocabularies.
Another promising approach to recognise human actions first encodes the action sequence via a generative dynamical model. However, using classical distances for their classication does not necessarily deliver good results. Therefore we propose a general framework for learning distance functions between dynamical models, given a training set of labelled videos. The optimal distance function is selected among a family of `pullback' ones, induced by a parametrised mapping of the space of models. We focus here on hidden Markov models and their model space, and show how pullback distance learning greatly improves action recognition performances with respect to base distances.
Finally, the action classication systems that use a single global representation for each video clip are tailored for oine batch classication benchmarks. For human-robot interaction however, current systems fall short, either because they can only detect one human action per video frame, or because they assume the video is available ahead of time. In this work we propose an online human action detection system that can incrementally detect multiple concurrent space-time actions. In this way, it becomes possible to learn new action classes on-the-fly, allowing multiple people to actively teach and interact
with a robot
Policy-Gradient Algorithms for Partially Observable Markov Decision Processes
Partially observable Markov decision processes are interesting because of their ability to model most conceivable real-world learning problems, for example, robot navigation, driving a car, speech recognition, stock trading, and playing games. The downside of this generality is that exact algorithms are computationally intractable. Such computational complexity motivates approximate approaches. One such class of algorithms are the so-called policy-gradient methods from reinforcement learning. They seek to adjust the parameters of an agent in the direction that maximises the long-term average of a reward signal. Policy-gradient methods are attractive as a \emph{scalable} approach for controlling partially observable Markov decision processes (POMDPs). In the most general case POMDP policies require some form of internal state, or memory, in order to act optimally. Policy-gradient methods have shown promise for problems admitting memory-less policies but have been less successful when memory is required. This thesis develops several improved algorithms for learning policies with memory in an infinite-horizon setting. Directly, when the dynamics of the world are known, and via Monte-Carlo methods otherwise. The algorithms simultaneously learn how to act and what to remember. ..
Policy-Gradient Algorithms for Partially Observable Markov Decision Processes
Partially observable Markov decision processes are interesting because of their ability to model most conceivable real-world learning problems, for example, robot navigation, driving a car, speech recognition, stock trading, and playing games. The downside of this generality is that exact algorithms are computationally intractable. Such computational complexity motivates approximate approaches. One such class of algorithms are the so-called policy-gradient methods from reinforcement learning. They seek to adjust the parameters of an agent in the direction that maximises the long-term average of a reward signal. Policy-gradient methods are attractive as a \emph{scalable} approach for controlling partially observable Markov decision processes (POMDPs). In the most general case POMDP policies require some form of internal state, or memory, in order to act optimally. Policy-gradient methods have shown promise for problems admitting memory-less policies but have been less successful when memory is required. This thesis develops several improved algorithms for learning policies with memory in an infinite-horizon setting. Directly, when the dynamics of the world are known, and via Monte-Carlo methods otherwise. The algorithms simultaneously learn how to act and what to remember. ..