The Dialog State Tracking Challenge Series: A Review
In a spoken dialog system, dialog state tracking refers to the task of correctly inferring the state of the conversation -- such as the user's goal -- given all of the dialog history up to that turn. Dialog state tracking is crucial to the success of a dialog system, yet until recently there were no common resources, hampering progress. The Dialog State Tracking Challenge series of three tasks introduced the first shared testbed and evaluation metrics for dialog state tracking, and has underpinned three key advances: the move from generative to discriminative models; the adoption of discriminative sequential techniques; and the incorporation of speech recognition results directly into the dialog state tracker. This paper reviews this research area, covering both the challenge tasks themselves and the work they have enabled.
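The turn-by-turn inference the abstract describes can be sketched in a few lines. This is a minimal toy illustration, not any challenge entry: the domain, scores, and update rule are assumptions made for clarity.

```python
# Minimal sketch (toy domain, illustrative numbers) of how a tracker
# maintains a distribution over user goals and updates it each turn from
# noisy, scored ASR hypotheses.

def update_belief(belief, asr_scores, floor=0.05):
    """Rescore each goal by the turn's ASR evidence and renormalise.

    belief: dict mapping goal -> probability (sums to 1)
    asr_scores: dict mapping goal -> ASR confidence for this turn
    """
    rescored = {g: p * asr_scores.get(g, floor) for g, p in belief.items()}
    total = sum(rescored.values())
    return {g: p / total for g, p in rescored.items()}

# Uniform prior over three candidate price-range goals.
belief = {"cheap": 1 / 3, "moderate": 1 / 3, "expensive": 1 / 3}
belief = update_belief(belief, {"cheap": 0.7, "moderate": 0.2})  # turn 1
belief = update_belief(belief, {"cheap": 0.8})                   # turn 2
best_goal = max(belief, key=belief.get)  # "cheap", with sharpened confidence
```

Evidence for the same goal across turns sharpens the distribution, which is how a tracker can recover from a single misrecognised turn.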
Data-Driven Policy Optimisation for Multi-Domain Task-Oriented Dialogue
Recent developments in machine learning, along with a general shift in public attitudes towards digital personal assistants, have opened new frontiers for conversational systems. Nevertheless, building data-driven multi-domain conversational agents that act optimally given a dialogue context is an open challenge. The first step towards that goal is developing an efficient way of learning a dialogue policy in new domains. Secondly, it is important to have the ability to collect and utilise human-human conversational data to bootstrap an agent's knowledge. The work presented in this thesis demonstrates how a neural dialogue manager fine-tuned with reinforcement learning presents a viable approach for learning a dialogue policy efficiently and across many domains.
The thesis starts by introducing a dialogue management module that learns through interaction to act optimally given the current context of a conversation. The current shift towards neural, parameter-rich systems does not fully address the problem of error noise coming from speech recognition or natural language understanding components. A Bayesian approach is therefore proposed to learn a more robust and effective policy through direct interaction, without any prior data. By placing a distribution over model weights, the learning agent is less prone to overfitting to particular dialogue realizations, and a more efficient exploration policy can therefore be employed. The results show that deep reinforcement learning performs on par with non-parametric models even in a low-data regime, while significantly reducing computational complexity compared with the previous state of the art.
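The exploration benefit of a distribution over weights can be illustrated with Thompson sampling: draw one concrete weight setting per dialogue and act greedily under that draw. The toy linear Q-function and all names below are illustrative assumptions, not the thesis's model.

```python
import random

# Illustrative sketch: a Gaussian distribution over each weight (instead of
# a point estimate) lets the agent explore by Thompson sampling -- sample
# one set of weights, then act greedily under that sample.

class BayesianLinearQ:
    def __init__(self, n_features, n_actions, rng=None):
        self.rng = rng or random.Random(0)
        # Per-weight mean and standard deviation (the approximate posterior).
        self.mean = [[0.0] * n_features for _ in range(n_actions)]
        self.std = [[1.0] * n_features for _ in range(n_actions)]

    def sample_weights(self):
        """Draw one concrete weight setting from the posterior."""
        return [[self.rng.gauss(m, s) for m, s in zip(mr, sr)]
                for mr, sr in zip(self.mean, self.std)]

    def act(self, features):
        """Thompson sampling: greedy action under the sampled weights."""
        w = self.sample_weights()
        q = [sum(wi * fi for wi, fi in zip(row, features)) for row in w]
        return q.index(max(q))

agent = BayesianLinearQ(n_features=3, n_actions=2)
action = agent.act([1.0, 0.5, -0.2])  # varies across calls while uncertain
```

As the posterior standard deviations shrink with experience, sampled weights concentrate around the means and the policy becomes effectively greedy, which is the "more efficient exploration" trade-off the abstract refers to.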
The deployment of a dialogue manager without any pre-training on human conversations is not a viable option from an industry perspective. However, progress in building statistical systems, particularly dialogue managers, is hindered by the scale of the data available. To address this fundamental obstacle, a novel data-collection pipeline based entirely on crowdsourcing, without the need to hire professional annotators, is introduced. Validation of the approach results in the collection of the Multi-Domain Wizard-of-Oz dataset (MultiWOZ), a fully labeled collection of human-human written conversations spanning multiple domains and topics. The proposed dataset establishes a set of new benchmarks (belief tracking, policy optimisation, and response generation), significantly raising the complexity of the analysed dialogues.
The collected dataset serves as the foundation for a novel reinforcement learning (RL)-based approach to training a multi-domain dialogue manager. A Multi-Action and Slot Dialogue Agent (MASDA) is proposed to address two limitations: 1) handling complex multi-domain dialogues with multiple concurrent actions present in a single turn; and 2) a lack of interpretability, which impedes the use of intermediate signals (e.g., dialogue turn annotations) when such signals are available. MASDA explicitly models system acts and slots using intermediate signals, resulting in an improved task-based end-to-end framework. The model can also select concurrent actions in a single turn, enriching the representation of the generated responses. The proposed framework allows for RL training against dialogue task-completion metrics when dealing with concurrent actions. The results demonstrate the advantages of both 1) handling concurrent actions and 2) exploiting intermediate signals: MASDA outperforms previous end-to-end frameworks while also offering improved scalability.
Discriminative methods for statistical spoken dialogue systems
Dialogue promises a natural and effective method for users to interact with and obtain information from computer systems. Statistical spoken dialogue systems are able to disambiguate in the presence of errors by maintaining probability distributions over what they believe to be the state of a dialogue. However, traditionally these distributions have been derived using generative models, which do not directly optimise for the criterion of interest and cannot easily exploit arbitrary information that may potentially be useful. This thesis shows how discriminative methods can overcome these problems in Spoken Language Understanding (SLU) and Dialogue State Tracking (DST).
A robust method for SLU is proposed, based on features extracted from the full posterior distribution of recognition hypotheses encoded in the form of word confusion networks. This method uses discriminative classifiers trained on unaligned input/output pairs. Performance is evaluated both on an off-line corpus and on-line in a live user trial. It is shown that a statistical discriminative approach to SLU operating on the full posterior ASR output distribution can substantially improve performance in terms of both accuracy and overall dialogue reward. Furthermore, additional gains can be obtained by incorporating features from the system's output.
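A word confusion network is a sequence of "slots", each holding competing word hypotheses with posterior probabilities. One simple feature family such a front end can use is posterior-weighted word counts; the sketch below is a toy illustration of that idea under made-up probabilities, not the thesis's actual feature set, which is richer.

```python
# Expected (posterior-weighted) bag-of-words features from a word
# confusion network: each slot is a list of (word, posterior) pairs.

def cnet_bow_features(confusion_network):
    """Expected word counts under the ASR posterior."""
    features = {}
    for slot in confusion_network:
        for word, prob in slot:
            features[word] = features.get(word, 0.0) + prob
    return features

cnet = [
    [("cheap", 0.6), ("jeep", 0.3), ("<eps>", 0.1)],
    [("restaurant", 0.9), ("rest", 0.1)],
]
feats = cnet_bow_features(cnet)
# "cheap" keeps weight 0.6 here, whereas a 1-best decode would drop it
# entirely if "jeep" happened to win its slot.
```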
For DST, a new word-based tracking method is presented that maps directly from the speech recognition results to the dialogue state without using an explicit semantic decoder. The method is based on a recurrent neural network structure that is capable of generalising to unseen dialogue state hypotheses, and requires very little feature engineering. The method is evaluated in the second and third Dialog State Tracking Challenges, as well as in a live user trial. The results demonstrate consistently high performance across all of the off-line metrics and a substantial increase in the quality of the dialogues in the live trial. The proposed method is shown to be readily applicable to expanding dialogue domains, by exploiting robust features and a new method for online unsupervised adaptation. It is shown how the neural network structure can be adapted to output structured joint distributions, giving an improvement over estimating the dialogue state as a product of marginal distributions.
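The final point, joint distributions versus a product of marginals, can be made concrete with a toy two-slot example (the numbers are invented for illustration): factoring the state into per-slot marginals discards correlations between slots.

```python
from itertools import product

# Toy illustration of why a joint output distribution can beat a product
# of per-slot marginals: the factorisation discards slot correlations.

joint = {  # P(food, area): the user wants either Indian in the north
    ("indian", "north"): 0.45,   # or Thai in the south.
    ("thai", "south"): 0.45,
    ("indian", "south"): 0.05,
    ("thai", "north"): 0.05,
}

def marginalise(joint, index):
    out = {}
    for combo, p in joint.items():
        out[combo[index]] = out.get(combo[index], 0.0) + p
    return out

food = marginalise(joint, 0)   # {"indian": 0.5, "thai": 0.5}
area = marginalise(joint, 1)   # {"north": 0.5, "south": 0.5}
factored = {(f, a): food[f] * area[a] for f, a in product(food, area)}
# The factored estimate flattens the sharply bimodal joint: both likely
# combinations drop from 0.45 to 0.25, the same mass as the unlikely ones.
```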
A chatbot for automatic question answering in the information technology domain
MSc dissertation, Engenharia Informática (Interação e Conhecimento), Universidade de Lisboa, Faculdade de Ciências, 2019. Chatbots have been the subject of intense study by the Artificial Intelligence and Natural Language Processing communities, and their future looks promising. The idea of automating conversations through technology is attractive to many companies, since it provides several benefits at a relatively low cost: for example, a chatbot can offer round-the-clock customer support, communicate with a large number of people simultaneously, and automate repetitive work. However, current systems sometimes cannot keep up with users' increasingly demanding expectations, and at times fail to deliver an experience as simple and efficient as we would like. The lack of datasets for training models is one of the main problems researchers face, since for a dataset to be useful it needs a very large number of conversations. Another frequent problem is the difficulty of developing chatbots capable of producing convincing dialogues, similar to what a human would say in a given context. The goal of this dissertation is to build a chatbot that can answer questions in an IT-support setting. The system should be able to analyse a given question and, based on the information it was trained on, return one or more possible correct answers. Given the NLP community's current interest in generative chatbots, which, given a context, generate their own responses (usually word by word), and given their greater ability to adapt to new questions absent from the training data, this type of chatbot was chosen as the central focus of the work. Accordingly, three generative chatbots were replicated by training and evaluating them: the Hierarchical Recurrent Encoder-Decoder (HRED), the Variational Hierarchical Recurrent Encoder-Decoder (VHRED), and Variational Hierarchical Conversation RNNs (VHCR). Of the three, HRED is the simplest, closely resembling a basic encoder-decoder model; besides the encoder and the decoder, HRED uses an additional Recurrent Neural Network (RNN) that maintains information about the context of the current conversation and is used to condition the model's generated output. VHRED and VHCR, in turn, are Variational Autoencoders, a class of generative systems that has been intensively studied in recent years for its capacity to generate new data. Although initially applied to image generation, this class of models was brought to Natural Language Processing through VHRED, with the goal of generating more diverse sentences than existing models and of capturing global information about the training data. VHCR is an extension of VHRED with a very similar structure, proposed to mitigate a deficiency of VHRED in making proper use of a latent variable that is central to obtaining the intended results. Another contribution of this dissertation is a tool for extracting new datasets of human-human dialogues. The tool uses Reddit, a website where thousands of users share content daily in the form of questions, articles, and links, as the source of the extracted dialogues. More concretely, the tool builds dialogues by querying databases that are released online every month. An interesting aspect of the tool is that it allows datasets from a wide variety of domains to be extracted, making it possible to obtain datasets for domains where none yet exist. One dataset extracted with this tool, named askIT, is broadly focused on the information technology domain, is in English, and covers several distinct topics, including Ubuntu and programming-language questions; it was used to train the models mentioned above. Another extracted dataset, named Portuguese, consists mostly of dialogues in Portuguese; unlike the previous dataset it has no particular focus, and it includes dialogues on a wide variety of subjects, including culture and current affairs. Besides these datasets extracted by me, the Ubuntu Dialogue Corpus, a collection of technical-support dialogues about Ubuntu, was also used to train the models; it was chosen for its large size and for covering a topic close to the one intended for my chatbot. To evaluate these models, two distinct forms of evaluation were used: one based on semantic vector representations (embeddings), extrinsic to the model, and one based on word perplexity, intrinsic to the model. Embedding-based evaluation, as the name suggests, analyses and compares embeddings, vectors that represent a word or sentence. Each embedding tries to capture the meaning of the corresponding word or sentence, so similar words or sentences are represented by similar embeddings. In this evaluation, the embeddings of the sentences generated by the models are compared with the embeddings of the reference sentences in the test set. Perplexity-based evaluation, in turn, measures how surprised the trained model is when predicting the test set. In addition, using a program I wrote to interact directly with the model, a number of questions were put to the VHCR trained on the Ubuntu Dialogue Corpus and on askIT in order to judge its capabilities. These evaluations showed that the models trained on the Ubuntu Dialogue Corpus obtained better results in the extrinsic evaluation, while those trained on askIT obtained better results in the intrinsic one. From the answers given by the VHCR, however, it was concluded that for the goal of building a chatbot able to answer questions in an IT-support setting, the models trained on the Ubuntu Dialogue Corpus would be the most suitable. It was also concluded that for this kind of application, retrieval-based chatbots remain the most suitable, because the responses produced by generative chatbots do not yet seem to be at a competitive level.
Recently, chatbots have been thoroughly studied by the Artificial Intelligence and the Natural Language Processing communities and the future of this technology appears promising. The idea of automating and scaling one-to-one conversations using technology appeals to companies since it can provide benefits in a cost effective way. However, current systems sometimes cannot keep up with the increasingly demanding user expectations as they sometimes fail to deliver experiences that are as seamless and efficient as we envisioned them to be. The lack of datasets to train models is one of the biggest problems faced by researchers
since, for a dataset to be useful, it needs to have a very large number of conversations. Another problem researchers commonly face relates to the difficulty of developing a chatbot
that is capable of generating convincing dialogues, similar to what a human would say in a given context. For this dissertation, I developed a tool that is capable of extracting new datasets containing dialogues between humans. This tool uses Reddit, a website on which thousands of users share content daily, as its source, and allows the creation of datasets related to a wide number of domains. Additionally, three state-of-the-art dialogue models were replicated and trained on two datasets from the information technology domain, one of which was extracted by the above-mentioned tool. Two types of evaluation were conducted, one intrinsic to the models and the other extrinsic, and the results obtained are in line with those reported by the original authors.
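The extraction idea described above can be sketched simply: Reddit comments form reply trees in which each comment records its parent's id, so a dialogue can be recovered by walking from a leaf comment back towards the submission. The field names below mimic the public Reddit data dumps but are assumptions here, as is the toy data.

```python
# Sketch: turn a flat list of Reddit-style comments into root-to-leaf
# dialogues by following parent_id links upwards from each leaf.

def extract_dialogues(comments):
    by_id = {c["id"]: c for c in comments}
    parents = {c["parent_id"] for c in comments}
    dialogues = []
    for leaf in comments:
        if leaf["id"] in parents:
            continue  # not a leaf: someone replied to it
        turns = []
        node = leaf
        while node is not None:           # walk up to the thread root
            turns.append(node["body"])
            node = by_id.get(node["parent_id"])
        dialogues.append(list(reversed(turns)))
    return dialogues

comments = [
    {"id": "c1", "parent_id": "post", "body": "My wifi driver broke after an update."},
    {"id": "c2", "parent_id": "c1", "body": "Which kernel version are you on?"},
    {"id": "c3", "parent_id": "c2", "body": "5.4, on Ubuntu 20.04."},
]
dialogues = extract_dialogues(comments)  # one 3-turn dialogue
```

Because every reply chain becomes its own dialogue, subreddit choice directly controls the domain of the resulting dataset, which is what makes the multi-domain extraction described above possible.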
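The two evaluation styles used in the dissertation can also be sketched: embedding-based evaluation compares averaged word vectors of generated and reference responses by cosine similarity, while perplexity exponentiates the average negative log-probability the model assigns to the test words. The tiny vectors and probabilities below are invented for illustration.

```python
import math

# (1) Extrinsic, embedding-based: cosine similarity of averaged vectors.
# (2) Intrinsic: perplexity over per-word model probabilities.

def average_embedding(words, vectors):
    dims = len(next(iter(vectors.values())))
    acc = [0.0] * dims
    for w in words:
        for i, v in enumerate(vectors.get(w, [0.0] * dims)):
            acc[i] += v
    return [a / len(words) for a in acc]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def perplexity(word_probs):
    """word_probs: model probability assigned to each test-set word."""
    nll = -sum(math.log(p) for p in word_probs) / len(word_probs)
    return math.exp(nll)

vectors = {"try": [1.0, 0.0], "rebooting": [0.8, 0.6], "restart": [0.7, 0.7]}
gen = average_embedding(["try", "rebooting"], vectors)
ref = average_embedding(["try", "restart"], vectors)
score = cosine(gen, ref)            # near 1.0 for semantically close replies
ppl = perplexity([0.25, 0.5, 0.1])  # lower is better
```

Cosine rewards paraphrases that share meaning even when the surface words differ, which is why it serves as the extrinsic measure, while perplexity depends only on the model's own probabilities.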
Automatic recognition of multiparty human interactions using dynamic Bayesian networks
Relating statistical machine learning approaches to the automatic analysis of multiparty communicative events, such as meetings, is an ambitious research area. We have investigated automatic meeting segmentation both in terms of “Meeting Actions” and “Dialogue Acts”. Dialogue acts model the discourse structure at a fine-grained level, highlighting individual speaker intentions. Group meeting actions describe the same process at a coarse level, highlighting interactions between different meeting participants and showing overall group intentions.
A framework based on probabilistic graphical models such as dynamic Bayesian networks (DBNs) has been investigated for both tasks. Our first set of experiments is concerned with the segmentation and structuring of meetings (recorded using multiple cameras and microphones) into sequences of group meeting actions such as monologue, discussion, and presentation. We outline four families of multimodal features based on speaker turns, lexical transcription, prosody, and visual motion that are extracted from the raw audio and video recordings. We relate these low-level multimodal features to complex group behaviours, proposing a multistream modelling framework based on dynamic Bayesian networks. Later experiments are concerned with the automatic recognition of Dialogue Acts (DAs) in multiparty conversational speech. We present a joint generative approach based on a switching DBN for DA recognition in which segmentation and classification of DAs are carried out in parallel. This approach models a set of features related to lexical content and prosody, and incorporates a weighted interpolated factored language model. In conjunction with this joint generative model, we have also investigated the use of a discriminative approach, based on conditional random fields, to perform a reclassification of the segmented DAs.
The DBN-based approach yielded significant improvements when applied to both the meeting action and the dialogue act recognition tasks. On both tasks, the DBN framework provided an effective factorisation of the state space and a flexible infrastructure able to integrate a heterogeneous set of resources, such as continuous and discrete multimodal features and statistical language models. Although our experiments have principally targeted multiparty meetings, the features, models, and methodologies developed in this thesis can be employed for a wide range of applications. Moreover, both group meeting actions and DAs offer valuable insights into the current conversational context, providing valuable cues and features for several related research areas such as speaker addressing and focus-of-attention modelling, automatic speech recognition and understanding, and topic and decision detection.
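The joint segmentation-and-classification idea can be illustrated with the simplest member of the DBN family, an HMM: Viterbi decoding assigns a dialogue-act label to every time step, and segment boundaries fall out wherever the decoded label changes. The two-act toy model below is an illustration of that principle only, not the thesis's switching DBN.

```python
# Viterbi decoding over a toy two-act HMM: labels and segmentation are
# obtained jointly from one decode. All probabilities are made up.

def viterbi(obs, states, start, trans, emit):
    V = [{s: start[s] * emit[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: V[t - 1][p] * trans[p][s])
            V[t][s] = V[t - 1][prev] * trans[prev][s] * emit[s][obs[t]]
            back[t][s] = prev
    path = [max(states, key=lambda s: V[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

states = ("statement", "question")
start = {"statement": 0.6, "question": 0.4}
trans = {"statement": {"statement": 0.7, "question": 0.3},
         "question": {"statement": 0.6, "question": 0.4}}
# A single prosodic observation per step: utterance-final pitch contour.
emit = {"statement": {"falling": 0.8, "rising": 0.2},
        "question": {"falling": 0.3, "rising": 0.7}}

labels = viterbi(["falling", "falling", "rising"], states, start, trans, emit)
# Boundaries appear wherever consecutive labels differ.
segments = [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]
```

A switching DBN generalises this by factorising the single hidden state into several interacting variables (e.g. act label, position within the act, and feature-stream regimes), but the decode still yields labels and boundaries in one pass.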
Effectiveness of Text, Acoustic, and Lattice-based representations in Spoken Language Understanding tasks
In this paper, we perform an exhaustive evaluation of different representations to address the intent classification problem in a Spoken Language Understanding (SLU) setup. We benchmark three types of systems to perform the SLU intent detection task: 1) text-based, 2) lattice-based, and a novel 3) multimodal approach. Our work provides a comprehensive analysis of what could be the achievable performance of different state-of-the-art SLU systems under different circumstances, e.g., automatically vs. manually generated transcripts. We evaluate the systems on the publicly available SLURP spoken language resource corpus. Our results indicate that using richer forms of Automatic Speech Recognition (ASR) outputs, namely word-consensus-networks, allows the SLU system to improve over the 1-best setup (5.5% relative improvement). However, crossmodal approaches, i.e., learning from acoustic and text embeddings, obtain performance similar to the oracle setup, a relative improvement of 17.8% over the 1-best configuration, making them a recommended alternative for overcoming the limitations of working with automatically generated transcripts.
Comment: Accepted in ICASSP 202
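The relative-improvement figures above are computed against the 1-best baseline as (new - baseline) / baseline. The absolute scores in this sketch are hypothetical, since the abstract reports only the relative figures.

```python
# Relative improvement over a baseline score. The 80.0 baseline and the
# derived scores are placeholders, not the paper's actual numbers.

def relative_improvement(baseline, new):
    return (new - baseline) / baseline

# e.g. if a hypothetical 1-best baseline scored 80.0 accuracy:
wcn_gain = relative_improvement(80.0, 84.4)          # 0.055 -> 5.5%
crossmodal_gain = relative_improvement(80.0, 94.24)  # 0.178 -> 17.8%
```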