The Entity-Deduction Arena: A playground for probing the conversational reasoning and planning capabilities of LLMs
Large language models (LLMs) are effective at answering questions that are
clearly asked. However, when faced with ambiguous queries, they can act
unpredictably and produce incorrect outputs. This underscores the need for the
development of intelligent agents capable of asking clarification questions to
resolve ambiguities effectively. This capability requires complex
understanding, state tracking, reasoning and planning over multiple
conversational turns. However, directly measuring this can be challenging. In
this paper, we offer a surrogate problem that assesses an LLM's capability to
deduce an entity unknown to itself, but revealed to a judge, by asking the
judge a series of queries. This entity-deducing game can serve as an evaluation
framework to probe the conversational reasoning and planning capabilities of
language models. We systematically evaluate various LLMs and discover
significant differences in their performance on this task. We find that strong
LLMs like GPT-4 outperform human players by a large margin. We further employ
Behavior Cloning (BC) to examine whether a weaker model can imitate a
stronger model and generalize to new data or domains using only the
stronger model's demonstrations. Finally, we propose using Reinforcement
Learning to enhance the reasoning and planning capacity of Vicuna models
through episodes of game play, which leads to significant performance
improvements. We
hope that this problem offers insights into how autonomous agents could be
trained to behave more intelligently in ambiguous circumstances.
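To make the setup concrete, here is a minimal sketch of the game loop the abstract describes, with an LLM guesser interrogating an LLM judge. The `query_model` helper, the prompts, the turn limit, and the win check are illustrative placeholders, not the paper's exact protocol.

```python
# A minimal sketch of the entity-deduction game loop, assuming a generic
# chat-LLM call `query_model` (hypothetical; any API or local model works).
MAX_TURNS = 20

def query_model(system_prompt: str, transcript: list[str]) -> str:
    """Placeholder for an LLM call (e.g. GPT-4 or Vicuna behind an API)."""
    raise NotImplementedError

def play_game(entity: str) -> bool:
    """Return True if the guesser deduces `entity` within MAX_TURNS."""
    judge_prompt = (
        f"You know the secret entity '{entity}'. Answer the guesser's "
        "yes/no questions truthfully and confirm a correct final guess."
    )
    guesser_prompt = (
        "Identify the secret entity by asking yes/no questions. "
        "When confident, state your guess directly."
    )
    transcript: list[str] = []
    for turn in range(1, MAX_TURNS + 1):
        question = query_model(guesser_prompt, transcript)
        transcript.append(f"Q{turn}: {question}")
        answer = query_model(judge_prompt, transcript)
        transcript.append(f"A{turn}: {answer}")
        # Crude win check: the judge confirms a turn that names the entity.
        if entity.lower() in question.lower() and answer.lower().startswith("yes"):
            return True
    return False
```

A game played this way yields a full transcript plus a success signal, which is exactly the kind of episode the BC and RL stages described above can train on.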
More Speaking or More Speakers?
Self-training (ST) and self-supervised learning (SSL) methods have
demonstrated strong improvements in automatic speech recognition (ASR). In
spite of these advances, to the best of our knowledge, there is no analysis of
how the composition of the labelled and unlabelled datasets used in these
methods affects the results. In this work we aim to analyse the effect of
the number of speakers in the training data on a recent SSL algorithm (wav2vec
2.0), and a recent ST algorithm (slimIPL). We perform a systematic analysis on
both labelled and unlabelled data by varying the number of speakers while keeping
the number of hours fixed and vice versa. Our findings suggest that SSL
requires a large amount of unlabelled data to produce high-accuracy results,
while ST requires a sufficient number of speakers in the labelled data,
especially in the low-resource setting. In this way, the two approaches
improve supervised learning in different regimes of dataset composition.
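As an illustration of this kind of controlled comparison, the sketch below builds a training subset that fixes the number of speakers and then fills a fixed hour budget; varying speakers while holding hours fixed (or vice versa) follows from changing one argument. The manifest field names ('speaker', 'duration') are assumptions for the example, not the authors' tooling.

```python
import random
from collections import defaultdict

def subset_fixed_hours(manifest, n_speakers, target_hours, seed=0):
    """manifest: iterable of dicts with 'speaker' and 'duration' (seconds)."""
    rng = random.Random(seed)
    by_speaker = defaultdict(list)
    for utt in manifest:
        by_speaker[utt["speaker"]].append(utt)
    # Fix the speaker count first, then fill the hour budget from their pool.
    chosen = rng.sample(sorted(by_speaker), n_speakers)
    pool = [utt for spk in chosen for utt in by_speaker[spk]]
    rng.shuffle(pool)
    subset, total_sec = [], 0.0
    for utt in pool:
        if total_sec >= target_hours * 3600:
            break
        subset.append(utt)
        total_sec += utt["duration"]
    return subset
```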
Learning Hard Alignments with Variational Inference
There has recently been significant interest in hard attention models for
tasks such as object recognition, visual captioning and speech recognition.
Hard attention can offer benefits over soft attention such as decreased
computational cost, but training hard attention models can be difficult because
of the discrete latent variables they introduce. Previous work used REINFORCE
and Q-learning to address these issues, but those methods can yield
high-variance gradient estimates and be slow to train. In this paper, we tackle
the problem of learning hard attention for a sequential task using variational
inference methods, specifically the recently introduced VIMCO and NVIL.
Furthermore, we propose a novel baseline that adapts VIMCO to this setting. We
demonstrate our method on a phoneme recognition task in clean and noisy
environments and show that our method outperforms REINFORCE, with the
difference being greater for a more complicated task.
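For intuition, here is a minimal PyTorch sketch of VIMCO's key idea: a leave-one-out baseline that reduces the variance of REINFORCE-style score-function gradients for a multi-sample bound over discrete latents. It assumes the log importance weights and sample log-probabilities are precomputed; shapes and naming are illustrative, and this is not the paper's training code.

```python
import math
import torch

def vimco_loss(log_w: torch.Tensor, log_q: torch.Tensor) -> torch.Tensor:
    """log_w, log_q: [K, B] tensors (K latent samples, batch size B).
    log_w holds log p(x, z_k) - log q(z_k | x); log_q holds log q(z_k | x).
    Returns a scalar loss whose gradient matches the VIMCO estimator of the
    multi-sample bound L = log mean_k exp(log_w_k)."""
    K = log_w.size(0)
    L = torch.logsumexp(log_w, dim=0) - math.log(K)              # [B]
    # Leave-one-out baseline: for each j, replace log_w_j by the
    # arithmetic mean of the other K-1 log weights.
    loo_mean = (log_w.sum(0, keepdim=True) - log_w) / (K - 1)    # [K, B]
    lw = log_w.unsqueeze(0).expand(K, -1, -1).clone()            # [K, K, B]
    idx = torch.arange(K)
    lw[idx, idx] = loo_mean
    L_without_j = torch.logsumexp(lw, dim=1) - math.log(K)       # [K, B]
    learning_signal = (L.unsqueeze(0) - L_without_j).detach()
    # Score-function term with per-sample baselines, plus the pathwise
    # gradient through the bound itself; negate to get a loss to minimize.
    return -((learning_signal * log_q).sum(0).mean() + L.mean())
```

Compared with plain REINFORCE, the per-sample baseline `L_without_j` cancels most of the learning signal's magnitude without biasing the gradient, which is the variance reduction the abstract alludes to.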