Answering Unseen Questions With Smaller Language Models Using Rationale Generation and Dense Retrieval
When provided with sufficient explanatory context, smaller Language Models
have been shown to exhibit strong reasoning ability on challenging short-answer
question-answering tasks where the questions are unseen in training. We
evaluate two methods for further improvement in this setting. Both methods
focus on combining rationales generated by a larger Language Model with longer
contexts created from a multi-hop dense retrieval system. The first method
involves training a Rationale Ranking model to score both
generated rationales and retrieved contexts with respect to relevance and
truthfulness. We then use the scores to derive combined contexts from both
knowledge sources using a number of combinatory strategies. For the second
method we train a smaller Reasoning model using
retrieval-augmented training datasets such that it becomes proficient at
utilising relevant information from longer text sequences that may be only
partially evidential and frequently contain many irrelevant sentences.
Generally we find that both methods are effective, but the second is more
straightforward to apply and produces the strongest results in
the unseen setting on which we focus. Our single best Reasoning model using
only 440 million parameters materially improves upon strong comparable prior
baselines for unseen evaluation datasets (StrategyQA 58.9 → 61.7
acc., CommonsenseQA 63.6 → 72.7 acc., ARC-DA 31.6 →
52.1 F1, IIRC 25.5 → 27.3 F1) and a version utilising our prior
knowledge of each type of question in selecting a context combination strategy
does even better. Our proposed models also generally outperform direct prompts
against much larger models (BLOOM 175B and StableVicuna 13B) in both few-shot
chain-of-thought and few-shot answer-only settings.
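The core of the first method is scoring each knowledge source and then merging them under a combination strategy. A minimal sketch of that idea follows; the scoring function, threshold, and fallback rule are illustrative placeholders, not the authors' actual Rationale Ranking model or strategies.

```python
def combine_contexts(rationale, retrieved, score_fn, threshold=0.5):
    """Build a combined context from a generated rationale and a
    retrieved context, keeping only the components the scorer deems
    sufficiently relevant/truthful."""
    r_score = score_fn(rationale)
    c_score = score_fn(retrieved)
    parts = []
    if r_score >= threshold:
        parts.append(rationale)
    if c_score >= threshold:
        parts.append(retrieved)
    # Fallback strategy: if neither passes, keep the higher-scoring one.
    if not parts:
        parts.append(rationale if r_score >= c_score else retrieved)
    return " ".join(parts)

# Toy scorer (purely illustrative): pretends longer texts are more evidential.
toy_score = lambda text: min(len(text) / 100.0, 1.0)

combined = combine_contexts(
    "Rationale: rainbows require sunlight and water droplets.",
    "Retrieved: a rainbow is caused by refraction of light in droplets.",
    toy_score,
)
print(combined)
```

In the paper's setting the scorer would be a trained ranking model and several such combination strategies are compared; this sketch only shows the shape of the decision.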
The Entity-Deduction Arena: A playground for probing the conversational reasoning and planning capabilities of LLMs
Large language models (LLMs) are effective at answering questions that are
clearly asked. However, when faced with ambiguous queries they can act
unpredictably and produce incorrect outputs. This underscores the need for the
development of intelligent agents capable of asking clarification questions to
resolve ambiguities effectively. This capability requires complex
understanding, state tracking, reasoning and planning over multiple
conversational turns. However, directly measuring this can be challenging. In
this paper, we offer a surrogate problem which assesses an LLM's capability to
deduce an entity unknown to itself, but revealed to a judge, by asking the
judge a series of queries. This entity-deducing game can serve as an evaluation
framework to probe the conversational reasoning and planning capabilities of
language models. We systematically evaluate various LLMs and discover
significant differences in their performance on this task. We find that strong
LLMs like GPT-4 outperform human players by a large margin. We further employ
Behavior Cloning (BC) to examine whether a weaker model is capable of imitating
a stronger model and generalizing to data or domains, using only the
demonstrations from a stronger model. We finally propose to use Reinforcement
Learning to enhance reasoning and planning capacity of Vicuna models through
episodes of game playing, which leads to significant performance improvements. We
hope that this problem offers insights into how autonomous agents could be
trained to behave more intelligently in ambiguous circumstances.
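The entity-deducing game described above has the shape of twenty questions: a guesser queries a judge who knows the hidden entity, narrowing the candidate set each turn. A scripted stand-in for the two LLM players can be sketched as follows (the attribute-based guesser and the toy knowledge base are illustrative, not the arena's actual protocol):

```python
def play(secret, knowledge, max_turns=10):
    """knowledge maps entity -> set of attributes. The guesser asks the
    judge about one attribute per turn and keeps only the candidates
    consistent with the answer, then guesses."""
    candidates = set(knowledge)
    secret_attrs = knowledge[secret]
    all_attrs = sorted({a for attrs in knowledge.values() for a in attrs})
    turns = 0
    for attr in all_attrs:
        if len(candidates) == 1 or turns >= max_turns:
            break
        answer = "yes" if attr in secret_attrs else "no"  # judge's reply
        candidates = {e for e in candidates
                      if (attr in knowledge[e]) == (answer == "yes")}
        turns += 1
    return sorted(candidates)[0], turns

# Toy knowledge base standing in for the judge's world knowledge.
knowledge = {
    "penguin": {"bird", "swims"},
    "eagle":   {"bird", "flies"},
    "salmon":  {"swims"},
}
guess, turns = play("penguin", knowledge)
print(guess, turns)  # 'penguin' after two questions eliminate the others
```

The evaluation in the paper replaces both scripted roles with LLMs and measures how efficiently the guesser converges; this sketch only illustrates the turn structure being probed.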
VNHSGE: VietNamese High School Graduation Examination Dataset for Large Language Models
The VNHSGE (VietNamese High School Graduation Examination) dataset, developed
exclusively for evaluating large language models (LLMs), is introduced in this
article. The dataset, which covers nine subjects, was generated from the
Vietnamese National High School Graduation Examination and comparable tests.
It includes 300 literary essays and over 19,000 multiple-choice
questions on a range of topics. The dataset assesses LLMs in
multitasking situations such as question answering, text generation, reading
comprehension, visual question answering, and more by including both textual
data and accompanying images. We evaluated ChatGPT and BingChat on the
VNHSGE dataset and compared their performance with that of Vietnamese
students. The results show that ChatGPT and
BingChat both perform at a human level in a number of areas, including
literature, English, history, geography, and civics education. They still have
room to improve, though, especially in mathematics, physics,
chemistry, and biology. The VNHSGE dataset seeks to provide an adequate
benchmark for assessing the abilities of LLMs with its wide-ranging coverage
and variety of activities. We intend to promote future developments in the
creation of LLMs by making this dataset available to the scientific community,
especially in addressing LLMs' limitations in disciplines involving mathematics and
the natural sciences.
Pseudo-contractions as Gentle Repairs
Updating a knowledge base to remove an unwanted consequence is a challenging task. Some of the original sentences must be either deleted or weakened in such a way that the sentence to be removed is no longer entailed by the resulting set. On the other hand, it is desirable that the existing knowledge be preserved as much as possible, minimising the loss of information. Several approaches to this problem can be found in the literature. In particular, when the knowledge is represented by an ontology, two different families of frameworks have been developed over the past decades with numerous ideas in common but with little interaction between the communities: applications of AGM-like Belief Change and justification-based Ontology Repair. In this paper, we investigate the relationship between pseudo-contraction operations and gentle repairs. Both aim to avoid the complete deletion of sentences when replacing them with weaker versions is enough to prevent the entailment of the unwanted formula. We show the correspondence between concepts on both sides and investigate under which conditions they are equivalent. Furthermore, we propose a unified notation for the two approaches, which might contribute to the integration of the two areas.
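The key idea above, weakening a sentence instead of deleting it so the unwanted consequence no longer follows, can be illustrated with a propositional toy example. The brute-force entailment check and the particular weakening below are illustrative only; the paper works with formal pseudo-contraction and gentle-repair operators over ontologies, not this construction.

```python
from itertools import product

VARS = ["p", "q", "r"]

def entails(kb, formula):
    """KB |= formula iff formula holds in every model of KB
    (brute force over all truth assignments to VARS)."""
    for values in product([False, True], repeat=len(VARS)):
        v = dict(zip(VARS, values))
        if all(f(v) for f in kb) and not formula(v):
            return False
    return True

p = lambda v: v["p"]
q = lambda v: v["q"]
p_implies_q = lambda v: (not v["p"]) or v["q"]
# Weakened replacement: (p and r) -> q, strictly weaker than p -> q.
pr_implies_q = lambda v: (not (v["p"] and v["r"])) or v["q"]

kb = [p, p_implies_q]
print(entails(kb, q))                 # the unwanted consequence q follows

# Gentle repair: weaken p -> q rather than deleting it outright.
gently_repaired = [p, pr_implies_q]
print(entails(gently_repaired, q))    # q is no longer entailed
print(entails(gently_repaired, p))    # the rest of the knowledge survives
```

The point of both frameworks is exactly this trade-off: the repaired base loses the entailment of the unwanted formula while retaining strictly more content than full deletion of an axiom would.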
LIPIcs, Volume 261, ICALP 2023, Complete Volume