11 research outputs found
Discourse relations and conjoined VPs: automated sense recognition
Sense classification of discourse relations is a sub-task of shallow discourse parsing. Discourse relations can occur both across sentences (inter-sentential) and within sentences (intra-sentential), and more than one discourse relation can hold between the same units. Using a newly available corpus of discourse-annotated intra-sentential conjoined verb phrases, we demonstrate a sequential classification system for their multi-label sense classification. We assess the importance of each feature used in the classification, the feature scope, and what is lost in moving from gold-standard manual parses to the output of an off-the-shelf parser.
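A hedged sketch of the sequential multi-label setup the abstract describes: senses are predicted one at a time, and each decision can condition on the senses already assigned (a classifier-chain-style arrangement). The sense names and scoring function below are illustrative stand-ins, not the paper's actual classifier or label set.

```python
SENSES = ["Conjunction", "Result", "Contrast"]

def score(sense, features, assigned):
    """Hypothetical scorer: a real system would use a trained classifier."""
    base = features.get(sense, 0.0)
    # Conditioning on earlier decisions: discourage an implausible pair.
    if sense == "Result" and "Contrast" in assigned:
        base -= 0.5
    return base

def classify(features, threshold=0.5):
    """Assign senses sequentially, allowing more than one label per unit."""
    assigned = []
    for sense in SENSES:  # fixed prediction order
        if score(sense, features, assigned) >= threshold:
            assigned.append(sense)
    return assigned

labels = classify({"Conjunction": 0.9, "Result": 0.6, "Contrast": 0.2})
print(labels)  # multi-label output: ['Conjunction', 'Result']
```

The key property is that a single unit can receive several senses, and later decisions can see earlier ones.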
Promptly Predicting Structures: The Return of Inference
Prompt-based methods have been used extensively across NLP to build zero- and
few-shot label predictors. Many NLP tasks are naturally structured: that is,
their outputs consist of multiple labels which constrain each other. Annotating
data for such tasks can be cumbersome. Can the promise of the prompt-based
paradigm be extended to such structured outputs? In this paper, we present a
framework for constructing zero- and few-shot linguistic structure predictors.
Our key insight is that we can use structural constraints -- and combinatorial
inference derived from them -- to filter out inconsistent structures predicted
by large language models. We instantiated this framework on two structured
prediction tasks, and five datasets. Across all cases, our results show that
enforcing consistency not only constructs structurally valid outputs, but also
improves performance over the unconstrained variants. Comment: 19 pages, 13
figures. Accepted to NAACL 2024 (Main).
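The filtering idea can be sketched with a toy structural constraint (made-up BIO chunk tags, not the paper's tasks or datasets): enumerate candidate label sequences a model might propose and keep only the structurally valid ones.

```python
from itertools import product

def valid_bio(seq):
    """An I- tag must continue a same-type B- or I- tag."""
    prev = "O"
    for tag in seq:
        if tag.startswith("I-"):
            if prev == "O" or not prev.endswith(tag[2:]):
                return False
        prev = tag
    return True

tags = ["O", "B-NP", "I-NP"]
candidates = list(product(tags, repeat=3))
consistent = [seq for seq in candidates if valid_bio(seq)]
print(len(candidates), len(consistent))  # 27 candidates, 13 valid
```

In the paper's framework the surviving structures come from LLM predictions and the constraints come from the task definition; the principle, discarding inconsistent outputs, is the same.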
Retrieving Texts based on Abstract Descriptions
While instruction-tuned Large Language Models (LLMs) excel at extracting
information from text, they are not suitable for locating texts conforming to a
given description in a large document collection (semantic retrieval).
Similarity search over embedding vectors does allow retrieval by query, but
the similarity reflected in the embedding is ill-defined and inconsistent,
making it sub-optimal for many use cases. What, then, is a good
query representation for effective retrieval?
We identify the well-defined and consistent task of retrieving sentences
based on abstract descriptions of their content. We demonstrate the inadequacy
of current text embeddings and propose an alternative model that significantly
improves when used in standard nearest neighbor search. The model is trained
using positive and negative pairs sourced through prompting an LLM. While it is
easy to source the training material from an LLM, the retrieval task cannot be
performed by the LLM directly. This demonstrates that data from LLMs can be
used not only for distilling more efficient specialized models than the
original LLM, but also for creating new capabilities not immediately possible
using the original model. Comment: A preprint.
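The retrieval setting the abstract targets can be sketched as plain nearest-neighbor search over embedding vectors; the proposed description-based model plugs into exactly this pipeline. The three-dimensional vectors below are toy stand-ins for real sentence embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

corpus = {
    "sent_a": [0.9, 0.1, 0.0],
    "sent_b": [0.1, 0.8, 0.2],
    "sent_c": [0.0, 0.2, 0.9],
}
query = [0.85, 0.15, 0.05]  # embedding of an abstract description

ranked = sorted(corpus, key=lambda k: cosine(query, corpus[k]), reverse=True)
print(ranked[0])  # nearest neighbor: sent_a
```

The paper's contribution is not this search step but the embedding model feeding it, trained so that a description and the sentences it covers land near each other.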
QASem Parsing: Text-to-text Modeling of QA-based Semantics
Several recent works have suggested to represent semantic relations with
questions and answers, decomposing textual information into separate
interrogative natural language statements. In this paper, we consider three
QA-based semantic tasks - namely, QA-SRL, QANom and QADiscourse, each targeting
a certain type of predication - and propose to regard them as jointly providing
a comprehensive representation of textual information. To promote this goal, we
investigate how to best utilize the power of sequence-to-sequence (seq2seq)
pre-trained language models, within the unique setup of semi-structured
outputs, consisting of an unordered set of question-answer pairs. We examine
different input and output linearization strategies, and assess the effect of
multitask learning and of simple data augmentation techniques in the setting of
imbalanced training data. Consequently, we release the first unified QASem
parsing tool, practical for downstream applications that can benefit from an
explicit, QA-based account of information units in a text.
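One way to picture the output-linearization question: an unordered set of question-answer pairs must be serialized into a single target string for a seq2seq model. The separators and the canonical sort below are illustrative choices, not necessarily the strategies the paper settles on.

```python
def linearize(qa_pairs):
    """Serialize an unordered set of (question, answer) pairs."""
    # Sort to give the set a canonical order before joining.
    parts = [f"{q} ? {a}" for q, a in sorted(qa_pairs)]
    return " | ".join(parts)

pairs = {
    ("who ate something", "the dog"),
    ("what did someone eat", "the bone"),
}
print(linearize(pairs))
```

Because the set is unordered, choices like sorting versus random ordering change what the model is trained to emit, which is exactly why the paper compares linearization strategies.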
Design Choices for Crowdsourcing Implicit Discourse Relations: Revealing the Biases Introduced by Task Design
Disagreement in natural language annotation has mostly been studied from the perspective of biases introduced by the annotators and the annotation frameworks. Here, we propose to analyze another source of bias: task design bias, which has a particularly strong impact on crowdsourced linguistic annotations where natural language is used to elicit the interpretation of lay annotators. For this purpose we look at implicit discourse relation annotation, a task that has repeatedly been shown to be difficult due to the relations’ ambiguity. We compare the annotations of 1,200 discourse relations obtained using two distinct annotation tasks and quantify the biases of both methods across four different domains. Both methods are natural language annotation tasks designed for crowdsourcing. We show that the task design can push annotators towards certain relations and that some discourse relation senses can be better elicited with one or the other annotation approach. We also conclude that this type of bias should be taken into account when training and testing models.
Camels in a Changing Climate: Enhancing LM Adaptation with Tulu 2
Since the release of TÜLU [Wang et al., 2023b], open resources for
instruction tuning have developed quickly, from better base models to new
finetuning techniques. We test and incorporate a number of these advances into
TÜLU, resulting in TÜLU 2, a suite of improved TÜLU models for advancing
the understanding and best practices of adapting pretrained language models to
downstream tasks and user preferences. Concretely, we release: (1)
TÜLU-V2-mix, an improved collection of high-quality instruction datasets; (2)
TÜLU 2, LLAMA-2 models finetuned on the V2 mixture; (3) TÜLU 2+DPO, TÜLU
2 models trained with direct preference optimization (DPO), including the
largest DPO-trained model to date (TÜLU 2+DPO 70B); (4) CODE TÜLU 2, CODE
LLAMA models finetuned on our V2 mix that outperform CODE LLAMA and its
instruction-tuned variant, CODE LLAMA-Instruct. Our evaluation from multiple
perspectives shows that the TÜLU 2 suite achieves state-of-the-art
performance among open models and matches or exceeds the performance of
GPT-3.5-turbo-0301 on several benchmarks. We release all the checkpoints, data,
training and evaluation code to facilitate future open efforts on adapting
large language models. Comment: Technical report; fixed Zephyr number.
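A minimal sketch of the direct preference optimization (DPO) objective used for the DPO-trained models above: the loss rewards the policy for preferring the chosen response over the rejected one, relative to a frozen reference model. The log-probabilities below are toy numbers, not outputs of any of the released models.

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Per-example DPO loss from summed log-probs of each response."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1 / (1 + math.exp(-beta * margin)))  # -log sigmoid

# Policy prefers the chosen response more than the reference does:
loss = dpo_loss(pi_chosen=-10.0, pi_rejected=-14.0,
                ref_chosen=-11.0, ref_rejected=-12.0)
print(round(loss, 4))
```

When the policy and reference agree (zero margin), the loss is log 2; it shrinks as the policy's preference for the chosen response grows beyond the reference's.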
Value Kaleidoscope: Engaging AI with Pluralistic Human Values, Rights, and Duties
Human values are crucial to human decision-making. Value pluralism is the
view that multiple correct values may be held in tension with one another
(e.g., when considering lying to a friend to protect their feelings, how does
one balance honesty with friendship?). As statistical learners, AI systems fit
to averages by default, washing out these potentially irreducible value
conflicts. To improve AI systems to better reflect value pluralism, the
first-order challenge is to explore the extent to which AI systems can model
pluralistic human values, rights, and duties as well as their interaction.
We introduce ValuePrism, a large-scale dataset of 218k values, rights, and
duties connected to 31k human-written situations. ValuePrism's contextualized
values are generated by GPT-4 and deemed high-quality by human annotators 91%
of the time. We conduct a large-scale study with annotators across diverse
social and demographic backgrounds to try to understand whose values are
represented.
With ValuePrism, we build Kaleido, an open, light-weight, and structured
language-based multi-task model that generates, explains, and assesses the
relevance and valence (i.e., support or oppose) of human values, rights, and
duties within a specific context. Humans prefer the sets of values output by
our system over the teacher GPT-4, finding them more accurate and with broader
coverage. In addition, we demonstrate that Kaleido can help explain variability
in human decision-making by outputting contrasting values. Finally, we show
that Kaleido's representations transfer to other philosophical frameworks and
datasets, confirming the benefit of an explicit, modular, and interpretable
approach to value pluralism. We hope that our work will serve as a step to
making more explicit the implicit values behind human decision-making and to
steering AI systems to make decisions that are more in accordance with them.
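An illustrative sketch of the kind of structured output the abstract describes: values, rights, and duties tied to a situation, each with a relevance score and a valence. Field names and numbers here are hypothetical, not the model's actual schema.

```python
situation = "lying to a friend to protect their feelings"

kaleido_style_output = [
    {"type": "value", "text": "Honesty",
     "relevance": 0.9, "valence": "oppose"},
    {"type": "value", "text": "Friendship",
     "relevance": 0.8, "valence": "support"},
    {"type": "duty", "text": "Duty not to deceive",
     "relevance": 0.7, "valence": "oppose"},
]

# Contrasting valences surface the value conflict explicitly, which is
# how such a model can help explain variability in human decisions.
conflict = {entry["valence"] for entry in kaleido_style_output}
print(sorted(conflict))  # ['oppose', 'support']
```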
Draw Me a Flower: Grounding Formal Abstract Structures Stated in Informal Natural Language
Forming and interpreting abstraction is a core process in human
communication. In particular, when giving and performing complex instructions
stated in natural language (NL), people may naturally evoke abstract constructs
such as objects, loops, conditions and functions to convey their intentions in
an efficient and precise way. Yet, interpreting and grounding abstraction
stated in NL has not been systematically studied in NLP/AI. To elicit
naturally-occurring abstractions in NL we develop the Hexagons referential
game, where players describe increasingly complex images on a two-dimensional
Hexagons board, and other players need to follow these instructions to recreate
the images. Using this game we collected the Hexagons dataset, which consists
of 164 images and over 3000 naturally-occurring instructions, rich with diverse
abstractions. Results of our baseline models on an instruction-to-execution
task derived from the Hexagons dataset confirm that higher-level abstractions
in NL are indeed more challenging for current systems to process. Thus, this
dataset exposes a new and challenging dimension for grounded semantic parsing,
and we propose it for the community as a future benchmark to explore more
sophisticated and high-level communication within NLP applications.
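A toy sketch of the instruction-to-execution setting: a hypothetical parsed abstraction, here a loop over cells, is executed as a sequence of paint actions on a grid. The real Hexagons task grounds free-form NL instructions on a hexagonal board; this square-grid version is only illustrative.

```python
def execute_loop(row, columns, color):
    """Ground a loop-style instruction into explicit paint actions."""
    return [(row, col, color) for col in columns]

# "In the top row, paint every other cell red", grounded:
actions = execute_loop(row=0, columns=range(0, 10, 2), color="red")
print(len(actions))  # 5 paint actions
```

The difficulty the dataset exposes is upstream of this step: mapping the natural-language abstraction (loops, conditions, objects) to such an executable form in the first place.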