96 research outputs found
CoheSentia: A Novel Benchmark of Incremental versus Holistic Assessment of Coherence in Generated Texts
Coherence is a linguistic term that refers to the relations between small
textual units (sentences, propositions), which make the text logically
consistent and meaningful to the reader. With the advances of generative
foundation models in NLP, there is a pressing need to automatically assess
the human-perceived coherence of automatically generated texts. Until now,
little work has been done on explicitly assessing the coherence of generated
texts and analyzing the factors contributing to (in)coherence. Previous work on
the topic used other tasks, e.g., sentence reordering, as proxies for coherence,
rather than approaching coherence detection head-on. In this paper, we
introduce CoheSentia, a novel benchmark of human-perceived coherence of
automatically generated texts. Our annotation protocol reflects two
perspectives: one is global, assigning a single coherence score, and the other
is incremental, scoring sentence by sentence. The incremental method produces
an (in)coherence score for each text fragment and also pinpoints reasons for
incoherence at that point. Our benchmark contains 500 automatically generated,
human-annotated paragraphs, each annotated with both methods by multiple
raters. Our analysis shows that inter-annotator agreement in the
incremental mode is higher than in the holistic alternative, and our
experiments show that standard LMs fine-tuned for coherence detection show
varied performance across the different factors contributing to (in)coherence. All
in all, these models yield unsatisfactory performance, emphasizing the need to
develop more reliable methods for coherence assessment.
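The two annotation modes can be pictured with a small data sketch. The snippet below is illustrative only: the field names, score scale, and reason label are assumptions, not CoheSentia's actual schema.

```python
# Illustrative only: a hypothetical record for a paragraph annotated under both
# protocols. Field names and the reason label are assumptions, not the
# benchmark's actual schema.
from statistics import mean

record = {
    "paragraph_id": "p-0001",
    "sentences": [
        "The museum reopened after a long renovation.",
        "Visitors praised the new wing.",
        "Bananas are rich in potassium.",      # likely judged incoherent here
    ],
    # Holistic protocol: one score for the whole paragraph (e.g., 1-5).
    "holistic_score": 2,
    # Incremental protocol: a judgment after each added sentence, plus a
    # reason label whenever the growing prefix becomes incoherent.
    "incremental": [
        {"coherent": True,  "reason": None},
        {"coherent": True,  "reason": None},
        {"coherent": False, "reason": "irrelevant_sentence"},
    ],
}

# A simple derived signal: the fraction of sentence prefixes judged coherent.
incremental_score = mean(step["coherent"] for step in record["incremental"])
print(f"holistic={record['holistic_score']}, incremental={incremental_score:.2f}")
```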
ChatGPT and Simple Linguistic Inferences: Blind Spots and Blinds
This paper sheds light on the limitations of ChatGPT's understanding
capabilities, focusing on simple inference tasks that are typically easy for
humans but appear to be challenging for the model. Specifically, we target (i)
grammatically-specified entailments, (ii) premises with evidential adverbs of
uncertainty, and (iii) monotonicity entailments. We present expert-designed
evaluation sets for these inference types and conduct experiments in a
zero-shot setup. Our results show that the model struggles with these types of
inferences, exhibiting moderate to low accuracy. Moreover, while ChatGPT
demonstrates knowledge of the underlying linguistic concepts when prompted
directly, it often fails to incorporate this knowledge to make correct
inferences. Even more strikingly, further experiments show that embedding the
premise under presupposition triggers or non-factive verbs causes the model to
predict entailment more frequently, regardless of the correct semantic label.
Overall, these results suggest that, despite GPT's celebrated language
understanding capacity, ChatGPT has blind spots with respect to certain types of
entailment, and that certain entailment-cancelling features act as "blinds"
overshadowing the semantics of the embedded premise. Our analyses emphasize the
need for further research into the linguistic comprehension and reasoning
capabilities of LLMs in order to improve their reliability and establish
their trustworthiness for real-world applications.
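The zero-shot setup described here can be sketched as a simple prompt-and-score loop. The example below assumes a generic `ask_model(prompt) -> str` wrapper and uses illustrative items, not the paper's expert-designed evaluation sets.

```python
# A minimal sketch of a zero-shot entailment evaluation. ask_model is an
# assumed callable wrapping whichever chat model is being tested; the items
# are illustrative, not drawn from the paper's data.
def build_prompt(premise: str, hypothesis: str) -> str:
    return (
        "Premise: " + premise + "\n"
        "Hypothesis: " + hypothesis + "\n"
        "Does the premise entail the hypothesis? Answer 'yes' or 'no'."
    )

items = [
    # Premise embedded under a non-factive verb: entailment should NOT hold.
    ("John believes that Mary won the race.", "Mary won the race.", "no"),
    # Evidential adverb of uncertainty weakens the claim.
    ("Allegedly, the company hid its losses.", "The company hid its losses.", "no"),
    # A plain grammatically-specified entailment.
    ("The cat that chased the dog slept.", "The cat slept.", "yes"),
]

def evaluate(ask_model, items) -> float:
    """Return zero-shot accuracy of a model wrapper over the items."""
    correct = 0
    for premise, hypothesis, gold in items:
        answer = ask_model(build_prompt(premise, hypothesis)).strip().lower()
        predicted = "yes" if answer.startswith("yes") else "no"
        correct += int(predicted == gold)
    return correct / len(items)
```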
Morphological Inflection with Phonological Features
Recent years have brought great advances in solving morphological tasks,
mostly due to powerful neural models applied to tasks such as (re)inflection
and analysis. Yet, such morphological tasks cannot be considered solved,
especially when little training data is available or when generalizing to
previously unseen lemmas. This work explores the effects on performance of
various ways of giving morphological models access to the sub-character
phonological features that are the targets of morphological processes. We
design two methods to achieve this goal: one that leaves models as is but
manipulates the data to include features instead of characters, and another
that manipulates models to take phonological features into account when
building representations for phonemes. We elicit phonemic data from standard
graphemic data using language-specific grammars for languages with shallow
grapheme-to-phoneme mapping, and we experiment with two reinflection models
over eight languages. Our results show that our methods yield results
comparable to the grapheme-based baseline overall, with minor improvements in some
of the languages. All in all, we conclude that patterns in character
distributions are likely to allow models to infer the underlying phonological
characteristics, even when phonemes are not explicitly represented.
Comment: ACL 2023 main conference; 8 pages, 1 figure
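The data-side method (replacing characters with feature bundles before training a standard reinflection model) can be sketched as below. The toy grapheme-to-phoneme rules and feature inventory are illustrative assumptions, not the paper's language-specific grammars.

```python
# A minimal sketch of the data-manipulation method: map each grapheme to a
# bundle of phonological features. The tiny G2P table and feature inventory
# here are toy assumptions for illustration only.
TOY_G2P = {"b": "b", "p": "p", "a": "a", "i": "i"}  # shallow grapheme-to-phoneme mapping

TOY_FEATURES = {
    "b": ("consonant", "bilabial", "stop", "voiced"),
    "p": ("consonant", "bilabial", "stop", "voiceless"),
    "a": ("vowel", "low", "back"),
    "i": ("vowel", "high", "front"),
}

def to_feature_sequence(word: str) -> list[tuple[str, ...]]:
    """Map a graphemic word to a sequence of phonological feature bundles."""
    return [TOY_FEATURES[TOY_G2P[ch]] for ch in word if ch in TOY_G2P]

# 'pa' and 'ba' now differ only in the voicing feature, which a model can
# target directly instead of treating 'p' and 'b' as unrelated symbols.
print(to_feature_sequence("pa"))
print(to_feature_sequence("ba"))
```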
pyBART: Evidence-based Syntactic Transformations for IE
Syntactic dependencies can be predicted with high accuracy, and are useful
for both machine-learned and pattern-based information extraction tasks.
However, their utility can be improved. These syntactic dependencies are
designed to accurately reflect syntactic relations, not to make semantic
relations explicit. As a result, these representations lack many explicit
connections between content words that would be useful for downstream
applications. Proposals like English Enhanced UD improve the situation by
extending universal dependency trees with additional explicit arcs. However,
they are not available to Python users and are also limited in coverage. We
introduce a broad-coverage, data-driven, and linguistically sound set of
transformations that makes event structure and many lexical relations
explicit. We present pyBART, an easy-to-use open-source Python library for
converting English UD trees either to Enhanced UD graphs or to our
representation. The library can work as a standalone package or be integrated
within a spaCy NLP pipeline. When evaluated in a pattern-based relation
extraction scenario, our representation results in higher extraction scores
than Enhanced UD, while requiring fewer patterns.
Comment: Accepted ACL 2020 system demonstration paper
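The benefit of explicit arcs for pattern-based extraction can be shown with a library-agnostic sketch. The sentence, relation names, and the added "agent" arc below are illustrative assumptions, not pyBART's actual output format or API.

```python
# A minimal, library-agnostic sketch: dependency graphs as plain
# (head, relation, dependent) triples. The added "agent" arc is a hypothetical
# explicit relation, not pyBART's actual label set.
basic_arcs = {
    ("approved", "nsubj:pass", "drug"),
    ("approved", "obl", "agency"),
    ("agency", "case", "by"),
}

# An enhanced/BART-style graph adds an explicit arc for the passive agent.
enhanced_arcs = basic_arcs | {("approved", "agent", "agency")}

def matches(arcs: set, pattern: list) -> bool:
    """Return True if every triple of the pattern is present in the graph."""
    return all(triple in arcs for triple in pattern)

# Extracting who approved the drug needs a two-arc pattern over the basic tree,
# but only a single triple once the relation is made explicit.
basic_pattern = [("approved", "obl", "agency"), ("agency", "case", "by")]
enhanced_pattern = [("approved", "agent", "agency")]

print(matches(basic_arcs, basic_pattern))        # True
print(matches(enhanced_arcs, enhanced_pattern))  # True
```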
Is Probing All You Need? Indicator Tasks as an Alternative to Probing Embedding Spaces
The ability to identify and control different kinds of linguistic information
encoded in vector representations of words has many use cases, especially for
explainability and bias removal. This is usually done via a set of simple
classification tasks, termed probes, to evaluate the information encoded in the
embedding space. However, the involvement of a trainable classifier leads to
entanglement between the probe's results and the classifier's nature. As a
result, contemporary work on probing includes tasks that do not involve
training auxiliary models. In this work we introduce the term indicator
tasks for non-trainable tasks that are used to query embedding spaces for the
existence of certain properties, and we claim that such tasks may point in
the opposite direction from probes, a contradiction that complicates the
decision on whether a property exists in an embedding space. We demonstrate our
claims with two test cases, one dealing with gender debiasing and another with
the erasure of morphological information from embedding spaces. We show that
the application of a suitable indicator provides a more accurate picture of the
information captured and removed compared to probes. We thus conclude that
indicator tasks should be implemented and taken into consideration when
eliciting information from embedded representations.
Comment: Findings of EMNLP 202
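The contrast between a trained probe and a non-trainable indicator can be sketched on toy vectors. The gender-direction indicator below is an illustrative choice on synthetic data, not the paper's exact indicator or test cases.

```python
# A minimal sketch: a trained probe vs. a non-trainable indicator, on random
# toy vectors standing in for word embeddings. The direction-projection
# indicator is an illustrative assumption, not the paper's method.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
dim = 50
gender_direction = rng.normal(size=dim)
gender_direction /= np.linalg.norm(gender_direction)

# Toy "embeddings": half carry a gendered component, half do not.
n = 200
labels = np.array([1] * (n // 2) + [0] * (n // 2))
embeddings = rng.normal(size=(n, dim))
embeddings[labels == 1] += 2.0 * gender_direction

# Probe: a trained classifier, whose capacity is entangled with the result.
probe = LogisticRegression(max_iter=1000).fit(embeddings, labels)
print("probe accuracy:", probe.score(embeddings, labels))

# Indicator: a non-trainable query -- projection onto a known direction.
projections = embeddings @ gender_direction
indicator_pred = (projections > projections.mean()).astype(int)
print("indicator accuracy:", (indicator_pred == labels).mean())
```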
MRL Parsing Without Tears: The Case of Hebrew
Syntactic parsing remains a critical tool for relation extraction and
information extraction, especially in resource-scarce languages where LLMs are
lacking. Yet in morphologically rich languages (MRLs), where parsers need to
identify multiple lexical units in each token, existing systems suffer from high
latency and setup complexity. Some use a pipeline to peel away the layers:
first segmentation, then morphology tagging, and then syntax parsing; however,
errors in earlier layers are then propagated forward. Others use a joint
architecture to evaluate all permutations at once; while this improves
accuracy, it is notoriously slow. In contrast, and taking Hebrew as a test
case, we present a new "flipped pipeline": decisions are made directly on the
whole-token units by expert classifiers, each one dedicated to one specific
task. The classifiers are independent of one another, and only at the end do we
synthesize their predictions. This blazingly fast approach sets a new SOTA in
Hebrew POS tagging and dependency parsing, while also reaching near-SOTA
performance on other Hebrew NLP tasks. Because our architecture does not rely
on any language-specific resources, it can serve as a model to develop similar
parsers for other MRLs.
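The "flipped pipeline" idea can be sketched as independent whole-token experts whose predictions are synthesized only at the end. The expert stubs, label sets, and synthesis step below are stand-ins, not the actual system's models.

```python
# A minimal sketch of a flipped pipeline: independent expert classifiers make
# whole-token decisions, and their outputs are combined only at the end.
# The experts here are stubs; real models and label sets are not reproduced.
from typing import Callable

def pos_expert(tokens: list[str]) -> list[str]:
    return ["NOUN" for _ in tokens]          # stub POS tagger

def segmentation_expert(tokens: list[str]) -> list[list[str]]:
    return [[tok] for tok in tokens]         # stub: no splitting

def head_expert(tokens: list[str]) -> list[int]:
    return [0 for _ in tokens]               # stub: attach everything to token 0

EXPERTS: dict[str, Callable] = {
    "pos": pos_expert,
    "segments": segmentation_expert,
    "head": head_expert,
}

def parse(tokens: list[str]) -> list[dict]:
    # Experts run independently of one another (no error propagation between them)...
    predictions = {name: expert(tokens) for name, expert in EXPERTS.items()}
    # ...and are synthesized into one record per whole token only at the end.
    return [
        {"token": tok, **{name: predictions[name][i] for name in EXPERTS}}
        for i, tok in enumerate(tokens)
    ]

print(parse(["bbyt", "hlbn"]))  # toy transliterated whole tokens
```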
Apollo: Zero-shot MultiModal Reasoning with Multiple Experts
We propose a modular framework that leverages the expertise of different
foundation models over different modalities and domains in order to perform a
single, complex, multi-modal task, without relying on prompt engineering or
otherwise tailor-made multi-modal training. Our approach enables decentralized
command execution and allows each model to both contribute and benefit from the
expertise of the other models. Our method can be extended to a variety of
foundation models (including audio and vision models), above and beyond language
models alone, as it does not depend on prompts. We demonstrate our approach on two
tasks. On the well-known task of stylized image captioning, our experiments
show that our approach outperforms semi-supervised state-of-the-art models,
while being zero-shot and avoiding costly training, data collection, and prompt
engineering. We further demonstrate this method on a novel task, audio-aware
image captioning, in which an image and audio are given and the task is to
generate text that describes the image within the context of the provided
audio. Our code is available on GitHub.
Comment: GitHub: https://github.com/danielabd/Apollo-Ca
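The general idea of fusing several experts' judgments without prompts can be sketched as a weighted combination of scores over candidate captions. The experts, weights, and candidates below are illustrative stand-ins, not Apollo's actual components.

```python
# A minimal sketch of prompt-free expert fusion: each expert scores a candidate
# caption, and the weighted sum picks the best one. All experts below are toy
# stand-ins (e.g., tag overlap in place of a real image- or audio-text model).
def language_expert(candidate: str) -> float:
    """Stand-in for a language model's fluency/log-probability score."""
    return -len(candidate.split())                      # toy proxy

def vision_expert(candidate: str, image_tags: set[str]) -> float:
    """Stand-in for an image-text matching score."""
    return sum(word in image_tags for word in candidate.split())

def audio_expert(candidate: str, audio_tags: set[str]) -> float:
    """Stand-in for an audio-text matching score."""
    return sum(word in audio_tags for word in candidate.split())

def fuse(candidate: str, image_tags: set[str], audio_tags: set[str],
         weights: tuple = (1.0, 2.0, 1.0)) -> float:
    scores = (
        language_expert(candidate),
        vision_expert(candidate, image_tags),
        audio_expert(candidate, audio_tags),
    )
    return sum(w * s for w, s in zip(weights, scores))

candidates = ["a dog barks in the park", "a cat sleeps on a sofa"]
image_tags, audio_tags = {"dog", "park"}, {"barks"}
best = max(candidates, key=lambda c: fuse(c, image_tags, audio_tags))
print(best)
```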