Generating Literal and Implied Subquestions to Fact-check Complex Claims
Verifying complex political claims is a challenging task, especially when
politicians use various tactics to subtly misrepresent the facts. Automatic
fact-checking systems fall short here, and their predictions like "half-true"
are not very useful in isolation, since we have no idea which parts of the
claim are true and which are not. In this work, we focus on decomposing a
complex claim into a comprehensive set of yes-no subquestions whose answers
influence the veracity of the claim. We present ClaimDecomp, a dataset of
decompositions for over 1000 claims. Given a claim and its verification
paragraph written by fact-checkers, our trained annotators write subquestions
covering both explicit propositions of the original claim and its implicit
facets, such as asking about additional political context that changes our view
of the claim's veracity. We study whether state-of-the-art models can generate
such subquestions, showing that these models generate reasonable questions to
ask, but predicting the comprehensive set of subquestions from the original
claim without evidence remains challenging. We further show that these
subquestions can help identify relevant evidence to fact-check the full claim
and derive the veracity through their answers, suggesting that they can be
useful pieces of a fact-checking pipeline.
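The abstract's central move, deriving a claim's veracity from the answers to its yes-no subquestions, can be sketched as a toy aggregation. The thresholds and label names below are illustrative assumptions, not the paper's actual composition method:

```python
def claim_veracity(subanswers, claim_supporting):
    """Toy veracity composition: compare each yes/no subanswer to the
    answer that would support the claim, then map the fraction of
    supporting answers onto a coarse PolitiFact-style scale.
    (Hypothetical thresholds, for illustration only.)"""
    support = sum(a == e for a, e in zip(subanswers, claim_supporting))
    frac = support / len(subanswers)
    if frac >= 0.99:
        return "true"
    if frac >= 0.5:
        return "half-true"
    if frac > 0.0:
        return "mostly-false"
    return "false"
```

A claim whose two subquestions split one supporting, one contradicting would land at "half-true", which is exactly the kind of rating the decomposition then makes interpretable.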
Using Natural Language Explanations to Rescale Human Judgments
The rise of large language models (LLMs) has brought a critical need for
high-quality human-labeled data, particularly for processes like human feedback
and evaluation. A common practice is to label data via consensus annotation
over crowdworker judgments. However, annotators' judgments for subjective tasks
can differ in many ways: they may have different qualitative judgments about an
example, and they may map those to a labeling scheme in different ways. We show
that these nuances can be captured by natural language explanations, and
propose a method to rescale ordinal annotations and explanations using LLMs.
Specifically, we feed annotators' Likert ratings and corresponding explanations
into an LLM and prompt it to produce a numeric score anchored in a scoring
rubric. These scores should reflect the annotators' underlying assessments of
the example. The rubric can be designed or modified after annotation, and
include distinctions that may not have been known when the original error
taxonomy was devised. We explore our technique in the context of rating system
outputs for a document-grounded question answering task, where LLMs achieve
near-human performance. Our method rescales the raw judgments without impacting
agreement and brings the scores closer to human judgments grounded in the same
scoring rubric.
Comment: Data available at https://github.com/ManyaWadhwa/explanation_based_rescalin
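The rescaling input described above, a Likert rating, its free-text explanation, and a post-hoc rubric, can be assembled into a single LLM prompt. This is an illustrative template, not the authors' exact prompt:

```python
def build_rescaling_prompt(rating, explanation, rubric):
    """Assemble the LLM input sketched in the abstract: the annotator's
    raw Likert rating, their natural language explanation, and a scoring
    rubric that anchors the numeric output. (Hypothetical wording.)"""
    return (
        f"Scoring rubric:\n{rubric}\n\n"
        f"Annotator's Likert rating (1-5): {rating}\n"
        f"Annotator's explanation: {explanation}\n\n"
        "Based on the explanation and the rubric, output a single "
        "numeric score from 0 to 100."
    )
```

Because the rubric is a runtime argument, it can be designed or revised after annotation, which is the property the abstract highlights.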
Building robust and modular question answering systems
Over the past few years, significant progress has been made in QA systems due to the availability of annotated datasets on a large scale and the impressive advancements in large-scale pre-trained language models. Despite these successes, the black-box nature of end-to-end trained QA systems makes them hard to interpret and control. When these systems encounter inputs that deviate from their training data distribution or are subjected to adversarial perturbations, their performance tends to deteriorate by a large margin. Furthermore, they may occasionally produce unanticipated results, potentially leading to confusion among users. Additionally, this deficiency in robustness and interpretability poses challenges when deploying such models in real-world scenarios.
In this dissertation, we aim to build robust QA systems by explicitly decomposing various QA tasks into distinct sub-modules, each responsible for a particular aspect of the overall QA process. Through this decomposition, we seek to achieve improved performance in terms of both the system's ability to handle diverse and challenging inputs (robustness) and its capacity to provide transparent and explainable reasoning (interpretability).
We argue that these sub-modules substantially improve the robustness and interpretability of QA systems. In the first half of this dissertation, we introduce three sub-modules that mitigate the dataset artifacts models learn from their training data; they also let us examine and explicitly control the intermediate outputs. In the first work, to address question answering that requires multi-hop reasoning, we propose a chain extractor, which extracts the reasoning chains a model needs to derive the final answer. These chains not only prevent the model from exploiting reasoning shortcuts but also explain how the answer is derived. In the second work, we incorporate an alignment layer between the question and the context before the answer is generated. This layer helps us interpret the model's behavior and improves robustness in adversarial settings. In the third work, we add an answer verifier that runs after the QA model generates its answer. By leveraging external NLI datasets and models, this verifier boosts QA models' prediction confidence across several different domains and helps us spot cases where a model predicts the right answer for the wrong reason.
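The third sub-module, an NLI-based answer verifier, can be sketched as follows. The hypothesis template and the injected `entail_prob` interface are assumptions for illustration; any NLI model returning an entailment probability could fill that slot:

```python
def verify_answer(question, answer, context, entail_prob):
    """Sketch of NLI-style answer verification: restate the QA pair as a
    declarative hypothesis and score it against the context (premise).
    `entail_prob(premise, hypothesis)` is an injected callable standing
    in for an NLI model that returns P(entailment)."""
    hypothesis = f"The answer to '{question}' is {answer}."
    return entail_prob(context, hypothesis)
```

A low entailment score flags answers the QA model may have produced for the wrong reason, even when the answer string itself happens to be correct.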
In the second half of this dissertation, we tackle real-world complex fact-checking by treating it as a modularized QA task. We first decompose a complex claim into several yes-no subquestions whose answers directly contribute to the veracity of the claim. Each subquestion is then fed into a commercial search engine to retrieve relevant documents. We extract the relevant snippets from the retrieved documents and use a GPT-3-based summarizer to generate the core evidence for checking the claim. We show that the decompositions play an important role in both evidence retrieval and veracity composition in an explainable fact-checking system. We also show that the GPT-3-based evidence summarizer generates faithful summaries of documents most of the time, indicating that it can be used as an effective part of the pipeline. Moreover, we annotate a dataset, ClaimDecomp, containing 1,200 complex claims and their decompositions. We believe that this dataset can further promote building explainable fact-checking systems and analyzing complex claims in the real world.
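The modular pipeline described in the second half (decompose, retrieve, summarize, answer, compose) can be sketched as an orchestration over injected components. All four callables are hypothetical interfaces standing in for the dissertation's actual modules:

```python
def fact_check(claim, decompose, search, summarize, answer):
    """Sketch of the modular fact-checking pipeline. Injected callables
    (assumed interfaces, not the actual implementation):
      decompose(claim) -> list of yes/no subquestions
      search(q)        -> retrieved documents (search engine in the text)
      summarize(docs)  -> core evidence (GPT-3-based in the text)
      answer(q, ev)    -> True if the answer supports the claim
    Returns the fraction of subquestions supporting the claim."""
    subquestions = decompose(claim)
    verdicts = []
    for q in subquestions:
        docs = search(q)
        evidence = summarize(docs)
        verdicts.append(answer(q, evidence))
    return sum(verdicts) / len(verdicts)
```

Keeping each stage behind a narrow interface is what makes the intermediate outputs inspectable, which is the interpretability benefit the dissertation argues for.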
How to Evaluate Semantic Communications for Images with ViTScore Metric?
Semantic communications (SC) are expected to be a new paradigm that catalyzes next-generation communication, shifting the main concern from accurate bit transmission to effective exchange of semantic information. However, previous widely used image metrics are not applicable for evaluating image semantic similarity in SC. Classical metrics for measuring the similarity between two images, such as PSNR and MS-SSIM, usually operate at the pixel or structural level. Directly applying deep-learning-based metrics tailored for the CV community, such as LPIPS, is likewise infeasible for SC. To tackle this, inspired by BERTScore in the NLP community, we propose a novel metric for evaluating image semantic similarity, named Vision Transformer Score (ViTScore). We prove theoretically that ViTScore has three important properties, symmetry, boundedness, and normalization, which make ViTScore convenient and intuitive for image measurement. To evaluate the performance of ViTScore, we compare it with three typical metrics (PSNR, MS-SSIM, and LPIPS) through five classes of experiments. Experimental results demonstrate that ViTScore evaluates image semantic similarity better than the other three metrics, indicating that it is an effective performance metric when deployed in SC scenarios.
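Since ViTScore is described as BERTScore's analogue over ViT patch embeddings, the general idea can be sketched as greedy soft matching of patch embeddings with cosine similarity. This is a sketch of the BERTScore-style construction, not the paper's exact definition of ViTScore:

```python
import numpy as np

def vitscore_like(emb_x, emb_y):
    """BERTScore-style F1 over patch embeddings (illustrative sketch).
    emb_x, emb_y: (num_patches, dim) arrays of ViT patch embeddings
    for the two images being compared."""
    x = emb_x / np.linalg.norm(emb_x, axis=1, keepdims=True)
    y = emb_y / np.linalg.norm(emb_y, axis=1, keepdims=True)
    sim = x @ y.T                        # pairwise cosine similarities
    recall = sim.max(axis=0).mean()      # best match for each y-patch
    precision = sim.max(axis=1).mean()   # best match for each x-patch
    return 2 * precision * recall / (precision + recall)  # harmonic mean
```

Note that this construction is symmetric in its two arguments (swapping them swaps precision and recall, leaving the harmonic mean unchanged), one of the three properties the abstract proves for ViTScore.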
Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction
Robustness to distribution changes ensures that NLP models can be successfully applied in the real world, especially for information extraction tasks. However, most prior evaluation benchmarks have been devoted to validating pairwise matching correctness, ignoring the crucial measurement of robustness. In this paper, we present the first benchmark that simulates evaluating open information extraction models in the real world, where the syntactic and expressive distributions of the same underlying knowledge may drift in various ways. We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique: a set of sentences expressing the same structured knowledge in different syntactic and expressive forms. We further elaborate a robustness metric under which a model is judged robust only if its performance is consistently accurate across entire cliques. We perform experiments on typical models published in the last decade as well as a popular large language model; the results show that existing successful models exhibit a frustrating degradation, with a maximum drop of 23.43 F1 points. Our resources and code are available at https://github.com/qijimrc/ROBUST.
Comment: Accepted by EMNLP 2023 Main Conference
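The clique-level robustness criterion, a model counts as robust on a clique only if it is accurate on every paraphrase in it, can be sketched as a simple aggregation. The exact metric in the paper is more elaborate; this illustrative version scores only all-or-nothing correctness per clique:

```python
def clique_robustness(clique_results):
    """Illustrative clique-level robustness score. `clique_results`
    maps each knowledge-invariant clique to a list of booleans: did the
    model extract the correct structured knowledge for each sentence?
    A clique is robustly handled only if every member is correct."""
    robust_flags = [all(oks) for oks in clique_results.values()]
    return sum(robust_flags) / len(robust_flags)
```

This is strictly harsher than per-sentence accuracy: a model that fails on even one paraphrase per clique can score well on pairwise matching yet near zero here, which is exactly the gap the benchmark exposes.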
VisKoP: Visual Knowledge oriented Programming for Interactive Knowledge Base Question Answering
We present the Visual Knowledge oriented Programming platform (VisKoP), a
knowledge base question answering (KBQA) system that integrates humans into the
loop to edit and debug knowledge base (KB) queries. VisKoP not only
provides a neural program induction module, which converts natural language
questions into knowledge oriented program language (KoPL), but also maps KoPL
programs into graphical elements. KoPL programs can be edited with simple
graphical operators, such as dragging to add knowledge operators and slot
filling to designate operator arguments. Moreover, VisKoP provides
auto-completion for its knowledge base schema and users can easily debug the
KoPL program by checking its intermediate results. To facilitate the practical
KBQA on a million-entity-level KB, we design a highly efficient KoPL execution
engine for the back-end. Experimental results show that VisKoP is highly efficient and that user interaction can fix a large portion of wrong KoPL programs to obtain the correct answer. The VisKoP online demo https://demoviskop.xlore.cn (stable release of this paper) and https://viskop.xlore.cn (beta release with new features), the highly efficient KoPL engine https://pypi.org/project/kopl-engine, and a screencast video https://youtu.be/zAbJtxFPTXo are now publicly available.
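The style of interaction described, operators wired together graphically, with inspectable intermediate results, suggests a program representation like the sketch below. This is a hypothetical data model for illustration, not the actual KoPL syntax or the kopl-engine API:

```python
# A knowledge-oriented program as a small operator graph (hypothetical
# representation; operator names here are only meant to evoke the style).
program = [
    {"id": 0, "op": "Find", "args": ["Barack Obama"], "inputs": []},
    {"id": 1, "op": "Relate", "args": ["spouse"], "inputs": [0]},
    {"id": 2, "op": "QueryName", "args": [], "inputs": [1]},
]

def intermediate_results(program, execute_op):
    """Run operators in order, keeping every intermediate result so a
    user can inspect and debug each step, as VisKoP's UI allows.
    `execute_op(op, args, input_results)` is an injected executor
    standing in for the real KoPL execution engine."""
    results = {}
    for node in program:
        inputs = [results[i] for i in node["inputs"]]
        results[node["id"]] = execute_op(node["op"], node["args"], inputs)
    return results
```

Keeping every node's output around is what makes slot filling and step-by-step debugging cheap: a wrong argument can be spotted at the node where the intermediate result first goes wrong.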
LabelBench: A Comprehensive Framework for Benchmarking Adaptive Label-Efficient Learning
Labeled data are critical to modern machine learning applications, but
obtaining labels can be expensive. To mitigate this cost, machine learning
methods, such as transfer learning, semi-supervised learning and active
learning, aim to be label-efficient: achieving high predictive performance from
relatively few labeled examples. While obtaining the best label-efficiency in
practice often requires combinations of these techniques, existing benchmarks
and evaluation frameworks do not capture a concerted combination of all such
techniques. This paper addresses this deficiency by introducing LabelBench, a
new computationally-efficient framework for joint evaluation of multiple
label-efficient learning techniques. As an application of LabelBench, we
introduce a novel benchmark of state-of-the-art active learning methods in
combination with semi-supervised learning for fine-tuning pretrained vision
transformers. Our benchmark demonstrates better label-efficiencies than
previously reported in active learning. LabelBench's modular codebase is
open-sourced for the broader community to contribute label-efficient learning
methods and benchmarks. The repository can be found at:
https://github.com/EfficientTraining/LabelBench
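One of the building blocks a framework like this evaluates is an active-learning acquisition step. The sketch below shows least-confidence sampling, a standard acquisition strategy offered as an illustration; it is not claimed to be one of LabelBench's specific implementations:

```python
import numpy as np

def uncertainty_sampling(probs, labeled_mask, batch_size):
    """One step of pool-based active learning via least-confidence
    acquisition. probs: (n_examples, n_classes) model predictions over
    the unlabeled pool; labeled_mask: boolean array marking points that
    already have labels. Returns indices to send for labeling."""
    confidence = probs.max(axis=1)          # top predicted probability
    confidence[labeled_mask] = np.inf       # never re-select labeled points
    return np.argsort(confidence)[:batch_size]  # least confident first
```

In the combined setting the abstract describes, the newly labeled batch would feed both supervised fine-tuning and a semi-supervised objective over the remaining pool.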