ContraQA: Question Answering under Contradicting Contexts
With a rise in false, inaccurate, and misleading information in propaganda,
news, and social media, real-world Question Answering (QA) systems face the
challenges of synthesizing and reasoning over contradicting information to
derive correct answers. This urgency gives rise to the need to make QA systems
robust to misinformation, a topic previously unexplored. We study the risk of
misinformation to QA models by investigating the behavior of QA models under
contradicting contexts that are mixed with both real and fake information. We
create the first large-scale dataset for this problem, namely ContraQA, which
contains over 10K human-written and model-generated contradicting pairs of
contexts. Experiments show that QA models are vulnerable under contradicting
contexts brought by misinformation. To defend against such a threat, we build a
misinformation-aware QA system as a counter-measure that integrates question
answering and misinformation detection in a joint fashion.
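The joint QA-plus-detection system is described only at a high level above. As a rough, hedged illustration (not the paper's implementation), a two-stage pipeline that screens each context passage with a misinformation classifier before answering could look like the sketch below; the detector checkpoint name and the threshold are placeholders.

```python
# Hypothetical sketch, not the paper's code: screen contexts with a misinformation
# detector, then answer the question over the passages that survive the screen.
from transformers import pipeline

detector = pipeline("text-classification", model="misinfo-detector")            # placeholder name
qa_model = pipeline("question-answering", model="deepset/roberta-base-squad2")  # any extractive QA model

def misinformation_aware_qa(question: str, contexts: list[str], threshold: float = 0.5) -> str:
    trusted = []
    for ctx in contexts:
        pred = detector(ctx)[0]  # e.g. {"label": "FAKE", "score": 0.93}
        if not (pred["label"] == "FAKE" and pred["score"] > threshold):
            trusted.append(ctx)
    merged = " ".join(trusted or contexts)  # fall back to all contexts if everything is flagged
    return qa_model(question=question, context=merged)["answer"]
```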
Llamas Know What GPTs Don't Show: Surrogate Models for Confidence Estimation
To maintain user trust, large language models (LLMs) should signal low
confidence on examples where they are incorrect, instead of misleading the
user. The standard approach of estimating confidence is to use the softmax
probabilities of these models, but as of November 2023, state-of-the-art LLMs
such as GPT-4 and Claude-v1.3 do not provide access to these probabilities. We
first study eliciting confidence linguistically -- asking an LLM for its
confidence in its answer -- which performs reasonably (80.5% AUC on GPT-4
averaged across 12 question-answering datasets -- 7% above a random baseline)
but leaves room for improvement. We then explore using a surrogate confidence
model -- using a model for which we do have output probabilities to evaluate the
original model's confidence on a given question. Surprisingly, even though these
probabilities come from a different and often weaker model, this method leads
to higher AUC than linguistic confidences on 9 out of 12 datasets. Our best
method, composing linguistic confidences and surrogate model probabilities, gives
state-of-the-art confidence estimates on all 12 datasets (84.6% average AUC on
GPT-4).
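As a hedged illustration of the composition idea (not the authors' exact recipe), the snippet below combines a linguistically elicited confidence with a surrogate model's probability by simple averaging and evaluates the result with AUC against correctness labels; the weighting scheme and the toy numbers are assumptions for illustration only.

```python
# Illustrative sketch: compose linguistic and surrogate confidences, score with AUC.
import numpy as np
from sklearn.metrics import roc_auc_score

def composed_confidence(linguistic, surrogate, alpha: float = 0.5):
    """Convex combination of the two signals; the paper's exact composition may differ."""
    return alpha * linguistic + (1 - alpha) * surrogate

correct    = np.array([1, 0, 1, 1, 0])              # 1 = main LLM answered correctly
linguistic = np.array([0.9, 0.8, 0.7, 0.95, 0.6])   # confidence stated by the main LLM
surrogate  = np.array([0.85, 0.4, 0.75, 0.9, 0.3])  # softmax prob. from a weaker open model
scores = composed_confidence(linguistic, surrogate) # elementwise over the arrays
print("AUC:", roc_auc_score(correct, scores))       # higher AUC = better failure separation
```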
A Call to Reflect on Evaluation Practices for Failure Detection in Image Classification
Reliable application of machine learning-based decision systems in the wild
is one of the major challenges currently investigated by the field. A large
portion of established approaches aims to detect erroneous predictions by means
of assigning confidence scores. This confidence may be obtained by either
quantifying the model's predictive uncertainty, learning explicit scoring
functions, or assessing whether the input is in line with the training
distribution. Curiously, while these approaches all claim to address the same
eventual goal of detecting failures of a classifier upon real-life application,
they currently constitute largely separate research fields with individual
evaluation protocols, which either exclude a substantial part of relevant
methods or ignore large parts of relevant failure sources. In this work, we
systematically reveal current pitfalls caused by these inconsistencies and
derive requirements for a holistic and realistic evaluation of failure
detection. To demonstrate the relevance of this unified perspective, we present
a large-scale empirical study that, for the first time, enables benchmarking of
confidence scoring functions w.r.t. all relevant methods and failure sources.
The finding that a simple softmax response baseline is the overall best
performing method underlines the drastic shortcomings of current evaluation amid
the abundance of published research on confidence scoring. Code and trained
models are at https://github.com/IML-DKFZ/fd-shifts.
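For reference, the softmax response baseline named above simply uses the maximum predicted class probability as the confidence score. A minimal sketch of that scoring function (the example logits are made up) is:

```python
# Minimal sketch of the maximum softmax response confidence baseline.
import torch
import torch.nn.functional as F

def softmax_response(logits: torch.Tensor) -> torch.Tensor:
    """Confidence = highest class probability; low values flag likely failures."""
    return F.softmax(logits, dim=-1).max(dim=-1).values

logits = torch.tensor([[2.0, 0.1, -1.0],   # confident prediction
                       [0.2, 0.1, 0.15]])  # ambiguous prediction -> likely failure
print(softmax_response(logits))            # ~tensor([0.834, 0.350]); threshold to defer uncertain cases
```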
ASPEST: Bridging the Gap Between Active Learning and Selective Prediction
Selective prediction aims to learn a reliable model that abstains from making
predictions when the model uncertainty is high. These predictions can then be
deferred to a human expert for further evaluation. In many real-world
scenarios, however, the distribution of test data is different from the
training data. This results in more inaccurate predictions, necessitating
increased human labeling, which is difficult and expensive in many scenarios.
Active learning circumvents this difficulty by only querying the most
informative examples and, in several cases, has been shown to lower the overall
labeling effort. In this work, we bridge the gap between selective prediction
and active learning, proposing a new learning paradigm called active selective
prediction which learns to query more informative samples from the shifted
target domain while increasing accuracy and coverage. For this new problem, we
propose a simple but effective solution, ASPEST, that trains ensembles of model
snapshots using self-training with their aggregated outputs as pseudo labels.
Extensive experiments on several image, text and structured datasets with
domain shifts demonstrate that active selective prediction can significantly
outperform prior work on selective prediction and active learning (e.g. on the
MNIST→SVHN benchmark with a labeling budget of 100, ASPEST improves the
AUC metric from 79.36% to 88.84%) and achieves better utilization of humans in
the loop.
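To make the active selective prediction loop above concrete, the following is a rough sketch under assumed interfaces (sklearn-style snapshot models exposing predict_proba, a max-probability confidence rule); ASPEST's actual selection and self-training details differ, and all names here are illustrative.

```python
# Hedged sketch of one active selective prediction round; not ASPEST's exact procedure.
import numpy as np

def ensemble_probs(snapshots, x):
    """Average predicted class probabilities over an ensemble of model snapshots."""
    return np.mean([m.predict_proba(x) for m in snapshots], axis=0)

def active_selective_round(snapshots, x_target, label_budget, conf_threshold=0.9):
    probs = ensemble_probs(snapshots, x_target)
    confidence = probs.max(axis=1)
    # Active learning: send the least confident target points to the human expert.
    query_idx = np.argsort(confidence)[:label_budget]
    # Self-training: keep confident aggregated outputs as pseudo-labels for retraining.
    pseudo_idx = np.where(confidence >= conf_threshold)[0]
    pseudo_labels = probs[pseudo_idx].argmax(axis=1)
    # Selective prediction: accept confident predictions, abstain (defer) on the rest.
    accept_mask = confidence >= conf_threshold
    return query_idx, (pseudo_idx, pseudo_labels), accept_mask
```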
Trustworthy LLMs: a Survey and Guideline for Evaluating Large Language Models' Alignment
Ensuring alignment, which refers to making models behave in accordance with
human intentions [1,2], has become a critical task before deploying large
language models (LLMs) in real-world applications. For instance, OpenAI devoted
six months to iteratively aligning GPT-4 before its release [3]. However, a
major challenge faced by practitioners is the lack of clear guidance on
evaluating whether LLM outputs align with social norms, values, and
regulations. This obstacle hinders systematic iteration and deployment of LLMs.
To address this issue, this paper presents a comprehensive survey of key
dimensions that are crucial to consider when assessing LLM trustworthiness. The
survey covers seven major categories of LLM trustworthiness: reliability,
safety, fairness, resistance to misuse, explainability and reasoning, adherence
to social norms, and robustness. Each major category is further divided into
several sub-categories, resulting in a total of 29 sub-categories.
Additionally, a subset of 8 sub-categories is selected for further
investigation, where corresponding measurement studies are designed and
conducted on several widely-used LLMs. The measurement results indicate that,
in general, more aligned models tend to perform better in terms of overall
trustworthiness. However, the effectiveness of alignment varies across the
different trustworthiness categories considered. This highlights the importance
of more fine-grained analysis, testing, and continuous improvement of LLM
alignment. By shedding light on these key dimensions of LLM
trustworthiness, this paper aims to provide valuable insights and guidance to
practitioners in the field. Understanding and addressing these concerns will be
crucial in achieving reliable and ethically sound deployment of LLMs in various
applications.