Knowledge-driven Natural Language Understanding of English Text and its Applications
Understanding the meaning of a text is a fundamental challenge of natural
language understanding (NLU) research. An ideal NLU system should process a
language in a way that is not exclusive to a single task or dataset. With this in mind, we introduce a novel knowledge-driven semantic representation approach for English text. By leveraging the VerbNet lexicon, we map the syntax tree of a text to its commonsense meaning, represented using basic knowledge primitives. The general-purpose knowledge produced by our approach can be used to build any reasoning-based NLU system that can also provide justifications. We applied this approach to construct two NLU applications that we present here: SQuARE (Semantic-based Question Answering and Reasoning Engine) and StaCACK (Stateful Conversational Agent using Commonsense Knowledge). Both systems work by "truly understanding" the natural language text they process, and both provide natural language explanations for their responses while maintaining high accuracy.
Comment: Preprint. Accepted by the 35th AAAI Conference (AAAI-21) Main Track.
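A minimal sketch of the style of mapping described above, with a hand-coded VerbNet-style frame and hypothetical knowledge primitives (the lexicon entry, role names, and predicates are illustrative stand-ins, not the authors' actual representation):

```python
# Illustrative sketch: mapping a parsed clause to commonsense knowledge
# primitives via a VerbNet-style frame. The frame and primitive names are
# hypothetical stand-ins for the paper's actual representation.

# Toy lexicon entry: verb -> thematic roles and a schema of primitives
VERB_FRAMES = {
    "give": {
        "roles": ["Agent", "Theme", "Recipient"],
        "primitives": [
            ("transfer", "Agent", "Theme", "Recipient"),
            ("has_possession", "Recipient", "Theme"),
        ],
    },
}

def clause_to_primitives(verb, args):
    """Map a (verb, argument list) clause to grounded knowledge primitives."""
    frame = VERB_FRAMES[verb]
    bindings = dict(zip(frame["roles"], args))
    return [(name, tuple(bindings[s] for s in slots))
            for name, *slots in frame["primitives"]]

# "Mary gave John a book" -> facts a reasoner could query and justify
print(clause_to_primitives("give", ["Mary", "book", "John"]))
# [('transfer', ('Mary', 'book', 'John')), ('has_possession', ('John', 'book'))]
```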
Schema2QA: High-Quality and Low-Cost Q&A Agents for the Structured Web
Building a question-answering agent currently requires large annotated
datasets, which are prohibitively expensive. This paper proposes Schema2QA, an
open-source toolkit that can generate a Q&A system from a database schema
augmented with a few annotations for each field. The key concept is to cover
the space of possible compound queries on the database with a large number of
in-domain questions synthesized with the help of a corpus of generic query
templates. The synthesized data and a small paraphrase set are used to train a
novel neural network based on the BERT pretrained model. We use Schema2QA to
generate Q&A systems for five Schema.org domains (restaurants, people, movies, books, and music), and obtain an overall accuracy between 64% and 75% on
crowdsourced questions for these domains. Once annotations and paraphrases are
obtained for a Schema.org schema, no additional manual effort is needed to
create a Q&A agent for any website that uses the same schema. Furthermore, we
demonstrate that learning can be transferred from the restaurant to the hotel
domain, obtaining a 64% accuracy on crowdsourced questions with no manual
effort. Schema2QA achieves an accuracy of 60% on popular restaurant questions
that can be answered using Schema.org. Its performance is comparable to Google
Assistant, 7% lower than Siri, and 15% higher than Alexa. It outperforms all
these assistants by at least 18% on more complex, long-tail questions.
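A minimal sketch of the template-driven synthesis idea, assuming invented field annotations and generic templates (the real toolkit's template corpus and neural semantic parser are far richer):

```python
# Illustrative sketch: synthesize in-domain training questions by crossing
# schema field annotations with generic query templates, in the spirit of
# Schema2QA. Annotations, templates, and values are hypothetical.

SCHEMA = {
    "Restaurant": {
        "servesCuisine": ["cuisine", "type of food"],  # annotations per field
        "priceRange": ["price range", "cost"],
    },
}

TEMPLATES = [
    "show me {domain}s whose {field} is {value}",
    "which {domain} has a {field} of {value}?",
]

def synthesize(domain, values):
    """Cross annotated fields with generic templates to get in-domain questions."""
    questions = []
    for field, phrases in SCHEMA[domain].items():
        for phrase in phrases:
            for value in values.get(field, []):
                for template in TEMPLATES:
                    questions.append(template.format(
                        domain=domain.lower(), field=phrase, value=value))
    return questions

for q in synthesize("Restaurant", {"servesCuisine": ["italian"]}):
    print(q)
# show me restaurants whose cuisine is italian
# which restaurant has a cuisine of italian?  ...
```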
Extrinsic Evaluation of Machine Translation Metrics
Automatic machine translation (MT) metrics are widely used to distinguish the
translation qualities of machine translation systems across relatively large
test sets (system-level evaluation). However, it is unclear if automatic
metrics are reliable at distinguishing good translations from bad translations
at the sentence level (segment-level evaluation). In this paper, we investigate
how useful MT metrics are at detecting the success of a machine translation
component when placed in a larger platform with a downstream task. We evaluate
the segment-level performance of the most widely used MT metrics (chrF, COMET,
BERTScore, etc.) on three downstream cross-lingual tasks (dialogue state
tracking, question answering, and semantic parsing). For each task, we only
have access to a monolingual task-specific model. We calculate the correlation between the metric's ability to predict a good/bad translation and success/failure on the final task in the Translate-Test setup. Our experiments
demonstrate that all metrics exhibit negligible correlation with the extrinsic
evaluation of the downstream outcomes. We also find that the scores provided by
neural metrics are not interpretable mostly because of undefined ranges. We
synthesise our analysis into recommendations for future MT metrics to produce
labels rather than scores for more informative interaction between machine
translation and multilingual language understanding.
Comment: ACL 2023 Camera Ready.
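A minimal sketch of the kind of segment-level analysis the paper describes: correlating a metric's scores with binary downstream success under Translate-Test. All scores and outcomes below are invented for illustration:

```python
# Illustrative sketch: correlate segment-level MT metric scores with binary
# downstream success (e.g., whether a semantic parser produced the correct
# parse from the translated input). All values here are invented.
import numpy as np
from scipy.stats import pointbiserialr

# chrF-style segment scores for eight translated inputs (hypothetical)
metric_scores = np.array([62.1, 48.3, 71.5, 55.0, 80.2, 40.7, 66.4, 58.9])

# 1 if the downstream task succeeded on that segment, 0 otherwise (hypothetical)
task_success = np.array([1, 0, 1, 1, 1, 0, 0, 1])

# Point-biserial correlation between a continuous score and a binary outcome
r, p = pointbiserialr(task_success, metric_scores)
print(f"point-biserial r = {r:.3f} (p = {p:.3f})")
# A value of r near zero would mirror the paper's negative finding.
```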
Inference-Time Intervention: Eliciting Truthful Answers from a Language Model
We introduce Inference-Time Intervention (ITI), a technique designed to
enhance the truthfulness of large language models (LLMs). ITI operates by
shifting model activations during inference, following a set of directions
across a limited number of attention heads. This intervention significantly
improves the performance of LLaMA models on the TruthfulQA benchmark. On an
instruction-finetuned LLaMA called Alpaca, ITI improves its truthfulness from
32.5% to 65.1%. We identify a tradeoff between truthfulness and helpfulness and
demonstrate how to balance it by tuning the intervention strength. ITI is
minimally invasive and computationally inexpensive. Moreover, the technique is
data efficient: while approaches like RLHF require extensive annotations, ITI
locates truthful directions using only a few hundred examples. Our findings
suggest that LLMs may have an internal representation of the likelihood of
something being true, even as they produce falsehoods on the surface.
Comment: code: https://github.com/likenneth/honest_llama
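A minimal sketch of the intervention mechanism as described: adding a fixed direction, scaled by an intervention strength, to the activations of selected attention heads during the forward pass. Module layout, head indices, and directions are hypothetical; the authors' repository contains the actual implementation:

```python
# Illustrative sketch of inference-time intervention: shift the concatenated
# per-head activations of chosen attention heads along fixed "truthful"
# directions. Head indices and directions here are hypothetical.
import torch

def make_iti_hook(directions, alpha, head_dim):
    """directions: {head_index: unit vector of length head_dim};
    alpha: intervention strength."""
    def hook(module, inputs, output):
        shifted = output.clone()
        for h, d in directions.items():
            lo, hi = h * head_dim, (h + 1) * head_dim  # head h's channel slice
            shifted[..., lo:hi] += alpha * d
        return shifted
    return hook

# Demo on a dummy activation tensor: 2 heads of dimension 3, zero activations
head_dim = 3
acts = torch.zeros(1, 4, 2 * head_dim)  # (batch, seq, heads * head_dim)
direction = torch.tensor([1.0, 0.0, 0.0])  # hypothetical truthful direction
hook = make_iti_hook({1: direction}, alpha=5.0, head_dim=head_dim)
print(hook(None, None, acts)[0, 0])  # head 1's first channel shifted by 5.0
```

In a real model the hook would be registered on the module emitting the concatenated head outputs (e.g. via register_forward_hook), with directions estimated from probes trained on a few hundred labeled examples.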
Red-Teaming Large Language Models using Chain of Utterances for Safety-Alignment
Large language models (LLMs) have taken the world by storm with their massive multi-tasking capabilities, achieved simply by optimizing over a next-word prediction objective. With their emergent capabilities and encoded knowledge, however, comes an increased risk of LLMs producing harmful outputs, making them unfit for scalable deployment to the public. In this work, we propose a new
safety evaluation benchmark RED-EVAL that carries out red-teaming. We show that
even widely deployed models are susceptible to Chain of Utterances-based (CoU) prompting, which jailbreaks closed-source LLM-based systems such as GPT-4 and ChatGPT into responding unethically to more than 65% and 73% of harmful queries, respectively. We also demonstrate the consistency of RED-EVAL across 8 open-source LLMs, which generate harmful responses in more than 86% of red-teaming attempts. Next, we propose RED-INSTRUCT, an approach for the safety alignment of LLMs. It
constitutes two phases: 1) HARMFULQA data collection: Leveraging CoU prompting,
we collect a dataset that consists of 1.9K harmful questions covering a wide
range of topics, 9.5K safe and 7.3K harmful conversations from ChatGPT; 2)
SAFE-ALIGN: We demonstrate how the conversational dataset can be used for the
safety alignment of LLMs by minimizing the negative log-likelihood of helpful responses and penalizing harmful responses via gradient ascent on the sample loss. Our model STARLING, a fine-tuned Vicuna-7B, is observed to be more safely aligned when evaluated on RED-EVAL and the HHH benchmark, while preserving the utility of the baseline models (TruthfulQA, MMLU, and BBH).
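A minimal sketch of the SAFE-ALIGN training signal as stated above: standard negative log-likelihood on helpful responses, plus a term that ascends the loss on harmful responses. The penalty weight is a hypothetical knob; the paper's exact schedule and phasing differ:

```python
# Illustrative sketch: descend on helpful responses, ascend on harmful ones.
# The penalty weight is a hypothetical setting, not the paper's.
import torch
import torch.nn.functional as F

def safe_align_loss(logits_helpful, labels_helpful,
                    logits_harmful, labels_harmful, penalty_weight=0.5):
    vocab = logits_helpful.size(-1)
    # Minimize NLL of helpful responses (usual token-level cross-entropy)
    nll_helpful = F.cross_entropy(
        logits_helpful.view(-1, vocab), labels_helpful.view(-1))
    # Penalize harmful responses: negating their NLL turns gradient descent
    # into gradient ascent on the harmful-sample loss
    nll_harmful = F.cross_entropy(
        logits_harmful.view(-1, vocab), labels_harmful.view(-1))
    return nll_helpful - penalty_weight * nll_harmful

# Dummy batch: 2 sequences of 4 tokens over a 10-token vocabulary
lh, yh = torch.randn(2, 4, 10), torch.randint(0, 10, (2, 4))
lr, yr = torch.randn(2, 4, 10), torch.randint(0, 10, (2, 4))
print(safe_align_loss(lh, yh, lr, yr))  # scalar training loss
```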
A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity
This paper proposes a framework for quantitatively evaluating interactive
LLMs such as ChatGPT using publicly available data sets. We carry out an
extensive technical evaluation of ChatGPT using 23 data sets covering 8
different common NLP application tasks. We evaluate the multitask, multilingual
and multi-modal aspects of ChatGPT based on these data sets and a newly
designed multimodal dataset. We find that ChatGPT outperforms LLMs with
zero-shot learning on most tasks and even outperforms fine-tuned models on some
tasks. We find that it is better at understanding non-Latin script languages
than generating them. It is able to generate multimodal content from textual
prompts, via an intermediate code generation step. Moreover, we find that
ChatGPT is 63.41% accurate on average in 10 different reasoning categories
under logical reasoning, non-textual reasoning, and commonsense reasoning,
hence making it an unreliable reasoner. It is, for example, better at deductive
than inductive reasoning. ChatGPT suffers from hallucination problems like
other LLMs and it generates more extrinsic hallucinations from its parametric
memory as it does not have access to an external knowledge base. Finally, the
interactive feature of ChatGPT enables human collaboration with the underlying
LLM to improve its performance, i.e., an 8% ROUGE-1 gain on summarization and a 2% ChrF++ gain on machine translation, in a multi-turn "prompt engineering" fashion. We also release the codebase for evaluation set extraction.
Comment: 45 pages, AACL 2023.