Logic Against Bias: Textual Entailment Mitigates Stereotypical Sentence Reasoning
Due to their similarity-based learning objectives, pretrained sentence
encoders often internalize stereotypical assumptions that reflect the social
biases that exist within their training corpora. In this paper, we describe
several kinds of stereotypes concerning different communities that are present
in popular sentence representation models, including pretrained next sentence
prediction and contrastive sentence representation models. We compare such
models to textual entailment models that learn language logic for a variety of
downstream language understanding tasks. By comparing strong pretrained models
based on text similarity with textual entailment learning, we conclude that
explicit logic learning with textual entailment can significantly reduce bias
and improve the recognition of social communities without an explicit
de-biasing process.
Comment: Accepted by EACL 2023
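As a concrete illustration of the comparison described above, the following
sketch scores the same premise-hypothesis pair with a similarity-based sentence
encoder and with an off-the-shelf NLI (textual entailment) model. The
checkpoints (all-MiniLM-L6-v2, roberta-large-mnli) and the example sentences
are assumptions chosen for the sketch, not the paper's experimental setup.

```python
# Hedged sketch: similarity-based scoring vs. entailment-based scoring.
# Checkpoints and sentences are illustrative choices, not the paper's setup.
import torch
from sentence_transformers import SentenceTransformer, util
from transformers import AutoModelForSequenceClassification, AutoTokenizer

premise = "My neighbor is a nurse."
hypotheses = ["My neighbor is a woman.", "My neighbor is a man."]

# Similarity-based model: ranks hypotheses by cosine similarity, which can
# mirror co-occurrence statistics (and hence stereotypes) in the training data.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
emb = encoder.encode([premise] + hypotheses, convert_to_tensor=True)
for hyp, score in zip(hypotheses, util.cos_sim(emb[0], emb[1:])[0]):
    print(f"similarity  {hyp!r}: {float(score):.3f}")

# Entailment model: asks whether the premise logically entails the hypothesis.
tok = AutoTokenizer.from_pretrained("roberta-large-mnli")
nli = AutoModelForSequenceClassification.from_pretrained("roberta-large-mnli")
for hyp in hypotheses:
    logits = nli(**tok(premise, hyp, return_tensors="pt")).logits
    p_entail = torch.softmax(logits, dim=-1)[0, 2]  # labels: contra/neutral/entail
    print(f"entailment  {hyp!r}: {float(p_entail):.3f}")
```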
Entailment as Robust Self-Learner
Entailment has been recognized as an important metric for evaluating natural
language understanding (NLU) models, and recent studies have found that
entailment pretraining benefits weakly supervised fine-tuning. In this work, we
design a prompting strategy that formulates a number of different NLU tasks as
contextual entailment. This approach improves the zero-shot adaptation of
pretrained entailment models. We further observe that self-training
entailment-based models with unlabeled data can significantly improve the
adaptation performance on downstream tasks. To achieve more stable improvement,
we propose the Simple Pseudo-Label Editing (SimPLE) algorithm for better
pseudo-labeling quality in self-training. We also find that both pretrained
entailment-based models and the self-trained models are robust against
adversarial evaluation data. Experiments on binary and multi-class
classification tasks show that SimPLE leads to more robust self-training
results, indicating that the self-trained entailment models are more efficient
and trustworthy than large language models on language understanding tasks.
Comment: Accepted by ACL 2023 main conference
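To make the entailment-as-classifier formulation concrete, here is a minimal
sketch that treats each class label as a hypothesis and keeps only confident
predictions as pseudo-labels for self-training. The hypothesis template,
threshold, and simple confidence filter are illustrative assumptions; they
stand in for, rather than reproduce, the SimPLE editing procedure.

```python
# Minimal sketch of entailment-based pseudo-labeling for self-training.
# The confidence filter below is a generic stand-in, NOT the SimPLE algorithm.
from transformers import pipeline

nli = pipeline("zero-shot-classification", model="roberta-large-mnli")

unlabeled = [
    "The battery died after two days.",
    "Best purchase I have made all year.",
]
labels = ["positive", "negative"]
template = "The sentiment of this review is {}."  # assumed hypothesis template

pseudo_labeled = []
for text in unlabeled:
    out = nli(text, candidate_labels=labels, hypothesis_template=template)
    label, score = out["labels"][0], out["scores"][0]
    if score > 0.9:  # keep only high-confidence pseudo-labels
        pseudo_labeled.append((text, label))

print(pseudo_labeled)  # these pairs would feed a standard fine-tuning loop
```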
Cooperative Learning of Zero-Shot Machine Reading Comprehension
Pretrained language models have significantly improved the performance of
downstream language understanding tasks, including extractive question
answering, by providing high-quality contextualized word embeddings. However,
training question answering models still requires large-scale data annotation in
specific domains. In this work, we propose a cooperative, self-play learning
framework, REGEX, for question generation and answering. REGEX is built upon a
masked answer extraction task with an interactive learning environment
containing an answer entity REcognizer, a question Generator, and an answer
EXtractor. Given a passage with a masked entity, the generator generates a
question around the entity, and the extractor is trained to recover the masked
entity from the generated question and the raw text. The framework allows the
training of question generation and answering models on any text corpora
without annotation. We further leverage a reinforcement learning technique to
reward the generation of high-quality questions and to improve the answer
extraction model's performance. Experiments show that REGEX outperforms the
state-of-the-art (SOTA) pretrained language models and zero-shot approaches on
standard question-answering benchmarks, and yields new SOTA performance in the
zero-shot setting.
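The cooperative loop can be pictured with the following toy sketch, in which
the recognizer, generator, and extractor are trivial rule-based stand-ins for
the paper's trained models and the reward is an exact match on the masked
entity.

```python
# Toy sketch of a REGEX-style self-play round: mask an entity, generate a
# question about it, extract the answer, and reward an exact-match recovery.
# All three components are rule-based stand-ins, not the paper's models.
import re

def recognize_entity(passage: str) -> str:
    # stand-in recognizer: first capitalized token
    match = re.search(r"\b[A-Z][a-z]+\b", passage)
    return match.group(0) if match else ""

def generate_question(masked_passage: str) -> str:
    # stand-in generator: template question about the masked slot
    return f"Which entity fills [MASK] in: '{masked_passage}'?"

def extract_answer(passage: str, masked_passage: str) -> str:
    # stand-in extractor: return the passage token missing from the masked text
    for token in passage.split():
        if token.strip(".,") not in masked_passage:
            return token.strip(".,")
    return ""

passage = "Marie Curie won the Nobel Prize in Physics in 1903."
entity = recognize_entity(passage)
masked = passage.replace(entity, "[MASK]", 1)
question = generate_question(masked)
prediction = extract_answer(passage, masked)
reward = 1.0 if prediction == entity else 0.0  # RL signal for both models
print(question, "->", prediction, "| reward:", reward)
```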
Listen, Think, and Understand
The ability of artificial intelligence (AI) systems to perceive and
comprehend audio signals is crucial for many applications. Although significant
progress has been made in this area since the development of AudioSet, most
existing models are designed to map audio inputs to pre-defined, discrete sound
label sets. In contrast, humans possess the ability to not only classify sounds
into general categories, but also to listen to the finer details of the sounds,
explain the reason for the predictions, think about what the sound implies,
understand the scene and what action needs to be taken, if any. Such
capabilities beyond perception are not yet present in existing audio models. On
the other hand, modern large language models (LLMs) exhibit emerging reasoning
ability but they lack audio perception capabilities. Therefore, we ask the
question: can we build a model that has both audio perception and reasoning
ability?
In this paper, we propose a new audio foundation model, called LTU (Listen,
Think, and Understand). To train LTU, we created a new OpenAQA-5M dataset
consisting of 1.9 million closed-ended and 3.7 million open-ended, diverse
(audio, question, answer) tuples, and used an autoregressive training
framework with a perception-to-understanding curriculum. LTU demonstrates
strong performance and generalization ability on conventional audio tasks such
as classification and captioning. More importantly, it exhibits emerging audio
reasoning and comprehension abilities that are absent in existing audio models.
To the best of our knowledge, LTU is one of the first multimodal large language
models that focus on general audio (rather than just speech) understanding.
Comment: Accepted at ICLR 2024. Code, dataset, and models are available at
https://github.com/YuanGongND/ltu. The interactive demo is at
https://huggingface.co/spaces/yuangongfdu/lt
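For readers unfamiliar with this kind of supervision, a minimal sketch of what
an (audio, question, answer) tuple and the perception-to-understanding ordering
might look like is given below. The field names and examples are illustrative
assumptions, not the actual OpenAQA-5M schema.

```python
# Illustrative sketch of (audio, question, answer) tuples and a
# perception-to-understanding curriculum ordering. Field names and examples
# are assumptions for illustration, not the OpenAQA-5M schema.
from dataclasses import dataclass

@dataclass
class AudioQA:
    audio_path: str
    question: str
    answer: str
    stage: str  # "perception" (closed-ended) or "understanding" (open-ended)

samples = [
    AudioQA("clips/door.wav", "What sound is heard?", "A door slamming.",
            "perception"),
    AudioQA("clips/door.wav", "What might have just happened in the room?",
            "Someone likely left abruptly, since the door was slammed shut.",
            "understanding"),
]

# curriculum: present closed-ended perception tuples before open-ended ones
ordered = sorted(samples, key=lambda s: s.stage != "perception")
print([s.question for s in ordered])
```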
Chain of Thought Prompt Tuning in Vision Language Models
Language-Image Pre-training has demonstrated promising results on zero-shot
and few-shot downstream tasks by prompting visual models with natural language
prompts. However, most recent studies only use a single prompt for tuning,
neglecting the inherent step-by-step cognitive reasoning process that humans
conduct in complex task settings, for example, when processing images from
unfamiliar domains. Chain of Thought is a simple and effective approximation of
the human reasoning process and has proven useful for natural language
processing (NLP) tasks. Based on this cognitive intuition, we believe that
conducting effective reasoning is also an important problem in visual tasks,
and a chain of thought could be a solution to this problem. In this work, we
propose a novel chain of thought prompt tuning for vision-language modeling.
Extensive experiments show that our method not only generalizes better on image
classification tasks, transfers better beyond a single dataset, and shows
stronger domain generalization, but also performs much better on image-text
retrieval and visual question answering, which require more reasoning
capabilities. We are the first to successfully adapt chain-of-thought prompting
in a way that combines visual and textual embeddings. We will release our
code.
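A very rough sketch of the chained-prompt idea follows: instead of one
learnable prompt, a short chain of prompt stages is learned, each conditioning
the next stage's representation before classification. The dimensions, the
fusion rule, and the chaining mechanism are assumptions made for illustration,
not the paper's architecture.

```python
# Hedged sketch of chained prompt tuning: several learnable prompt stages are
# applied in sequence to an image feature. Architecture details are assumed.
import torch
import torch.nn as nn

class PromptChain(nn.Module):
    def __init__(self, num_steps=3, prompt_len=4, dim=512, num_classes=10):
        super().__init__()
        # one learnable prompt (context vectors) per reasoning step
        self.prompts = nn.ParameterList(
            [nn.Parameter(torch.randn(prompt_len, dim) * 0.02)
             for _ in range(num_steps)]
        )
        self.step = nn.Linear(dim, dim)        # carries state between steps
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, image_feat):             # image_feat: (batch, dim)
        state = image_feat
        for prompt in self.prompts:
            ctx = prompt.mean(dim=0)           # fuse this step's prompt context
            state = torch.tanh(self.step(state + ctx))
        return self.classifier(state)

model = PromptChain()
logits = model(torch.randn(2, 512))            # two dummy image features
print(logits.shape)                            # torch.Size([2, 10])
```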
DoLa: Decoding by Contrasting Layers Improves Factuality in Large Language Models
Despite their impressive capabilities, large language models (LLMs) are prone
to hallucinations, i.e., generating content that deviates from facts seen
during pretraining. We propose a simple decoding strategy for reducing
hallucinations with pretrained LLMs that does not require conditioning on
retrieved external knowledge nor additional fine-tuning. Our approach obtains
the next-token distribution by contrasting the differences in logits obtained
from projecting the later layers versus earlier layers to the vocabulary space,
exploiting the fact that factual knowledge in an LLM has generally been shown
to be localized to particular transformer layers. We find that this Decoding by
Contrasting Layers (DoLa) approach is able to better surface factual knowledge
and reduce the generation of incorrect facts. DoLa consistently improves
truthfulness across multiple-choice and open-ended generation tasks, for
example improving the performance of LLaMA-family models on TruthfulQA by
12-17 absolute percentage points, demonstrating its potential to make LLMs
reliably generate truthful facts.
Comment: ICLR 2024 main conference paper. The source code is available at
https://github.com/voidism/DoL
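The core contrast can be sketched in a few lines: project an early
("premature") hidden state and the final ("mature") hidden state through the
language-model head and score next tokens by the difference of their
log-probabilities. The gpt2 checkpoint, the fixed early layer, and the omission
of DoLa's dynamic layer selection and plausibility constraint are
simplifications for illustration; see the repository above for the real
implementation.

```python
# Rough sketch of layer-contrastive decoding in the spirit of DoLa: score the
# next token by the gap between final-layer and early-layer log-probabilities.
# Checkpoint and early-layer index are illustrative choices.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).hidden_states      # embeddings + one per layer

# "early exit": apply the final layer norm, then the LM head, to an early layer
early = model.lm_head(model.transformer.ln_f(hidden[6][:, -1]))
final = model.lm_head(hidden[-1][:, -1])        # last entry is already normed

# contrast: prefer tokens whose probability grows from early to final layers
scores = torch.log_softmax(final, -1) - torch.log_softmax(early, -1)
print(tok.decode(scores.argmax(-1)))
```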
RQ-RAG: Learning to Refine Queries for Retrieval Augmented Generation
Large Language Models (LLMs) exhibit remarkable capabilities but are prone to
generating inaccurate or hallucinatory responses. This limitation stems from
their reliance on vast pretraining datasets, making them susceptible to errors
in unseen scenarios. Retrieval-Augmented Generation (RAG) addresses this by
incorporating external, relevant documents into the
response generation process, thus leveraging non-parametric knowledge alongside
LLMs' in-context learning abilities. However, existing RAG implementations
primarily focus on initial input for context retrieval, overlooking the nuances
of ambiguous or complex queries that necessitate further clarification or
decomposition for accurate responses. To this end, we propose Learning to
Refine Queries for Retrieval Augmented Generation (RQ-RAG), which enhances the
model by equipping it with explicit capabilities for query rewriting,
decomposition, and disambiguation. Our experimental results indicate
that our method, when applied to a 7B Llama2 model, surpasses the previous
state-of-the-art (SOTA) by an average of 1.9% across three single-hop QA
datasets, and also demonstrates enhanced performance in handling complex,
multi-hop QA datasets. Our code is available at
https://github.com/chanchimin/RQ-RAG
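A minimal sketch of the refine-then-retrieve idea (not the RQ-RAG training
recipe) is shown below. The refiner checkpoint, the prompts, and the toy
keyword retriever are stand-ins chosen for illustration.

```python
# Minimal refine-then-retrieve sketch (not the RQ-RAG training recipe): an
# instruction-tuned model rewrites the query, a toy keyword retriever fetches
# context, and the same model answers from that context. All names assumed.
from transformers import pipeline

refiner = pipeline("text2text-generation", model="google/flan-t5-small")

corpus = [
    "Jaguar is a British luxury car manufacturer founded in 1922.",
    "The jaguar is a large cat native to the Americas that can run about 80 km/h.",
]

def retrieve(query: str) -> str:
    # toy retriever: passage with the most word overlap with the refined query
    words = set(query.lower().split())
    return max(corpus, key=lambda doc: len(words & set(doc.lower().split())))

question = "How fast can a jaguar go?"
refined = refiner(
    "Rewrite the question so the intended meaning is explicit: " + question,
    max_new_tokens=32,
)[0]["generated_text"]

context = retrieve(refined)
answer = refiner(
    f"Answer the question using the context.\nContext: {context}\nQuestion: {refined}",
    max_new_tokens=32,
)[0]["generated_text"]
print(refined, "->", answer)
```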