Selecting Informative Contexts Improves Language Model Finetuning
We present a general finetuning meta-method that we call information gain
filtration for improving the overall training efficiency and final performance
of language model finetuning. This method uses a secondary learner which
attempts to quantify the benefit of finetuning the language model on each given
example. During the finetuning process, we use this learner to decide whether
each given example should be trained on or skipped. We show that it
suffices for this learner to be simple and that the finetuning process itself
is dominated by the relatively trivial relearning of a new unigram frequency
distribution over the modelled language domain, a process which the learner
aids. Our method trains to convergence using 40% fewer batches than normal
finetuning, and achieves a median perplexity of 54.0 on a books dataset
compared to a median perplexity of 57.3 for standard finetuning using the same
neural architecture.
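The skip-or-train decision is easy to picture in code. The sketch below is a minimal illustration of the filtering loop, not the authors' implementation; `StubLM`, `train_step`, and `predict_gain` are hypothetical stand-ins for the language model and the secondary learner.

```python
# Minimal sketch of information gain filtration as described above (not the
# paper's code). The secondary learner is abstracted as predict_gain(batch),
# which estimates how much finetuning on the batch would help.

from typing import Callable, List, Sequence

def filtered_finetune(
    lm,                                   # object exposing train_step(batch)
    batches: Sequence[str],               # finetuning examples/batches
    predict_gain: Callable[[str], float], # secondary learner: batch -> estimated gain
    threshold: float = 0.0,               # skip batches whose predicted gain is below this
) -> List[int]:
    """Run finetuning, skipping batches the secondary learner deems uninformative."""
    used = []
    for i, batch in enumerate(batches):
        if predict_gain(batch) < threshold:
            continue                      # uninformative example: skip the update
        lm.train_step(batch)              # ordinary finetuning step
        used.append(i)
    return used

# Toy usage with stubs standing in for the real model and learner.
class StubLM:
    def train_step(self, batch: str) -> None:
        print("trained on:", batch)

kept = filtered_finetune(StubLM(), ["rare domain text", "common filler"],
                         predict_gain=lambda b: 1.0 if "rare" in b else -1.0)
print("used batch indices:", kept)
```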
Frozen Transformers in Language Models Are Effective Visual Encoder Layers
This paper reveals that large language models (LLMs), despite being trained
solely on textual data, are surprisingly strong encoders for purely visual
tasks in the absence of language. Even more intriguingly, this can be achieved
by a simple yet previously overlooked strategy -- employing a frozen
transformer block from pre-trained LLMs as a constituent encoder layer to
directly process visual tokens. Our work pushes the boundaries of leveraging
LLMs for computer vision tasks, significantly departing from conventional
practices that typically necessitate a multi-modal vision-language setup with
associated language prompts, inputs, or outputs. We demonstrate that our
approach consistently enhances performance across a diverse range of tasks,
encompassing pure 2D and 3D visual recognition tasks (e.g., image and point
cloud classification), temporal modeling tasks (e.g., action recognition),
non-semantic tasks (e.g., motion forecasting), and multi-modal tasks (e.g.,
2D/3D visual question answering and image-text retrieval). Such improvements
are a general phenomenon, applicable to various types of LLMs (e.g., LLaMA and
OPT) and different LLM transformer blocks. We additionally propose the
information filtering hypothesis to explain the effectiveness of pre-trained
LLMs in visual encoding -- the pre-trained LLM transformer blocks discern
informative visual tokens and further amplify their effect. This hypothesis is
empirically supported by the observation that the feature activation, after
training with LLM transformer blocks, exhibits a stronger focus on relevant
regions. We hope that our work inspires new perspectives on utilizing LLMs and
deepening our understanding of their underlying mechanisms. Code is available
at https://github.com/ziqipang/LM4VisualEncoding. Comment: 23 pages, 13 figures. Code at
https://github.com/ziqipang/LM4VisualEncoding
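To make the setup concrete, here is a minimal PyTorch sketch of the strategy: a frozen transformer block inserted between a visual encoder and its task head, with learned linear projections to match widths. A randomly initialized nn.TransformerEncoderLayer stands in for the pre-trained LLM block (in the paper it comes from a model such as LLaMA or OPT), so this illustrates the wiring only, not the released code.

```python
import torch
import torch.nn as nn

class VisualEncoderWithFrozenLMBlock(nn.Module):
    def __init__(self, vis_dim: int = 384, llm_dim: int = 1024, num_classes: int = 10):
        super().__init__()
        self.visual_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=vis_dim, nhead=6, batch_first=True),
            num_layers=4,
        )
        self.proj_in = nn.Linear(vis_dim, llm_dim)      # learned projection into LLM width
        self.llm_block = nn.TransformerEncoderLayer(     # stand-in for a pretrained LLM block
            d_model=llm_dim, nhead=8, batch_first=True
        )
        for p in self.llm_block.parameters():            # keep the LLM block frozen
            p.requires_grad = False
        self.proj_out = nn.Linear(llm_dim, vis_dim)      # learned projection back
        self.head = nn.Linear(vis_dim, num_classes)

    def forward(self, visual_tokens: torch.Tensor) -> torch.Tensor:
        x = self.visual_encoder(visual_tokens)           # (batch, tokens, vis_dim)
        x = self.proj_out(self.llm_block(self.proj_in(x)))
        return self.head(x.mean(dim=1))                  # pool tokens, classify

tokens = torch.randn(2, 16, 384)                         # e.g. 16 patch tokens per image
logits = VisualEncoderWithFrozenLMBlock()(tokens)
print(logits.shape)                                      # torch.Size([2, 10])
```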
Text-Only Image Captioning with Multi-Context Data Generation
Text-only Image Captioning (TIC) is an approach that aims to construct, from
text alone, a model that can accurately describe images. Recently,
diffusion models have demonstrated remarkable capabilities in generating
high-quality images that are semantically coherent with given texts. This
presents an opportunity to generate synthetic training images for TIC. However,
we identify a challenge: images generated from simple descriptions typically
depict a single perspective with only one or a few contexts, which does not
match the complexity of real-world scenes. In this paper, we propose a novel
framework that addresses this
issue by introducing multi-context data generation. Starting with an initial
text corpus, our framework employs a large language model to select multiple
sentences that describe the same scene from various perspectives. These
sentences are then summarized into a single sentence with multiple contexts. We
generate simple images using the straightforward sentences and complex images
using the summarized sentences through diffusion models. Finally, we train the
model exclusively using the synthetic image-text pairs obtained from this
process. Experimental results demonstrate that our proposed framework
effectively tackles the central challenge we have identified, achieving the
state-of-the-art performance on popular datasets such as MSCOCO, Flickr30k, and
SS1M.
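The generation pipeline reduces to a few lines of control flow. In the sketch below, `select_related_sentences`, `summarize`, and `generate_image` are hypothetical stubs standing in for the large language model and the diffusion model, so only the multi-context data-construction logic is shown.

```python
from typing import List, Tuple

def select_related_sentences(corpus: List[str], anchor: str, k: int = 3) -> List[str]:
    """Stand-in for an LLM call that picks sentences describing the same scene."""
    return [s for s in corpus if s != anchor][:k]

def summarize(sentences: List[str]) -> str:
    """Stand-in for an LLM call that fuses several views into one multi-context sentence."""
    return " ".join(sentences)

def generate_image(prompt: str) -> str:
    """Stand-in for a text-to-image diffusion call; returns a fake image id."""
    return f"image_for::{prompt[:40]}"

def build_training_pairs(corpus: List[str]) -> List[Tuple[str, str]]:
    pairs = []
    for anchor in corpus:
        related = select_related_sentences(corpus, anchor)
        multi_context = summarize([anchor] + related)
        pairs.append((generate_image(anchor), anchor))                 # simple image, simple caption
        pairs.append((generate_image(multi_context), multi_context))   # complex image, rich caption
    return pairs

corpus = ["a dog runs on the beach", "waves crash near a dog", "a child throws a ball"]
for image, caption in build_training_pairs(corpus):
    print(image, "<->", caption)
```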
Knowledge-driven Data Construction for Zero-shot Evaluation in Commonsense Question Answering
Recent developments in pre-trained neural language modeling have led to leaps
in accuracy on commonsense question-answering benchmarks. However, there is
increasing concern that models overfit to specific tasks, without learning to
utilize external knowledge or perform general semantic reasoning. In contrast,
zero-shot evaluations have shown promise as a more robust measure of a model's
general reasoning abilities. In this paper, we propose a novel neuro-symbolic
framework for zero-shot question answering across commonsense tasks. Guided by
a set of hypotheses, the framework studies how to transform various
pre-existing knowledge resources into a form that is most effective for
pre-training models. We vary the set of language models, training regimes,
knowledge sources, and data generation strategies, and measure their impact
across tasks. Extending on prior work, we devise and compare four constrained
distractor-sampling strategies. We provide empirical results across five
commonsense question-answering tasks with data generated from five external
knowledge resources. We show that, while an individual knowledge graph is
better suited for specific tasks, a global knowledge graph brings consistent
gains across different tasks. In addition, both preserving the structure of the
task and generating fair and informative questions help language models learn
more effectively. Comment: AAAI 202
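As a rough illustration of this kind of data construction (not the authors' pipeline), the sketch below turns knowledge-graph triples into multiple-choice questions using one constrained distractor-sampling strategy: distractors share the relation with the correct answer but differ from it. The tiny triple list and the question template are made up for the example.

```python
import random
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)

KG: List[Triple] = [
    ("knife", "UsedFor", "cutting"),
    ("pen", "UsedFor", "writing"),
    ("bed", "UsedFor", "sleeping"),
    ("oven", "UsedFor", "baking"),
]

def make_question(triple: Triple, kg: List[Triple], n_distractors: int = 2):
    head, relation, tail = triple
    question = f"What is a {head} used for?"             # simple template for UsedFor
    # Constrained sampling: distractors share the relation but not the answer.
    pool = [t for (h, r, t) in kg if r == relation and t != tail]
    distractors = random.sample(pool, min(n_distractors, len(pool)))
    options = distractors + [tail]
    random.shuffle(options)
    return question, options, tail

random.seed(0)
q, options, answer = make_question(KG[0], KG)
print(q, options, "answer:", answer)
```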
A Survey on In-context Learning
With the increasing ability of large language models (LLMs), in-context
learning (ICL) has become a new paradigm for natural language processing (NLP),
where LLMs make predictions based only on contexts augmented with a few
examples. Exploring ICL to evaluate and extrapolate the abilities of LLMs has
become a new trend. In this paper, we aim to survey and summarize the progress
and challenges of ICL. We first present a formal definition of ICL and clarify
its relationship to related studies. Then, we organize and discuss advanced
techniques, including training strategies, demonstration design strategies,
as well as related analysis. Finally, we discuss the challenges of ICL and
provide potential directions for further research. We hope that our work can
encourage more research on uncovering how ICL works and improving ICL. Comment: Papers collected until 2023/05/2
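For readers new to the paradigm, the following sketch shows the basic ICL setup the survey covers: a prompt concatenates a few input-label demonstrations with the query, and the LLM completes the final slot without any gradient updates. The sentiment template and examples are illustrative only.

```python
from typing import List, Tuple

def build_icl_prompt(demos: List[Tuple[str, str]], query: str) -> str:
    """Concatenate input/label demonstrations followed by the test input."""
    lines = [f"Review: {x}\nSentiment: {y}" for x, y in demos]
    lines.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(lines)

demos = [("The film was a delight.", "positive"),
         ("I walked out halfway through.", "negative")]
prompt = build_icl_prompt(demos, "A moving, beautifully shot story.")
print(prompt)   # the LLM would then complete the final "Sentiment:" slot
```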
SEGO: Sequential Subgoal Optimization for Mathematical Problem-Solving
Large Language Models (LLMs) have driven substantial progress in artificial
intelligence in recent years, exhibiting impressive capabilities across a wide
range of tasks, including mathematical problem-solving. Inspired by the success
of subgoal-based methods, we propose a novel framework called
\textbf{SE}quential sub\textbf{G}oal \textbf{O}ptimization (SEGO) to enhance
LLMs' ability to solve mathematical problems. By establishing a connection
between the subgoal breakdown process and the probability of solving problems,
SEGO aims to identify better subgoals with theoretical guarantees. Addressing
the challenge of identifying suitable subgoals in a large solution space, our
framework generates problem-specific subgoals and adjusts them according to
carefully designed criteria. Incorporating these optimized subgoals into the
policy model training leads to significant improvements in problem-solving
performance. We validate SEGO's efficacy through experiments on two benchmarks,
GSM8K and MATH, where our approach outperforms existing methods, highlighting
the potential of SEGO in AI-driven mathematical problem-solving.
Data and code associated with this paper will be available at
https://github.com/zhaoxlpku/SEGO. Comment: Preprint
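A high-level reading of the approach is a propose-score-commit loop over subgoals. The sketch below illustrates that loop under stated assumptions; `propose_subgoals` and `estimate_solve_prob` are hypothetical stand-ins for the LLM and the learned probability estimate, and the code is not taken from the released repository.

```python
from typing import List

def propose_subgoals(problem: str, state: List[str]) -> List[str]:
    """Stand-in for an LLM proposing candidate next subgoals."""
    return [f"step {len(state) + 1} option {i}" for i in range(3)]

def estimate_solve_prob(problem: str, state: List[str], subgoal: str) -> float:
    """Stand-in for a learned estimate of P(solve | problem, subgoals so far)."""
    return 1.0 / (1.0 + len(subgoal))

def sequential_subgoal_search(problem: str, max_steps: int = 3) -> List[str]:
    plan: List[str] = []
    for _ in range(max_steps):
        candidates = propose_subgoals(problem, plan)
        best = max(candidates, key=lambda g: estimate_solve_prob(problem, plan, g))
        plan.append(best)                 # commit to the highest-scoring subgoal
    return plan

print(sequential_subgoal_search("Solve 3x + 5 = 20 for x"))
```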
Is This Loss Informative? Faster Text-to-Image Customization by Tracking Objective Dynamics
Text-to-image generation models represent the next step of evolution in image
synthesis, offering a natural way to achieve flexible yet fine-grained control
over the result. One emerging area of research is the fast adaptation of large
text-to-image models to smaller datasets or new visual concepts. However, many
efficient methods of adaptation have a long training time, which limits their
practical applications, slows down research experiments, and consumes excessive
GPU resources. In this work, we study the training dynamics of popular
text-to-image personalization methods (such as Textual Inversion or
DreamBooth), aiming to speed them up. We observe that most concepts are learned
at early stages and do not improve in quality later, but standard model
convergence metrics fail to indicate that. Instead, we propose a simple drop-in
early stopping criterion that only requires computing the regular training
objective on a fixed set of inputs for all training iterations. Our experiments
on Stable Diffusion for a range of concepts and for three personalization
methods demonstrate the competitive performance of our approach, making
adaptation up to 8 times faster with no significant drops in quality. Comment: Code: https://github.com/yandex-research/DVAR. 19 pages, 14 figures
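The criterion itself is simple to wire into a training loop: evaluate the ordinary training objective on one fixed batch at every iteration and stop once its recent values stop moving. The plateau test below (relative spread over a sliding window) is an illustrative choice and may differ from the exact rule used in the paper.

```python
from collections import deque
from typing import Deque

class FixedBatchEarlyStopper:
    def __init__(self, window: int = 50, rel_tol: float = 0.01):
        self.history: Deque[float] = deque(maxlen=window)
        self.rel_tol = rel_tol

    def should_stop(self, fixed_batch_loss: float) -> bool:
        self.history.append(fixed_batch_loss)
        if len(self.history) < self.history.maxlen:
            return False                       # not enough history yet
        lo, hi = min(self.history), max(self.history)
        return (hi - lo) <= self.rel_tol * max(abs(hi), 1e-8)

# Usage inside a training loop (pseudo-values standing in for real losses).
stopper = FixedBatchEarlyStopper(window=5, rel_tol=0.05)
for step, loss in enumerate([2.0, 1.2, 0.9, 0.86, 0.85, 0.85, 0.84, 0.84, 0.84]):
    if stopper.should_stop(loss):
        print(f"stopping at step {step}")
        break
```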
Training Language Models with Language Feedback at Scale
Pretrained language models often generate outputs that are not in line with
human preferences, such as harmful text or factually incorrect summaries.
Recent work approaches the above issues by learning from a simple form of human
feedback: comparisons between pairs of model-generated outputs. However,
comparison feedback only conveys limited information about human preferences.
In this paper, we introduce Imitation learning from Language Feedback (ILF), a
new approach that utilizes more informative language feedback. ILF consists of
three steps that are applied iteratively: first, conditioning the language
model on the input, an initial LM output, and feedback to generate refinements.
Second, selecting the refinement incorporating the most feedback. Third,
finetuning the language model to maximize the likelihood of the chosen
refinement given the input. We show theoretically that ILF can be viewed as
Bayesian inference, similar to reinforcement learning from human feedback. We
evaluate ILF's effectiveness on a carefully-controlled toy task and a realistic
summarization task. Our experiments demonstrate that large language models
accurately incorporate feedback and that finetuning with ILF scales well with
the dataset size, even outperforming finetuning on human summaries. Learning
from both language and comparison feedback outperforms learning from each
alone, achieving human-level summarization performance.
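One ILF iteration can be sketched as three function calls mirroring the steps above. In the sketch, `refine`, `score_feedback_incorporation`, and `finetune_on` are hypothetical stubs for the LM calls and the finetuning step, so only the control flow is shown.

```python
from typing import List, Tuple

def refine(model, text_input: str, initial_output: str, feedback: str, n: int = 4) -> List[str]:
    """Step 1: condition the LM on input, initial output, and feedback to draft refinements."""
    return [f"{initial_output} [revised v{i} per: {feedback}]" for i in range(n)]

def score_feedback_incorporation(refinement: str, feedback: str) -> float:
    """Step 2 helper: how well does a refinement reflect the feedback? (stub scorer)"""
    return float(feedback in refinement)

def finetune_on(model, pairs: List[Tuple[str, str]]) -> None:
    """Step 3: maximize likelihood of the chosen refinement given the input (stubbed)."""
    print(f"finetuning on {len(pairs)} (input, refinement) pairs")

def ilf_iteration(model, text_input: str, initial_output: str, feedback: str) -> str:
    candidates = refine(model, text_input, initial_output, feedback)
    best = max(candidates, key=lambda r: score_feedback_incorporation(r, feedback))
    finetune_on(model, [(text_input, best)])
    return best

print(ilf_iteration(None, "Summarize the article...", "A rough summary.", "mention the main finding"))
```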