Statistical Knowledge Assessment for Large Language Models
Given varying prompts regarding a factoid question, can a large language
model (LLM) reliably generate factually correct answers? Existing LLMs may
generate distinct responses for different prompts. In this paper, we study the
problem of quantifying knowledge contained in an LLM regarding a given set of
facts. We propose KaRR, a statistical approach to assess factual knowledge for
LLMs. The main idea is to estimate the ratio between the probability that the
LLM generates text corresponding to the answer entity, given diverse prompts of
the subject and the querying relation, and the probability that it generates
such text by random chance. Our assessment suite
contains a comprehensive set of 994,123 entities and 600 relations, with
1,395,905 text aliases. We use our method to evaluate 20 LLMs of various sizes,
including LLaMA, Alpaca, OPT, etc. Experiments show that our results have a
strong correlation (0.43 Kendall's τ) with the results of human assessment
on LLMs. Our results reveal that the knowledge in LLMs with the same backbone
architecture adheres to the scaling law, while tuning on instruction-following
data sometimes compromises the model's capability to generate factually correct
text reliably.
Comment: Accepted by NeurIPS 202
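The ratio-based idea behind KaRR can be illustrated with a toy sketch. This is an assumed simplification, not the paper's exact estimator: for one (subject, relation, answer) triple, compare the model's probability of producing the answer entity under several paraphrased prompts against the chance of producing it at random. All numbers and names below are hypothetical.

```python
def karr_score(prompt_probs, random_prob):
    """Average, over diverse prompts, of the ratio
    P(answer entity | prompt_i) / P(answer entity | random chance)."""
    ratios = [p / random_prob for p in prompt_probs]
    return sum(ratios) / len(ratios)

# Illustrative numbers for three paraphrased prompts of one fact.
prompt_probs = [0.30, 0.22, 0.05]   # P(answer | prompt_i), hypothetical
random_prob = 0.01                  # chance of emitting the entity unconditionally

score = karr_score(prompt_probs, random_prob)
```

A score well above 1 across prompts would suggest the model reliably encodes the fact, while a score near 1 indicates the answer appears no more often than chance.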
ImageNetVC: Zero-Shot Visual Commonsense Evaluation on 1000 ImageNet Categories
Recently, Pretrained Language Models (PLMs) have been serving as
general-purpose interfaces, posing a significant demand for comprehensive
visual knowledge. However, it remains unclear how well current PLMs and their
visually augmented counterparts (VaLMs) can master visual commonsense
knowledge. To investigate this, we propose ImageNetVC, a fine-grained,
human-annotated dataset specifically designed for zero-shot visual commonsense
evaluation across 1,000 ImageNet categories. Utilizing ImageNetVC, we delve
into the fundamental visual commonsense knowledge of both unimodal PLMs and
VaLMs, uncovering the scaling law and the influence of the backbone model on
VaLMs. Furthermore, we investigate the factors affecting the visual commonsense
knowledge of large-scale models, providing insights into the development of
language models enriched with visual commonsense knowledge. Our code and
dataset are available at https://github.com/hemingkx/ImageNetVC
Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis
Large language models (LLMs) have demonstrated remarkable potential in
handling multilingual machine translation (MMT). In this paper, we
systematically investigate the advantages and challenges of LLMs for MMT by
answering two questions: 1) How well do LLMs perform in translating a massive
number of languages? 2) Which factors affect LLMs' performance in translation?
We evaluate popular LLMs, including XGLM, OPT, BLOOMZ, and ChatGPT, on 102
languages. Our empirical results show that even the best-performing model, ChatGPT, still
lags behind the supervised baseline NLLB in 83.33% of translation directions.
Through further analysis, we discover that LLMs exhibit new working patterns
when used for MMT. First, prompt semantics can surprisingly be ignored when
given in-context exemplars, where LLMs still show strong performance even with
unreasonable prompts. Second, cross-lingual exemplars can provide better task
instruction for low-resource translation than exemplars in the same language
pairs. Third, we observe overestimated performance of BLOOMZ on the Flores-101
dataset, indicating a potential risk when using public datasets for
evaluation.
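The second finding, that cross-lingual exemplars can guide a low-resource translation direction, amounts to a particular way of assembling the in-context prompt. A minimal sketch, assuming a simple "Language: sentence" prompt layout; the exemplar pairs and language choices here are illustrative placeholders, not the paper's data.

```python
def build_prompt(exemplars, src_sentence, src_lang, tgt_lang):
    """Assemble an in-context MMT prompt where the exemplars may come
    from a different (e.g. high-resource) language pair than the query."""
    lines = []
    for ex_src, ex_tgt, ex_sl, ex_tl in exemplars:
        lines.append(f"{ex_sl}: {ex_src}\n{ex_tl}: {ex_tgt}")
    # The actual query uses the low-resource target language.
    lines.append(f"{src_lang}: {src_sentence}\n{tgt_lang}:")
    return "\n\n".join(lines)

# Cross-lingual exemplar: English-French pair guiding an English-Icelandic query.
exemplars = [("Hello.", "Bonjour.", "English", "French")]
prompt = build_prompt(exemplars, "Good morning.", "English", "Icelandic")
```

The prompt would then be sent to the LLM, which completes the final line in the target language; the exemplars convey the task format even though their language pair differs from the query's.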
Extrapolating Large Language Models to Non-English by Aligning Languages
Existing large language models show disparate capability across different
languages, due to the imbalance in the training data. Their performance on
English tasks is often stronger than on tasks in other languages. In this
paper, we empower pre-trained LLMs on non-English languages by building
semantic alignment across languages. We start from targeting individual
languages by performing cross-lingual instruction-tuning (CoIT) on LLaMA, i.e.,
tuning it with translation task data and cross-lingual general task data to
obtain cross-lingual models (x-LLaMAs), and formulate underlying scaling laws
to investigate the advantages of using scalable translation data. Then we
perform multilingual instruction-tuning (MuIT) with mixed resources to build
multilingual m-LLaMA. We also illustrate how we leverage the scaling laws to
optimize data allocation in a resource-constrained setting. Experiment results
on cross-lingual benchmarks XQUAD and MLQA show that x-LLaMAs surpass the
English instruction-tuned counterpart (Alpaca) by an average of 27.83% across
six non-English languages. Evaluation results on translation dataset Flores-101
show that x-LLaMAs outperform previous LLaMA-based models by an average of
18.89%. Encouragingly, m-LLaMA achieves comparable performance to x-LLaMAs on
individual languages and demonstrates the ability to follow multilingual
instructions. Further analysis on response content and representation space
reveals the alignment of the multilingual semantic space within the middle
layers of m-LLaMA.
A Survey on In-context Learning
With the increasing ability of large language models (LLMs), in-context
learning (ICL) has become a new paradigm for natural language processing (NLP),
where LLMs make predictions based only on contexts augmented with a few
examples. Exploring ICL to evaluate and extrapolate
the ability of LLMs. In this paper, we aim to survey and summarize the progress
and challenges of ICL. We first present a formal definition of ICL and clarify
its correlation to related studies. Then, we organize and discuss advanced
techniques, including training strategies, demonstration design strategies,
as well as related analysis. Finally, we discuss the challenges of ICL and
provide potential directions for further research. We hope that our work can
encourage more research on uncovering how ICL works and improving ICL.
Comment: Papers collected until 2023/05/2
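The ICL setup the survey formalizes can be shown concretely: the model receives a few labeled demonstrations in its context and predicts the label of a new input with no gradient update. A minimal sketch of the prompt construction; the sentiment task, labels, and texts are hypothetical examples, not from the survey.

```python
def format_icl_prompt(demonstrations, query):
    """Build a few-shot prompt: labeled demonstrations followed by the query,
    leaving the final label slot empty for the LLM to complete."""
    parts = [f"Review: {text}\nSentiment: {label}" for text, label in demonstrations]
    parts.append(f"Review: {query}\nSentiment:")
    return "\n\n".join(parts)

demos = [("Great movie!", "positive"), ("Terribly boring.", "negative")]
prompt = format_icl_prompt(demos, "I loved every minute.")
```

Demonstration design, one of the technique families the survey discusses, then reduces to choices made inside this function: which examples to select, in what order, and with what template.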