Robust Multilingual Part-of-Speech Tagging via Adversarial Training
Adversarial training (AT) is a powerful regularization method for neural
networks, aiming to achieve robustness to input perturbations. Yet, the
specific effects of the robustness obtained from AT are still unclear in the
context of natural language processing. In this paper, we propose and analyze a
neural POS tagging model that exploits AT. In our experiments on the Penn
Treebank WSJ corpus and the Universal Dependencies (UD) dataset (27 languages),
we find that AT not only improves the overall tagging accuracy, but also 1)
prevents over-fitting well in low resource languages and 2) boosts tagging
accuracy for rare/unseen words. We also demonstrate that 3) the improved
tagging performance by AT contributes to the downstream task of dependency
parsing, and that 4) AT helps the model to learn cleaner word representations.
5) The proposed AT model is generally effective in different sequence labeling
tasks. These positive results motivate further use of AT for natural language
tasks.
Comment: NAACL 201
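The adversarial-training idea described above can be sketched roughly as follows: perturb the input word embeddings a small step along the loss gradient (the first-order worst-case direction), then train on the perturbed input as well. This is a minimal FGSM-style sketch; the function name, step size, and L2 normalization are assumptions, not the authors' exact recipe.

```python
import math

def adversarial_example(embedding, grad, epsilon=0.02):
    """Perturb a word embedding a step of size epsilon along the
    L2-normalized gradient of the loss w.r.t. the embedding.
    Training on such perturbed inputs is the regularizer sketched here."""
    norm = math.sqrt(sum(g * g for g in grad)) + 1e-12  # avoid div by zero
    return [e + epsilon * g / norm for e, g in zip(embedding, grad)]
```

In training, the tagging loss would be computed once on the clean embeddings and once on the perturbed ones, and the two losses summed.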
Recommended from our members
Jabberwocky Parsing: Dependency Parsing with Lexical Noise
Parsing models have long benefited from the use of lexical information, and indeed current state-of-the-art neural network models for dependency parsing achieve substantial improvements by benefiting from distributed representations of lexical information. At the same time, humans can easily parse sentences with unknown or even novel words, as in Lewis Carroll’s poem Jabberwocky. In this paper, we carry out jabberwocky parsing experiments, exploring how robust a state-of-the-art neural network parser is to the absence of lexical information. We find that current parsing models, at least under usual training regimens, are in fact overly dependent on lexical information, and perform badly in the jabberwocky context. We also demonstrate that the technique of word dropout drastically improves parsing robustness in this setting, and also leads to significant improvements in out-of-domain parsing.
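The word-dropout technique mentioned above can be sketched as follows: during training, each token is stochastically replaced by an unknown-word symbol, with rare words replaced more often so the model cannot lean exclusively on lexical identity. This is a common frequency-dependent variant of word dropout; the exact scheme and parameter names in the paper may differ.

```python
import random

def word_dropout(tokens, counts, alpha=0.25, unk="<unk>", rng=None):
    """Replace token t with <unk> with probability alpha / (alpha + count(t)),
    so unseen words (count 0) are always dropped and frequent words rarely are."""
    rng = rng or random.Random(0)  # fixed seed here for reproducibility
    return [unk if rng.random() < alpha / (alpha + counts.get(t, 0)) else t
            for t in tokens]
```

At test time no dropout is applied; the model has simply learned to fall back on context and morphology when lexical identity is unreliable.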
Batch Prompting: Efficient Inference with Large Language Model APIs
Performing inference on hundreds of thousands of samples with large language
models (LLMs) can be computationally and financially costly. We propose batch
prompting, a simple alternative prompting approach that enables the LLM to run
inference in batches, instead of one sample at a time. Our method reduces both
token and time costs while retaining downstream performance. We theoretically
demonstrate that under a few-shot in-context learning setting, the inference
costs decrease almost inverse linearly with the number of samples in each
batch. We extensively validate the effectiveness of batch prompting on ten
datasets across commonsense QA, arithmetic reasoning, and NLI/NLU: batch
prompting significantly reduces LLM (Codex) inference token and time costs
(with up to six samples per batch) while achieving better or comparable
performance. Our analysis shows that the number of samples in each batch and
the complexity of tasks affect its performance. Further, batch prompting can be
applied across different LLMs and reasoning methods.
Comment: 18 pages, 9 figures
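As a rough sketch of the batch-prompting format described above, several test samples are packed into a single prompt with indexed markers, and the model's single completion is split back into per-sample answers. The Q[i]/A[i] markers and helper names here are illustrative assumptions, not the authors' exact template.

```python
import re

def build_batch_prompt(exemplars, queries):
    """Pack few-shot exemplars (question, answer) and N test questions
    into one prompt whose answers are expected as indexed A[i] lines."""
    parts = [f"Q[{i}]: {q}" for i, (q, _) in enumerate(exemplars, 1)]
    parts += [f"A[{i}]: {a}" for i, (_, a) in enumerate(exemplars, 1)]
    parts += [f"Q[{i}]: {q}" for i, q in enumerate(queries, 1)]
    return "\n".join(parts)

def parse_batch_answers(completion, n):
    """Recover the n per-sample answers from the indexed completion;
    missing indices come back as empty strings."""
    found = dict(re.findall(r"A\[(\d+)\]:\s*([^\n]*)", completion))
    return [found.get(str(i), "") for i in range(1, n + 1)]
```

Because the few-shot exemplars are shared across all samples in the batch, their token cost is amortized, which is the source of the near inverse-linear cost reduction the abstract describes.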
Evaluating GPT-4 and ChatGPT on Japanese Medical Licensing Examinations
As large language models (LLMs) gain popularity among speakers of diverse
languages, we believe that it is crucial to benchmark them to better understand
model behaviors, failures, and limitations in languages beyond English. In this
work, we evaluate LLM APIs (ChatGPT, GPT-3, and GPT-4) on the Japanese national
medical licensing examinations from the past five years, including the current
year. Our team comprises native Japanese-speaking NLP researchers and a
practicing cardiologist based in Japan. Our experiments show that GPT-4
outperforms ChatGPT and GPT-3 and passes all six years of the exams,
highlighting LLMs' potential in a language that is typologically distant from
English. However, our evaluation also exposes critical limitations of the
current LLM APIs. First, LLMs sometimes select prohibited choices that should
be strictly avoided in medical practice in Japan, such as suggesting
euthanasia. Further, our analysis shows that the API costs are generally higher
and the maximum context size is smaller for Japanese because of the way
non-Latin scripts are currently tokenized in the pipeline. We release our
benchmark as Igaku QA as well as all model outputs and exam metadata. We hope
that our results and benchmark will spur progress on more diverse applications
of LLMs. Our benchmark is available at https://github.com/jungokasai/IgakuQA.
Comment: Added results from the March 2023 exam
Evaluating Spatial Understanding of Large Language Models
Large language models (LLMs) show remarkable capabilities across a variety of
tasks. Despite the models only seeing text in training, several recent studies
suggest that LLM representations implicitly capture aspects of the underlying
grounded concepts. Here, we explore LLM representations of a particularly
salient kind of grounded knowledge -- spatial relationships. We design
natural-language navigation tasks and evaluate the ability of LLMs, in
particular GPT-3.5-turbo, GPT-4, and Llama2 series models, to represent and
reason about spatial structures, and compare these abilities to human
performance on the same tasks. These tasks reveal substantial variability in
LLM performance across different spatial structures, including square,
hexagonal, and triangular grids, rings, and trees. We also discover that,
similar to humans, LLMs utilize object names as landmarks for maintaining
spatial maps. Finally, in extensive error analysis, we find that LLMs' mistakes
reflect both spatial and non-spatial factors. These findings suggest that LLMs
appear to capture certain aspects of spatial structure implicitly, but room for
improvement remains.
ACID: Abstractive, Content-Based IDs for Document Retrieval with Language Models
Generative retrieval (Wang et al., 2022; Tay et al., 2022) is a new approach
for end-to-end document retrieval that directly generates document identifiers
given an input query. Techniques for designing effective, high-quality document
IDs remain largely unexplored. We introduce ACID, in which each document's ID
is composed of abstractive keyphrases generated by a large language model,
rather than an integer ID sequence as done in past work. We compare our method
with the current state-of-the-art technique for ID generation, which produces
IDs through hierarchical clustering of document embeddings. We also examine
simpler methods to generate natural-language document IDs, including the naive
approach of using the first k words of each document as its ID or words with
high BM25 scores in that document. We show that using ACID improves top-10 and
top-20 accuracy by 15.6% and 14.4% (relative) respectively versus the
state-of-the-art baseline on the MSMARCO 100k retrieval task, and 4.4% and 4.0%
respectively on the Natural Questions 100k retrieval task. Our results
demonstrate the effectiveness of human-readable, natural-language IDs in
generative retrieval with LMs. The code for reproducing our results and the
keyword-augmented datasets will be released upon formal publication.
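One of the naive baselines described above is simple enough to sketch directly: using the first k words of a document as its natural-language identifier. The function name is a hypothetical label for that baseline; ACID itself instead generates abstractive keyphrases with an LLM.

```python
def first_k_words_id(document, k=5):
    """Naive natural-language document ID: the first k whitespace-separated
    words of the document, joined into a single string."""
    return " ".join(document.split()[:k])
```

In generative retrieval, such an ID string becomes the target sequence the language model is trained to emit given a query, so more descriptive IDs (like ACID's keyphrases) give the model a semantically richer target.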