Paraphrasing and Translation
Paraphrasing and translation have previously been treated as unconnected natural language processing tasks. Whereas translation represents the preservation of meaning when an idea is rendered in the words of a different language, paraphrasing represents the preservation of meaning when an idea is expressed using different words in the same language. We show that the two are intimately related. The major contributions of this thesis are as follows:

• We define a novel technique for automatically generating paraphrases using bilingual parallel corpora, which are more commonly used as training data for statistical models of translation.

• We show that paraphrases can be used to improve the quality of statistical machine translation by addressing the problem of coverage and introducing a degree of generalization into the models.

• We explore the topic of automatic evaluation of translation quality, and show that the current standard evaluation methodology cannot be guaranteed to correlate with human judgments of translation quality.

Whereas previous data-driven approaches to paraphrasing depended on either uncommon data sources, such as multiple translations of the same source text, or language-specific resources, such as parsers, our approach harnesses more widely available parallel corpora and can be applied to any language for which a parallel corpus exists. The technique was evaluated by replacing phrases with their paraphrases, and asking judges whether the meaning of the original phrase was retained
and whether the resulting sentence remained grammatical. Paraphrases extracted from
a parallel corpus with manual alignments are judged to be accurate (both meaningful
and grammatical) 75% of the time, retaining the meaning of the original phrase 85%
of the time. Using automatic alignments, meaning can be retained at a rate of 70%.

Because our approach is language independent and probabilistic, it can be
easily integrated into statistical machine translation. A paraphrase model derived from
parallel corpora other than the one used to train the translation model can be used to
increase the coverage of statistical machine translation by adding translations of previously unseen words and phrases. If the translation of a word was not learned, but
a translation of a synonymous word has been learned, then the word is paraphrased and its paraphrase is translated. Phrases can be treated similarly. Results show that
augmenting a state-of-the-art SMT system with paraphrases in this way leads to significantly improved coverage and translation quality. For a training corpus with 10,000
sentence pairs, we increase the coverage of unique test set unigrams from 48% to 90%,
with more than half of the newly covered items accurately translated, as opposed to
none in current approaches.
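To make the pivoting idea concrete, here is a minimal sketch of how paraphrase probabilities could be computed from a phrase table extracted from a bilingual parallel corpus: an English phrase e1 is paraphrased by e2 when both translate to a common foreign phrase f, scored as p(e2|e1) = sum over f of p(e2|f) * p(f|e1). The phrases, probabilities, and helper names below are toy values for illustration only, not the thesis's actual tables or code.

from collections import defaultdict

# Toy phrase table, as would be estimated from a word-aligned bilingual
# parallel corpus. All entries are illustrative.
p_f_given_e = {
    "under control": {"unter kontrolle": 0.9},
    "in check":      {"unter kontrolle": 0.8},
}
p_e_given_f = {
    "unter kontrolle": {"under control": 0.55, "in check": 0.35},
}

def paraphrase_probs(e1):
    """Pivot through shared foreign phrases: p(e2|e1) = sum_f p(e2|f) * p(f|e1)."""
    probs = defaultdict(float)
    for f, p_fe in p_f_given_e.get(e1, {}).items():
        for e2, p_ef in p_e_given_f.get(f, {}).items():
            if e2 != e1:
                probs[e2] += p_ef * p_fe
    return dict(probs)

print(paraphrase_probs("under control"))  # {'in check': 0.315}

In a translation system, an unseen word or phrase could then be replaced by a high-probability paraphrase before lookup in the phrase table, which is the coverage mechanism described above.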
PAXQA: Generating Cross-lingual Question Answering Examples at Training Scale
Existing question answering (QA) systems owe much of their success to large,
high-quality training data. Such annotation efforts are costly, and the
difficulty compounds in the cross-lingual setting. Therefore, prior
cross-lingual QA work has focused on releasing evaluation datasets, and then
applying zero-shot methods as baselines. In this work, we propose a synthetic
data generation method for cross-lingual QA which leverages indirect
supervision from existing parallel corpora. Our method, termed PAXQA (Projecting annotations for cross-lingual (x) QA), decomposes
cross-lingual QA into two stages. In the first stage, we apply a question
generation (QG) model to the English side. In the second stage, we apply
annotation projection to translate both the questions and answers. To better
translate questions, we propose a novel use of lexically-constrained machine
translation, in which constrained entities are extracted from the parallel
bitexts. We release cross-lingual QA datasets across 4 languages, totaling 662K
QA examples. We then show that extractive QA models fine-tuned on these
datasets outperform both zero-shot models and models trained on prior synthetic data, indicating that our generated examples are of sufficient quality. We find that the largest
performance gains are for cross-lingual directions with non-English questions
and English contexts. Ablation studies show that our dataset generation method
is relatively robust to noise from automatic word alignments.
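As an illustration of the annotation-projection step in the second stage, the sketch below maps an English answer span onto its translation using word alignments from the parallel bitext. The function, alignment format, and example sentences are hypothetical and simplified; they are not the released PAXQA code.

def project_answer_span(answer_start, answer_end, alignments):
    """Map an English token span [answer_start, answer_end] onto target tokens.

    alignments is a list of (english_index, target_index) pairs, e.g. produced
    by an automatic word aligner run over the parallel bitext.
    """
    target_indices = [t for e, t in alignments if answer_start <= e <= answer_end]
    if not target_indices:
        return None  # answer span not recoverable from this alignment
    return min(target_indices), max(target_indices)

# English: "Marie Curie won the Nobel Prize in 1903"  (answer: tokens 6-7, "in 1903")
# Spanish: "Marie Curie ganó el Premio Nobel en 1903"
alignments = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 5), (5, 4), (6, 6), (7, 7)]
print(project_answer_span(6, 7, alignments))  # (6, 7)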
Winter is here: summarizing Twitter streams related to pre-scheduled events
Pre-scheduled events, such as TV shows and sports games, usually garner considerable attention from the public. Twitter captures large volumes of discussions and messages related to these events in real time. Twitter streams related to pre-scheduled events are characterized by the following: (1) spikes in the volume of published tweets reflect the highlights of the event, and (2) some of the published tweets refer to the characters involved in the event in the context in which they are currently portrayed in a subevent. In this paper, we take advantage of these characteristics to identify the highlights of pre-scheduled events from tweet streams, and we demonstrate a method to summarize these highlights. We evaluate our algorithm on tweets collected around two episodes of a popular TV show, Game of Thrones, Season 7.
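As a rough sketch of the first characteristic exploited above, the snippet below flags highlight candidates as time buckets whose tweet volume spikes well above the stream's average. The thresholding rule (mean plus two standard deviations) and the synthetic timestamps are illustrative assumptions, not the paper's exact procedure.

from collections import Counter
from statistics import mean, stdev

def detect_highlights(tweet_timestamps, bucket_seconds=60, k=2.0):
    """Return time buckets whose tweet volume exceeds mean + k * std over all buckets."""
    counts = Counter(ts // bucket_seconds for ts in tweet_timestamps)
    if len(counts) < 2:
        return []
    volumes = list(counts.values())
    threshold = mean(volumes) + k * stdev(volumes)
    return sorted(bucket for bucket, v in counts.items() if v > threshold)

# Synthetic stream: a burst of tweets around t = 600 s stands out as a highlight.
timestamps = list(range(0, 600, 20)) + list(range(600, 660, 2)) + list(range(660, 1200, 20))
print(detect_highlights(timestamps))  # [10], i.e. the minute starting at 600 s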
This Land is {Your, My} Land: Evaluating Geopolitical Biases in Language Models
Do the Spratly Islands belong to China, the Philippines, or Vietnam? A
pretrained large language model (LLM) may answer differently if asked in the
languages of each claimant country: Chinese, Tagalog, or Vietnamese. This
contrasts with a multilingual human, who would likely answer consistently. In
this work, we show that LLMs recall geopolitical knowledge inconsistently
across languages -- a phenomenon we term geopolitical bias. As a targeted case
study, we consider territorial disputes, an inherently controversial and cross-lingual task.
We first introduce the BorderLines dataset of territorial disputes. This
covers 256 territories, each of which is associated with a set of multiple-choice
questions in the languages of each claimant country (48 languages total). We
then pose these questions to LLMs to probe their internal knowledge. Finally,
we propose a suite of evaluation metrics that compare responses both against the actual geopolitical situation (accuracy) and across the different query languages (consistency). These metrics allow us to quantify
several findings, which include instruction-tuned LLMs underperforming base
ones, and geopolitical bias being amplified in stronger models. We release our
code and dataset to facilitate future investigation and mitigation of
geopolitical bias.
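To illustrate the kind of metrics described above, the sketch below computes accuracy against the actual controller and pairwise consistency of a model's answers across query languages for a single territory. The exact metric definitions in the paper may differ, and the responses shown are hypothetical.

from itertools import combinations

def accuracy(responses, gold):
    """Fraction of per-language responses matching the actual geopolitical situation."""
    return sum(r == gold for r in responses.values()) / len(responses)

def consistency(responses):
    """Fraction of language pairs whose responses agree with each other."""
    pairs = list(combinations(responses.values(), 2))
    return sum(a == b for a, b in pairs) / len(pairs) if pairs else 1.0

# Hypothetical answers for one disputed territory, queried in English and in
# each claimant country's language.
responses = {"en": "China", "zh": "China", "tl": "Philippines", "vi": "Vietnam"}
print(accuracy(responses, gold="China"))  # 0.5
print(consistency(responses))             # 0.166...  (1 of 6 pairs agree)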