Evaluating Multilingual Gisting of Web Pages
We describe a prototype system for multilingual gisting of Web pages, and
present an evaluation methodology based on the notion of gisting as decision
support. This evaluation paradigm is straightforward and rigorous, permits fair
comparison of alternative approaches, and should generalize readily to
evaluation in other situations where the user faces decision-making on
the basis of information in restricted or alternative form.
Comment: 7 pages, uses psfig and aaai style
Parallel Strands: A Preliminary Investigation into Mining the Web for Bilingual Text
Parallel corpora are a valuable resource for machine translation, but at
present their availability and utility are limited by genre- and
domain-specificity, licensing restrictions, and the basic difficulty of
locating parallel texts in all but the most dominant of the world's languages.
A parallel corpus resource not yet explored is the World Wide Web, which hosts
an abundance of pages in parallel translation, offering a potential solution to
some of these problems and unique opportunities of its own. This paper presents
the necessary first step in that exploration: a method for automatically
finding parallel translated documents on the Web. The technique is conceptually
simple, fully language independent, and scalable, and preliminary evaluation
results indicate that the method may be accurate enough to apply without human
intervention.
Comment: LaTeX2e, 11 pages, 7 eps figures; uses psfig, llncs.cls, theapa.sty.
An Appendix at http://umiacs.umd.edu/~resnik/amta98/amta98_appendix.html
contains test data.
Words, Subwords, and Morphemes: What Really Matters in the Surprisal-Reading Time Relationship?
An important assumption that comes with using LLMs on psycholinguistic data
has gone unverified. LLM-based predictions are based on subword tokenization,
not decomposition of words into morphemes. Does that matter? We carefully test
this by comparing surprisal estimates using orthographic, morphological, and
BPE tokenization against reading time data. Our results replicate previous
findings and provide evidence that in the aggregate, predictions using BPE
tokenization do not suffer relative to morphological and orthographic
segmentation. However, a finer-grained analysis points to potential issues with
relying on BPE-based tokenization, while also yielding promising results
for morphologically aware surprisal estimates and suggesting a new method
for evaluating morphological prediction.
Comment: Accepted to Findings of EMNLP 2023; 10 pages, 5 figures
Natural Language Decompositions of Implicit Content Enable Better Text Representations
When people interpret text, they rely on inferences that go beyond the
observed language itself. Inspired by this observation, we introduce a method
for the analysis of text that takes implicitly communicated content explicitly
into account. We use a large language model to produce sets of propositions
that are inferentially related to the text that has been observed, then
validate the plausibility of the generated content via human judgments.
Incorporating these explicit representations of implicit content proves useful
in multiple problem settings that involve the human interpretation of
utterances: assessing the similarity of arguments, making sense of a body of
opinion data, and modeling legislative behavior. Our results suggest that
modeling the meanings behind observed language, rather than the literal text
alone, is a valuable direction for NLP and particularly its applications to
social science.
Comment: Accepted to EMNLP 2023 (Main conference)
Mainstream News Articles Co-Shared with Fake News Buttress Misinformation Narratives
Most prior and current research examining misinformation spread on social
media focuses on reports published by 'fake' news sources. These approaches
fail to capture another potential form of misinformation with a much larger
audience: factual news from mainstream sources ('real' news) repurposed to
promote false or misleading narratives. We operationalize narratives using an
existing unsupervised NLP technique and examine the narratives present in
misinformation content. We find that certain articles from reliable outlets are
shared by a disproportionate number of users who also shared fake news on
Twitter. We consider these 'real' news articles to be co-shared with fake news.
We show that co-shared articles contain existing misinformation narratives at a
significantly higher rate than articles from the same reliable outlets that are
not co-shared with fake news. This holds true even when articles are chosen
following strict criteria of reliability for the outlets and after accounting
for the alternative explanation of partisan curation of articles. For example,
we observe that a recent article published by The Washington Post titled
"Vaccinated people now make up a majority of COVID deaths" was
disproportionately shared by Twitter users with a history of sharing
anti-vaccine false news reports. Our findings suggest a strategic repurposing
of mainstream news by conveyors of misinformation as a way to enhance the reach
and persuasiveness of misleading narratives. We also conduct a comprehensive
case study to help highlight how such repurposing can happen on Twitter as a
consequence of the inclusion of particular narratives in the framing of
mainstream news.