Does the Geometry of Word Embeddings Help Document Classification? A Case Study on Persistent Homology Based Representations
We investigate the pertinence of methods from algebraic topology for text
data analysis. These methods enable the development of
mathematically principled, isometry-invariant mappings from a set of vectors to
a document embedding, which is stable with respect to the geometry of the
document in the selected metric space. In this work, we evaluate the utility of
these topology-based document representations in traditional NLP tasks,
specifically document clustering and sentiment classification. We find that the
embeddings do not benefit text analysis. In fact, performance is worse than
that of simple baseline techniques, indicating that the geometry of the
document does not provide enough variability for classification on the basis
of topic or sentiment in the chosen datasets.
Comment: 5 pages, 3 figures. Rep4NLP workshop at ACL 201
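For concreteness, the sketch below shows one way such a topology-based representation can be computed: the word vectors of a document are treated as a point cloud, its persistence diagrams are computed, and simple summary statistics of the diagrams serve as a fixed-length document embedding. The use of the ripser package and the particular summary statistics are assumptions made for this illustration, not the exact pipeline of the paper.

```python
# Minimal sketch of a persistent-homology document representation.
# Assumes the `ripser` and `numpy` packages; the pooling and summary
# statistics below are illustrative choices, not the paper's exact method.
import numpy as np
from ripser import ripser


def persistence_features(word_vectors: np.ndarray, maxdim: int = 1) -> np.ndarray:
    """Map a document's set of word vectors to a fixed-length topological summary.

    word_vectors: (n_words, dim) array of embeddings for one document.
    Returns, per homology dimension: number of finite bars, mean persistence,
    and max persistence.
    """
    diagrams = ripser(word_vectors, maxdim=maxdim)["dgms"]
    feats = []
    for dgm in diagrams:
        # Drop infinite bars (e.g. the essential H0 class) before summarising.
        finite = dgm[np.isfinite(dgm[:, 1])]
        persistence = finite[:, 1] - finite[:, 0] if len(finite) else np.zeros(1)
        feats.extend([len(finite), persistence.mean(), persistence.max()])
    return np.asarray(feats, dtype=np.float32)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    doc = rng.normal(size=(50, 25))  # stand-in for 50 word embeddings of dim 25
    print(persistence_features(doc))  # a 6-dim vector covering H0 and H1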
Zero-shot Neural Transfer for Cross-lingual Entity Linking
Cross-lingual entity linking maps an entity mention in a source language to
its corresponding entry in a structured knowledge base that is in a different
(target) language. While previous work relies heavily on bilingual lexical
resources to bridge the gap between the source and the target languages, these
resources are scarce or unavailable for many low-resource languages. To address
this problem, we investigate zero-shot cross-lingual entity linking, in which
we assume no bilingual lexical resources are available in the source
low-resource language. Specifically, we propose pivot-based entity linking,
which leverages information from a high-resource "pivot" language to train
character-level neural entity linking models that are transferred to the source
low-resource language in a zero-shot manner. With experiments on 9 low-resource
languages and transfer through a total of 54 languages, we show that our
proposed pivot-based framework improves entity linking accuracy by 17%
(absolute) on average over the baseline systems in the zero-shot scenario.
Further, we investigate the use of language-universal phonological
representations, which improves average accuracy by 36% (absolute) when
transferring between languages that use different scripts.
Comment: To appear in AAAI 201
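The sketch below illustrates the general idea of character-level scoring for entity linking: a shared character (byte) encoder embeds mention strings and knowledge-base entry names, and cosine similarity ranks candidates. Training such an encoder on a high-resource pivot language and applying it unchanged to a low-resource language is the zero-shot transfer step. The architecture, dimensions, and names here are illustrative assumptions, not the paper's exact model.

```python
# Illustrative character-level entity-linking scorer (not the paper's exact
# architecture). All hyper-parameters and names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CharEncoder(nn.Module):
    """Encode a string as the final hidden state of a character-level GRU."""

    def __init__(self, vocab_size: int = 256, emb_dim: int = 64, hid_dim: int = 128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, char_ids: torch.Tensor) -> torch.Tensor:
        _, h = self.gru(self.emb(char_ids))  # h: (1, batch, hid_dim)
        return F.normalize(h.squeeze(0), dim=-1)


def encode(text: str) -> torch.Tensor:
    # Byte-level ids keep the encoder script-agnostic across languages.
    return torch.tensor([list(text.encode("utf-8"))])


encoder = CharEncoder()


def link_score(mention: str, kb_name: str) -> float:
    """Cosine similarity between mention and KB-entry encodings."""
    with torch.no_grad():
        return float((encoder(encode(mention)) * encoder(encode(kb_name))).sum())


# After training `encoder` on pivot-language (mention, entity) pairs, the same
# scoring function is applied unchanged to low-resource-language mentions.
print(link_score("Nairobi", "Nairobi City County"))
```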
OCR Post Correction for Endangered Language Texts
There is little to no data available to build natural language processing
models for most endangered languages. However, textual data in these languages
often exists in formats that are not machine-readable, such as paper books and
scanned images. In this work, we address the task of extracting text from these
resources. We create a benchmark dataset of transcriptions for scanned books in
three critically endangered languages and present a systematic analysis of how
general-purpose OCR tools are not robust to the data-scarce setting of
endangered languages. We develop an OCR post-correction method tailored to ease
training in this data-scarce setting, reducing the recognition error rate by
34% on average across the three languages.
Comment: Accepted to EMNLP 202
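Since the result is reported as a reduction in recognition error rate, the snippet below shows how character error rate (CER), a standard metric for this task, can be computed from scratch; the example OCR output and corrected transcription are invented for illustration.

```python
# Minimal character error rate (CER) computation, the kind of metric behind the
# reported recognition-error reductions. Example strings are invented.
def edit_distance(hyp: str, ref: str) -> int:
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            curr.append(min(prev[j] + 1,              # delete a hypothesis char
                            curr[j - 1] + 1,          # insert a reference char
                            prev[j - 1] + (h != r)))  # substitute / match
        prev = curr
    return prev[-1]


def cer(hyp: str, ref: str) -> float:
    """Edits needed to turn `hyp` into `ref`, per reference character."""
    return edit_distance(hyp, ref) / max(len(ref), 1)


ocr_output = "the quiek brovvn fox"
post_corrected = "the quick brown fox"
print(cer(ocr_output, post_corrected), cer(post_corrected, post_corrected))  # > 0.0, 0.0
```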
CMULAB: An Open-Source Framework for Training and Deployment of Natural Language Processing Models
Effectively using Natural Language Processing (NLP) tools in under-resourced
languages requires a thorough understanding of the language itself, familiarity
with the latest models and training methodologies, and technical expertise to
deploy these models. This can be a significant obstacle for language
community members and linguists who want to use NLP tools. This paper
introduces the CMU
Linguistic Annotation Backend, an open-source framework that simplifies model
deployment and continuous human-in-the-loop fine-tuning of NLP models. CMULAB
enables users to leverage the power of multilingual models to quickly adapt and
extend existing tools for speech recognition, OCR, translation, and syntactic
analysis to new languages, even with limited training data. We describe various
tools and APIs that are currently available and how developers can easily add
new models/functionality to the framework. Code is available at
https://github.com/neulab/cmulab along with a live demo at https://cmulab.dev.
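As a purely hypothetical illustration of the human-in-the-loop workflow described above, the sketch below shows what a client-side annotate-correct-resubmit loop could look like. The base URL, endpoint paths, and payload fields are placeholders invented for this example and are not CMULAB's actual API; see the repository linked above for the real interface.

```python
# Hypothetical client-side sketch of a human-in-the-loop annotation loop.
# The base URL, endpoint paths, and payload fields are invented placeholders,
# not CMULAB's actual API -- see https://github.com/neulab/cmulab for the real one.
import requests

BASE_URL = "http://localhost:8000/api"  # assumed local deployment


def annotate(model: str, text: str) -> dict:
    """Request an automatic annotation (e.g. OCR or syntactic analysis) for `text`."""
    resp = requests.post(f"{BASE_URL}/annotate", json={"model": model, "text": text})
    resp.raise_for_status()
    return resp.json()


def submit_correction(model: str, text: str, corrected: dict) -> None:
    """Send a human-corrected annotation back so the model can be fine-tuned on it."""
    payload = {"model": model, "text": text, "annotation": corrected}
    requests.post(f"{BASE_URL}/corrections", json=payload).raise_for_status()


# Typical loop: predict, let a linguist fix the output, feed the fix back in.
prediction = annotate("ocr-post-correction", "scanned page text here")
prediction["tokens"] = ["human", "edited", "tokens"]  # placeholder manual edit
submit_correction("ocr-post-correction", "scanned page text here", prediction)
```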
Damaged Type and Areopagitica's Clandestine Printers
Milton’s Areopagitica (1644) is one of the most significant texts in the history of the freedom of the press, and yet the pamphlet’s clandestine printers have successfully eluded identification for over 375 years. By examining distinctive and damaged type pieces from 100 pamphlets from the 1640s, this article attributes the printing of Milton’s Areopagitica to the London printers Matthew Simmons and Thomas Paine, with the possible involvement of Gregory Dexter. It further reveals a sophisticated ideological program of clandestine printing executed collaboratively by Paine and Simmons throughout 1644 and 1645 that includes not only Milton’s Areopagitica but also Roger Williams’s The Bloudy Tenent of Persecution, William Walwyn’s The Compassionate Samaritane, Henry Robinson’s Liberty of Conscience, Robinson’s John the Baptist, and Milton’s Of Education, Tetrachordon, and Colasterion.