20 research outputs found

    Does the Geometry of Word Embeddings Help Document Classification? A Case Study on Persistent Homology Based Representations

    We investigate the pertinence of methods from algebraic topology for text data analysis. These methods enable the development of mathematically principled, isometric-invariant mappings from a set of vectors to a document embedding, which is stable with respect to the geometry of the document in the selected metric space. In this work, we evaluate the utility of these topology-based document representations in traditional NLP tasks, specifically document clustering and sentiment classification. We find that the embeddings do not benefit text analysis. In fact, performance is worse than simple techniques like tf-idf, indicating that the geometry of the document does not provide enough variability for classification on the basis of topic or sentiment in the chosen datasets.
    Comment: 5 pages, 3 figures. Rep4NLP workshop at ACL 201
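
    The topology-based representations evaluated above can be prototyped with off-the-shelf persistent homology tools. The sketch below is illustrative rather than the authors' pipeline: it assumes pre-trained word vectors are supplied as a dict `word_vectors` mapping tokens to NumPy arrays, and it uses the ripser package to build persistence diagrams whose summary statistics serve as the document embedding.

        # Minimal sketch, not the paper's exact method: a document embedding built
        # from persistence diagrams over the document's word vectors, assuming
        # `word_vectors` maps tokens to NumPy arrays.
        import numpy as np
        from ripser import ripser

        def persistence_features(doc_tokens, word_vectors, maxdim=1):
            """Summarise a document by statistics of its persistence diagrams."""
            points = np.array([word_vectors[t] for t in doc_tokens if t in word_vectors])
            diagrams = ripser(points, maxdim=maxdim)["dgms"]
            features = []
            for dgm in diagrams:
                finite = dgm[np.isfinite(dgm[:, 1])]      # drop the infinite H0 bar
                lifetimes = finite[:, 1] - finite[:, 0] if len(finite) else np.zeros(1)
                features.extend([lifetimes.sum(), lifetimes.max(), float(len(finite))])
            return np.array(features)

    The tf-idf baseline that outperforms these embeddings in the study corresponds to a standard bag-of-words vectoriser (e.g. scikit-learn's TfidfVectorizer) applied to the raw documents.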

    Zero-shot Neural Transfer for Cross-lingual Entity Linking

    Cross-lingual entity linking maps an entity mention in a source language to its corresponding entry in a structured knowledge base that is in a different (target) language. While previous work relies heavily on bilingual lexical resources to bridge the gap between the source and the target languages, these resources are scarce or unavailable for many low-resource languages. To address this problem, we investigate zero-shot cross-lingual entity linking, in which we assume no bilingual lexical resources are available in the source low-resource language. Specifically, we propose pivot-based entity linking, which leverages information from a high-resource "pivot" language to train character-level neural entity linking models that are transferred to the source low-resource language in a zero-shot manner. With experiments on 9 low-resource languages and transfer through a total of 54 languages, we show that our proposed pivot-based framework improves entity linking accuracy by 17% (absolute) on average over the baseline systems in the zero-shot scenario. Further, we investigate the use of language-universal phonological representations, which improves average accuracy by 36% (absolute) when transferring between languages that use different scripts.
    Comment: To appear in AAAI 201
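
    As a rough illustration of the character-level linking approach described above (not the authors' released model), the sketch below encodes mention and candidate entity names with a shared character-level GRU and ranks candidates by cosine similarity; an encoder trained on a high-resource pivot language can then be applied unchanged, zero-shot, to mentions in a related low-resource language. The byte-level vocabulary and model sizes here are illustrative assumptions.

        # Minimal character-level entity linking sketch (illustrative, not the
        # paper's model): strings are encoded at the byte level by a shared GRU.
        import torch
        import torch.nn as nn

        class CharEncoder(nn.Module):
            def __init__(self, vocab_size=256, dim=128):
                super().__init__()
                self.embed = nn.Embedding(vocab_size, dim)
                self.rnn = nn.GRU(dim, dim, batch_first=True)

            def forward(self, char_ids):
                # char_ids: (batch, length) byte values of a mention or entity name.
                _, h = self.rnn(self.embed(char_ids))
                return h[-1]                      # (batch, dim) string representation

        def link(mention, candidate_names, encoder):
            """Rank knowledge-base candidates by similarity to the mention string."""
            def encode(s):
                return encoder(torch.tensor([list(s.encode("utf-8"))]))
            m = encode(mention)
            scores = [(torch.cosine_similarity(m, encode(c)).item(), c) for c in candidate_names]
            return max(scores)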

    OCR Post Correction for Endangered Language Texts

    There is little to no data available to build natural language processing models for most endangered languages. However, textual data in these languages often exists in formats that are not machine-readable, such as paper books and scanned images. In this work, we address the task of extracting text from these resources. We create a benchmark dataset of transcriptions for scanned books in three critically endangered languages and present a systematic analysis of how general-purpose OCR tools are not robust to the data-scarce setting of endangered languages. We develop an OCR post-correction method tailored to ease training in this data-scarce setting, reducing the recognition error rate by 34% on average across the three languages.
    Comment: Accepted to EMNLP 202
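
    The improvement above is reported as a reduction in recognition error rate; character error rate (CER) is the standard metric of that kind for OCR output. The sketch below is a self-contained CER implementation (plain dynamic programming, no external libraries) for scoring post-corrected text against a gold transcription; the example strings are invented.

        # Character error rate: Levenshtein distance between hypothesis and
        # reference, normalised by reference length. (Standard metric, not code
        # from the paper.)
        def character_error_rate(reference: str, hypothesis: str) -> float:
            prev = list(range(len(hypothesis) + 1))
            for i, r in enumerate(reference, start=1):
                curr = [i]
                for j, h in enumerate(hypothesis, start=1):
                    curr.append(min(prev[j] + 1,              # deletion
                                    curr[j - 1] + 1,          # insertion
                                    prev[j - 1] + (r != h)))  # substitution
                prev = curr
            return prev[-1] / max(len(reference), 1)

        # e.g. character_error_rate("nuestra lengua", "nuestra lenqua") ≈ 0.07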

    CMULAB: An Open-Source Framework for Training and Deployment of Natural Language Processing Models

    Effectively using Natural Language Processing (NLP) tools in under-resourced languages requires a thorough understanding of the language itself, familiarity with the latest models and training methodologies, and technical expertise to deploy these models. This can present a significant obstacle for language community members and linguists who want to use NLP tools. This paper introduces the CMU Linguistic Annotation Backend (CMULAB), an open-source framework that simplifies model deployment and continuous human-in-the-loop fine-tuning of NLP models. CMULAB enables users to leverage the power of multilingual models to quickly adapt and extend existing tools for speech recognition, OCR, translation, and syntactic analysis to new languages, even with limited training data. We describe the various tools and APIs that are currently available and how developers can easily add new models and functionality to the framework. Code is available at https://github.com/neulab/cmulab, along with a live demo at https://cmulab.dev.
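
    The abstract does not spell out CMULAB's API surface, so the snippet below is a purely hypothetical illustration of how a human-in-the-loop annotation backend of this kind is typically driven over HTTP; the endpoint, field names, and port are invented for the example, and the repository at https://github.com/neulab/cmulab documents the real interface.

        # Hypothetical client call; the /annotate endpoint and JSON fields are
        # invented for illustration and are not CMULAB's actual API.
        import requests

        BACKEND = "http://localhost:8000"   # assumed local deployment URL

        def request_annotation(text: str, task: str = "syntactic-analysis") -> dict:
            response = requests.post(f"{BACKEND}/annotate",
                                     json={"task": task, "text": text},
                                     timeout=30)
            response.raise_for_status()
            return response.json()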

    Damaged Type and Areopagitica's Clandestine Printers

    Milton’s Areopagitica (1644) is one of the most significant texts in the history of the freedom of the press, and yet the pamphlet’s clandestine printers have successfully eluded identification for over 375 years. By examining distinctive and damaged type pieces from 100 pamphlets from the 1640s, this article attributes the printing of Milton’s Areopagitica to the London printers Matthew Simmons and Thomas Paine, with the possible involvement of Gregory Dexter. It further reveals a sophisticated ideological program of clandestine printing executed collaboratively by Paine and Simmons throughout 1644 and 1645 that includes not only Milton’s Areopagitica but also Roger Williams’s The Bloudy Tenent of Persecution, William Walwyn’s The Compassionate Samaritane, Henry Robinson’s Liberty of Conscience, Robinson’s John the Baptist, and Milton’s Of Education, Tetrachordon, and Colasterion.