3,745 research outputs found

    Cross-lingual Word Clusters for Direct Transfer of Linguistic Structure

    Get PDF
    It has been established that incorporating word cluster features derived from large unlabeled corpora can significantly improve prediction of linguistic structure. While previous work has focused primarily on English, we extend these results to other languages along two dimensions. First, we show that these results hold true for a number of languages across families. Second, and more interestingly, we provide an algorithm for inducing cross-lingual clusters and we show that features derived from these clusters significantly improve the accuracy of cross-lingual structure prediction. Specifically, we show that by augmenting direct-transfer systems with cross-lingual cluster features, the relative error of delexicalized dependency parsers, trained on English treebanks and transferred to foreign languages, can be reduced by up to 13%. When applying the same method to direct transfer of named-entity recognizers, we observe relative improvements of up to 26%

    Universal Dependencies for Learner English

    Get PDF
    We introduce the Treebank of Learner English (TLE), the first publicly available syntactic treebank for English as a Second Language (ESL). The TLE provides manually annotated POS tags and Universal Dependency (UD) trees for 5,124 sentences from the Cambridge First Certificate in English (FCE) corpus. The UD annotations are tied to a pre-existing error annotation of the FCE, whereby full syntactic analyses are provided for both the original and error corrected versions of each sentence. Further on, we delineate ESL annotation guidelines that allow for consistent syntactic treatment of ungrammatical English. Finally, we benchmark POS tagging and dependency parsing performance on the TLE dataset and measure the effect of grammatical errors on parsing accuracy. We envision the treebank to support a wide range of linguistic and computational research o n second language acquisition as well as automatic processing of ungrammatical language.This work was supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF – 1231216

    Deep Learning: Our Miraculous Year 1990-1991

    Full text link
    In 2020, we will celebrate that many of the basic ideas behind the deep learning revolution were published three decades ago within fewer than 12 months in our "Annus Mirabilis" or "Miraculous Year" 1990-1991 at TU Munich. Back then, few people were interested, but a quarter century later, neural networks based on these ideas were on over 3 billion devices such as smartphones, and used many billions of times per day, consuming a significant fraction of the world's compute.Comment: 37 pages, 188 references, based on work of 4 Oct 201

    VALICO-UD: annotating an Italian learner corpus

    Get PDF
    Previous work on learner language has highlighted the importance of having annotated resources to describe the development of interlanguage. Despite this, few learner resources, mainly for English L2, feature error and syntactic annotation. This thesis describes the development of a novel parallel learner Italian treebank, VALICO-UD. Its name suggests two main points: where the data comes from—i.e. the corpus VALICO, a collection of non-native Italian texts elicited by comic strips—and what formalism is used for linguistic annotation—i.e. Universal Dependencies (UD) formalism. It is a parallel treebank because the resource provides for each learner sentence (LS) a target hypothesis (TH) (i.e., parallel corrected version written by an Italian native speaker) which is in turn annotated in UD. We developed this treebank to be exploitable for interlanguage research and comparable with the resources employed in Natural Language Processing tasks such as Native Language Identification or Grammatical Error Identification and Correction. VALICO-UD is composed of 237 texts written by English, French, German and Spanish native speakers, which correspond to 2,234 LSs, each associated with a single TH. While all LSs and THs were automatically annotated using UDPipe, only a portion of the treebank made of 398 LSs plus correspondent THs has been manually corrected and released in May 2021 in the UD repository. This core section features also an explicit XML-based annotation of the errors occurring in each sentence. Thus, the treebank is currently organized in two sections: the core gold standard—comprising 398 LSs and their correspondent THs—and the silver standard—consisting of 1,836 LSs and their correspondent THs. In order to contribute to the computational investigation about the peculiar type of texts included in VALICO-UD, this thesis describes the annotation schema of the resource, provides some preliminary tests about the performance of UDPipe models on this treebank, reports on inter-annotator agreement results for both error and linguistic annotation, and suggests some possible applications

    How Universal is Genre in Universal Dependencies?

    Get PDF
    This work provides the first in-depth analysis of genre in Universal Dependencies (UD). In contrast to prior work on genre identification which uses small sets of well-defined labels in mono-/bilingual setups, UD contains 18 genres with varying degrees of specificity spread across 114 languages. As most treebanks are labeled with multiple genres while lacking annotations about which instances belong to which genre, we propose four methods for predicting instance-level genre using weak supervision from treebank metadata. The proposed methods recover instance-level genre better than competitive baselines as measured on a subset of UD with labeled instances and adhere better to the global expected distribution. Our analysis sheds light on prior work using UD genre metadata for treebank selection, finding that metadata alone are a noisy signal and must be disentangled within treebanks before it can be universally applied.Comment: Accepted at SyntaxFest 202

    AAA4LLL - Acquisition, Annotation, Augmentation for Lively Language Learning

    Get PDF
    In this paper we describe a method for enhancing the process of studying Japanese by a user-centered approach. This approach includes three parts: an innovative way of acquiring learning material from topic seeds, multifaceted sentence analysis to present sentence annotations, and the browser-integrated augmentation of perusing Wikipedia pages of special interest for the learner. This may result in new topic seeds to yield additional learning content, thus repeating the cycle
    • …
    corecore