
    Building a Morphosyntactic Lexicon and a Pre-syntactic Processing Chain for Polish

    This paper introduces a new set of tools and resources for Polish which cover all the steps required to transform raw, unrestricted text into a reasonable input for a parser. This includes (1) a large-coverage morphological lexicon, developed thanks to the IPI PAN corpus as well as a lexical acquisition technique, and (2) multiple tools for spelling correction, segmentation, tokenization and named entity recognition. This processing chain is also able to deal with the XCES format both as input and output, which makes it possible to improve XCES corpora such as the IPI PAN corpus itself. This allows us to give a brief qualitative evaluation of the lexicon and of the processing chain.

    External Lexical Information for Multilingual Part-of-Speech Tagging

    Morphosyntactic lexicons and word vector representations have both proven useful for improving the accuracy of statistical part-of-speech taggers. Here we compare the performance of four systems on datasets covering 16 languages, two of these systems being feature-based (MEMMs and CRFs) and two of them being neural-based (bi-LSTMs). We show that, on average, all four approaches perform similarly and reach state-of-the-art results. Yet better performance is obtained with our feature-based models on lexically richer datasets (e.g. for morphologically rich languages), whereas neural-based results are higher on datasets with less lexical variability (e.g. for English). These conclusions hold in particular for the MEMM models relying on our system MElt, which benefited from newly designed features. This shows that, under certain conditions, feature-based approaches enriched with morphosyntactic lexicons are competitive with neural methods.

    Words and Subwords: Phonology in a Piece-Based Syntactic Morphology

    The goal of this dissertation is to take generalizations made in a variety of phonological and morphological theories and account for them in a piece-based syntactic theory of morphology. The theories discussed are Cyclic phonology, Lexical Phonology (and Stratal Optimality Theory), Prosodic Hierarchy Theories, and Syntactic Spell-Out Only theories. Phonological and morphological generalizations from these theories include the cyclic/non-cyclic distinction of phonological blocks and morphemes, "grammatical" words and phonological words (their equivalence and apparent mismatches), incorporation of clitics into word level phonology, morpheme-sensitive phonological processes, and the relationship between syntactic spell-out phases and phonological domains. I present a framework within the theory of Distributed Morphology (Halle and Marantz 1993, et seq.) in which I account for these generalizations in several ways. I relate as much phonological structure to morphosyntactic structure as possible. However, there are several phonological phenomena which cannot be accounted for by syntactic structure alone. To account for these phenomena, I propose that the syntax feeds information in chunks to PF (cyclic spell-out) but that the morphology and phonology may operate on that information, creating mismatches between syntactic structure and phonological domains. For the cyclic/non-cyclic distinction of phonology, there are mismatches between syntactic spell-out domains and phonological interactions at the subword level. I propose a "phonocyclic buffer" into which phonologically cyclic exponents are added and over which the cyclic phonology is calculated. This is illustrated with data from yer lowering and yer deletion in Slovak and Polish, English stress and derivational affixes, and Spanish depalatalization.
For the relationship between "grammatical" words and phonological/prosodic words, I propose an interface function relating morphosyntactic words (M-Words; non-minimal complex heads of the syntax) and phonological words. The basic relationship is illustrated with data from English voicing assimilation and German devoicing. I argue against two types of apparent mismatches between M-Words and phonological words, such as those proposed for Japanese "Aoyagi" prefixes, Vietnamese interleaving word order, Plains Cree polysynthetic verbs, and Spanish compounds. I find that some of these apparent mismatches can be handled elsewhere in the phonological system, while others are examples of complex syntactic structure (but not of mismatches between syntactic and phonological structure). I also present an operation which can create phonological words out of non-M-Word configurations, dubbed Stray Terminal Grouping. This is illustrated with data from Bilua, Standard English, and African American Vernacular English. Regarding the behavior of clitics (independent syntactic pieces which are phonologically dependent on a host), I find that their behavior is not predetermined or memorized, but is dependent on the morphosyntactic context in which they are derived. I show cases from Turkish, Maltese, and Makassarese in which morphemes variably behave like clitics or affixes depending on their context. I argue that this variable behavior may be determined either by syntactic or morphological operations. Finally, I investigate two types of morpheme-sensitive phonological processes, morphophonological rules and morpheme/morpheme readjustments, illustrated with data from Slavic derived imperfect raising, German umlaut, and Kashaya decrement and palatalization. I argue that these processes are underlyingly phonological in nature, but are activated by morphological diacritics.
This activation can happen during two different stages of linearization: morpheme/morpheme readjustments occur at the level of subword concatenation, while morphophonological rules occur at the level of subword chaining. This division accounts for the difference in locality conditions between the two types of processes. The conclusion of this dissertation is that we can account for these phonological generalizations in a piece-based syntactic framework, but not by syntax alone. Rather, syntactic, morphological, and phonological operations must combine to create the phonological output.

    Proceedings

    Proceedings of the Workshop on Annotation and Exploitation of Parallel Corpora AEPC 2010. Editors: Lars Ahrenberg, Jörg Tiedemann and Martin Volk. NEALT Proceedings Series, Vol. 10 (2010), 98 pages. © 2010 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/15893

    Implementing a formal model of inflectional morphology

    Inflectional morphology as a research topic lies at the crossroads of many linguistic subfields, such as linguistic description, linguistic typology, formal linguistics and computational linguistics. However, each subfield tackles the subject with its own objectives and approaches. In this paper, we describe the implementation of a formal model of inflectional morphology that captures typological generalisations and aims to combine the efforts made in each subfield, giving each of them access to valuable methods and/or data that would otherwise have been out of reach. We show that language description and studies in formal morphology and linguistic typology on the one hand, and NLP tool and resource development on the other, benefit from the availability of such a model and an implementation thereof.

    Building a morphological and syntactic lexicon by merging various linguistic resources

    Proceedings of the 17th Nordic Conference of Computational Linguistics NODALIDA 2009. Editors: Kristiina Jokinen and Eckhard Bick. NEALT Proceedings Series, Vol. 4 (2009), 126-133. © 2009 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/9206

    Design of a Controlled Language for Critical Infrastructures Protection

    We describe a project for the construction of a controlled language for critical infrastructures protection (CIP). This project originates from the need to coordinate and categorize the communications on CIP at the European level. These communications can take the form of official documents, reports on incidents, informal communications and plain e-mail. We explore the application of traditional library science tools for the construction of controlled languages in order to achieve our goal. Our starting point is an analogous work done during the sixties in the field of nuclear science, known as the Euratom Thesaurus. JRC.G.6-Security technology assessmen

    The Lefff, a freely available and large-coverage morphological and syntactic lexicon for French

    In this paper, we introduce the Lefff, a freely available, accurate and large-coverage morphological and syntactic lexicon for French, used in many NLP tools such as large-coverage parsers. We first describe Alexina, the lexical framework in which the Lefff is developed as well as the linguistic notions and formalisms it is based on. Next, we describe the various sources of lexical data we used for building the Lefff, in particular semi-automatic lexical development techniques and conversion and merging of existing resources. Finally, we illustrate the coverage and precision of the resource by comparing it with other resources and by assessing its impact in various NLP tools.

    Enriching Morphological Lexica through Unsupervised Derivational Rule Acquisition

    WoLeR 2011 is endorsed by FlaReNet, and supported by the Alpage team and the EDyLex French national grant (ANR-09-CORD-008). In a morphological lexicon, each entry combines a lemma with a specific inflection class, often defined by a set of inflection rules. Therefore, such lexica usually give a satisfying account of inflectional operations. Derivational information, however, is usually poorly covered. In this paper we introduce a novel approach for enriching morphological lexica with derivational links between entries and with new entries derived from existing ones and attested in large-scale corpora, without relying on prior knowledge of possible derivational processes. To achieve this goal, we adapt the unsupervised morphological rule acquisition tool MorphAcq (Nicolas et al., 2010) so that it can take into account an existing morphological lexicon developed in the Alexina framework (Sagot, 2010), such as the Lefff for French and the Leffe for Spanish. We apply this tool to large corpora, thus uncovering morphological rules that model derivational operations in these two lexica. We use these rules for generating derivation links between existing entries, as well as for deriving new entries from existing ones and adding those which are best attested in a large corpus. In addition to lexicon development and NLP applications that benefit from rich lexical data, such derivational information will be particularly valuable to linguists who rely on vast amounts of data to describe and analyse these specific morphological phenomena.