35 research outputs found

    Challenges of Cheap Resource Creation for Morphological Tagging

    We describe the challenges of resource creation for a resource-light system for morphological tagging of fusional languages (Feldman and Hana, 2010). The constraints on resources (time, expertise, and money) introduce challenges that are not present in the development of morphological tools and corpora in the usual, resource-intensive way.

    Mediating between Incompatible Tagsets

    Proceedings of the Workshop on Annotation and Exploitation of Parallel Corpora AEPC 2010. Editors: Lars Ahrenberg, Jörg Tiedemann and Martin Volk. NEALT Proceedings Series, Vol. 10 (2010), 53-62. © 2010 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/15893

    Towards English-to-Czech MT via Tectogrammatical Layer

    Proceedings of the Sixth International Workshop on Treebanks and Linguistic Theories. Editors: Koenraad De Smedt, Jan Hajič and Sandra Kübler. NEALT Proceedings Series, Vol. 1 (2007), 7-18. © 2007 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/4476

    Cross-lingual Coreference Resolution of Pronouns

    This work is, to our knowledge, a first attempt at a machine learning approach to cross-lingual coreference resolution, i.e., coreference resolution (CR) performed on a bitext. Focusing on CR of English pronouns, we leverage language differences and enrich the feature set of a standard monolingual CR system for English with features extracted from the Czech side of the bitext. Our work also includes a supervised pronoun aligner that outperforms a GIZA++ baseline in terms of both intrinsic evaluation and evaluation on CR. The final cross-lingual CR system outperforms both a monolingual CR system and a cross-lingual projection system.


    The paper deals with the preliminary findings from the morphologically annotated corpus of the Lithuanian language (1 million running words). It was compiled and processed at the Center of Computational Linguistics, Vytautas Magnus University. Each annotation for an inflected word form of the corpus contains a lemma and a set of morphological features. The paper presents the strategy for automatic and manual annotation. Automatic annotation was carried out with the help of an analyser-lemmatiser. Disambiguation of the homoforms was performed manually. Tag sets and the most prominent features of Lithuanian morphology are discussed in detail. The annotated corpus allowed us to measure the usage of parts of speech and their morphological features in contemporary Lithuanian. The annotated corpus is of great importance for the future development of parsing tools, treebanks and other NLP tools and resources for the Lithuanian language.

    Combining Manual and Automatic Annotation of a Learner Corpus


    Rule-restricted Automaton-grammar transducers: Power and Linguistic Applications

    This paper introduces the notion of a new transducer as a two-component system, which consists of a finite automaton and a context-free grammar. In essence, while the automaton reads its input string, the grammar produces its output string, and their cooperation is controlled by a set that restricts the usage of their rules. From a theoretical viewpoint, the paper discusses the power of this system working in an ordinary way as well as in a leftmost way. In addition, the paper introduces appearance checking, which allows us to check whether some symbols are present in the rewritten string, and studies its effect on the power. It achieves the following three main results. First, the system generates and accepts languages defined by matrix grammars and partially blind multi-counter automata, respectively. Second, if we place a leftmost restriction on derivation in the context-free grammar, both the accepting and the generating power of the system are equal to the generative power of context-free grammars. Third, the system with appearance checking can accept and generate all recursively enumerable languages. From a more pragmatic viewpoint, this paper describes several linguistic applications. Special attention is paid to Japanese-Czech translation.

    Construction d'un wordnet libre du français à partir de ressources multilingues

    This article describes the construction of a free French wordnet (Wordnet Libre du Français, WOLF) from the Princeton WordNet and various multilingual resources. Polysemous lexemes were handled with an approach based on word alignment of a parallel corpus in five languages. The extracted multilingual lexicon was semantically disambiguated using the wordnets of the languages concerned. In addition, a bilingual approach was sufficient to build new entries from monosemous lexemes. For this purpose, we extracted bilingual lexicons from Wikipedia and thesauri. The resulting wordnet was evaluated against the French wordnet from the EuroWordNet project. The results are encouraging, and applications are already being considered.