35 research outputs found

Challenges of Cheap Resource Creation for Morphological Tagging

Author: Feldman Anna
Hana Jirka
Publication venue: Montclair State University Digital Commons
Publication date: 01/07/2010

We describe the challenges of resource creation for a resource-light system for morphological tagging of fusional languages (Feldman and Hana, 2010). The constraints on resources (time, expertise, and money) introduce challenges that are not present in the development of morphological tools and corpora in the usual, resource-intensive way.

Montclair State University Digital Commons

Mediating between Incompatible Tagsets

Author: Rosen Alexandr
Publication date: 30/11/2010

Proceedings of the Workshop on Annotation and Exploitation of Parallel Corpora AEPC 2010. Editors: Lars Ahrenberg, Jörg Tiedemann and Martin Volk. NEALT Proceedings Series, Vol. 10 (2010), 53-62. © 2010 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/15893

DSpace at Tartu University Library

Towards English-to-Czech MT via Tectogrammatical Layer

Author: Bojar Ondřej
Cinková Silvie
Ptáček Jan
Publication date: 29/11/2007

Proceedings of the Sixth International Workshop on Treebanks and Linguistic Theories. Editors: Koenraad De Smedt, Jan Hajič and Sandra Kübler. NEALT Proceedings Series, Vol. 1 (2007), 7-18. © 2007 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/4476

DSpace at Tartu University Library

Cross-lingual Coreference Resolution of Pronouns

Author: Novák Michal
Žabokrtský Zdeněk
Publication date: 01/01/2014

This work is, to our knowledge, a first attempt at a machine learning approach to cross-lingual coreference resolution, i.e. coreference resolution (CR) performed on a bitext. Focusing on CR of English pronouns, we leverage language differences and enrich the feature set of a standard monolingual CR system for English with features extracted from the Czech side of the bitext. Our work also includes a supervised pronoun aligner that outperforms a GIZA++ baseline in terms of both intrinsic evaluation and evaluation on CR. The final cross-lingual CR system outperformed both a monolingual CR system and a cross-lingual projection system.

Biblio at Institute of Formal and Applied Linguistics

THE MORPHOLOGICALLY ANNOTATED LITHUANIAN CORPUS

Author: Erika Rimkut
Vidas Daudaravi
Vytautas Zinkevi
Publication date: 24/04/2020

The paper deals with the preliminary findings from the morphologically annotated corpus of the Lithuanian language (1 million running words). It was compiled and processed at the Center of Computational Linguistics, Vytautas Magnus University. Each annotation for an inflected word form of the corpus contains a lemma and a set of morphological features. The paper presents the strategy for automatic and manual annotation. Automatic annotation was carried out with the help of an analyser-lemmatiser. Disambiguation of the homoforms was performed manually. Tag sets and the most prominent features of Lithuanian morphology are discussed in detail. The annotated corpus allowed us to measure the usage of parts of speech and their morphological features in the contemporary Lithuanian language. The annotated corpus is of great importance for the future development of parsing tools, treebanks and other NLP tools and resources for the Lithuanian language.

CiteSeerX

Combining Manual and Automatic Annotation of a Learner Corpus

Author: A. Díaz-Negrillo
D. Spoustová
J. Hajič
J. Nouza
S. Granger
T. Jelínek
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2012

Crossref

Rule-restricted Automaton-grammar transducers: Power and Linguistic Applications

Author: Horáček Petr
Meduna Alexander
Čermák Martin
Publication venue: Brno University of Technology
Publication date: 01/01/2012

This paper introduces the notion of a new transducer as a two-component system, which consists of a finite automaton and a context-free grammar. In essence, while the automaton reads its input string, the grammar produces its output string, and their cooperation is controlled by a set, which restricts the usage of their rules. From a theoretical viewpoint, the present paper discusses the power of this system working in an ordinary way as well as in a leftmost way. In addition, the paper introduces appearance checking, which allows us to check whether some symbols are present in the rewritten string, and studies its effect on the power. It achieves the following three main results. First, the system generates and accepts languages defined by matrix grammars and partially blind multi-counter automata, respectively. Second, if we place a leftmost restriction on derivation in the context-free grammar, both the accepting and the generating power of the system are equal to the generative power of context-free grammars. Third, the system with appearance checking can accept and generate all recursively enumerable languages. From a more pragmatic viewpoint, this paper describes several linguistic applications. Special attention is paid to Japanese-Czech translation.

Digital library of Brno University of Technology

Construction d'un wordnet libre du français à partir de ressources multilingues

Author: Fišer Darja
Sagot Benoît
Publication venue: HAL CCSD
Publication date: 09/06/2008

This article describes the construction of a free French wordnet, the Wordnet Libre du Français (WOLF), from the Princeton WordNet and various multilingual resources. Polysemous lexemes were handled with an approach based on word alignment of a parallel corpus in five languages. The extracted multilingual lexicon was semantically disambiguated using the wordnets of the languages involved. In addition, a bilingual approach was sufficient to build new entries from monosemous lexemes. For this purpose, we extracted bilingual lexicons from Wikipedia and thesauri. The resulting wordnet was evaluated against the French wordnet from the EuroWordNet project. The results are encouraging, and applications are already being planned.

INRIA a CCSD electronic archive server

Hal-Diderot