453 research outputs found
Building a morphological and syntactic lexicon by merging various linguistic resources
Proceedings of the 17th Nordic Conference of Computational Linguistics
NODALIDA 2009.
Editors: Kristiina Jokinen and Eckhard Bick.
NEALT Proceedings Series, Vol. 4 (2009), 126-133.
© 2009 The editors and contributors.
Published by
Northern European Association for Language
Technology (NEALT)
http://omilia.uio.no/nealt .
Electronically published at
Tartu University Library (Estonia)
http://hdl.handle.net/10062/9206
DeLex, a freely-avaible, large-scale and linguistically grounded morphological lexicon for German
International audienceWe introduce DeLex, a freely-avaible, large-scale and linguistically grounded morphological lexicon for German developed within the Alexina framework. We extracted lexical information from the German wiktionary and developed a morphological inflection grammar for German, based on a linguistically sound model of inflectional morphology. Although the developement of DeLex involved some manual work, we show that is represents a good tradeoff between development cost, lexical coverage and resource accuracy
A multilingual collection of CoNLL-U-compatible morphological lexicons
International audienceWe introduce UDLexicons, a multilingual collection of morphological lexicons that follow the guidelines and format of the Universal Dependencies initiative. We describe the three approaches we use to create 53 morphological lexicons covering 38 languages, based on existing resources. These lexicons, which are freely available, have already proven useful for improving part-of-speech tagging accuracy in state-of-the-art architectures
Enriching Morphological Lexica through Unsupervised Derivational Rule Acquisition
WoLeR 2011 is endorsed by FlaReNet, and supported by the Alpage team and the EDyLex French national grant (ANR-09-CORD-008).International audienceIn a morphological lexicon, each entry combines a lemma with a specific inflection class, often defined by a set of inflection rules. Therefore, such lexica usually give a satisfying account of inflectional operations. Derivational information, however, is usually badly covered. In this paper we introduce a novel approach for enriching morphological lexica with derivational links between entries and with new entries derived from existing ones and attested in large-scale corpora, without relying on prior knowledge of possible derivational processes. To achieve this goal, we adapt the unsupervised morphological rule acquisition tool MorphAcq (Nicolas et al., 2010) in a way allowing it to take into account an existing morphological lexicon developed in the Alexina framework (Sagot, 2010), such as the Lefff for French and the Leffe for Spanish. We apply this tool on large corpora, thus uncovering morphological rules that model derivational operations in these two lexica. We use these rules for generating derivation links between existing entries, as well as for deriving new entries from existing ones and adding those which are best attested in a large corpus. In addition to lexicon development and NLP applications that benefit from rich lexical data, such derivational information will be particularly valuable to linguists who rely on vast amounts of data to describe and analyse these specific morphological phenomena
Coupling an Annotated Corpus and a Morphosyntactic Lexicon for State-of-the-Art POS Tagging with Less Human Effort
PACLIC 23 / City University of Hong Kong / 3-5 December 200
Implementing a formal model of inflectional morphology
International audienceInflectional morphology as a research topic lies on the crossroads of many linguistic subfields, such as linguistic description, linguistic typology, formal linguistics and computational linguistics. However, the subject itself is tackled with diverse objectives and approaches each time. In this paper, we describe the implementation of a formal model of inflectional morphology capturing typological generalisations that aims at combining efforts made in each subfield giving access to every one of them to valuable methods and/or data that would have been out of range otherwise. We show that both language description and studies in formal morphology and linguistic typology on the one hand, as well as NLP tool and resource development on the other benefit from the availability of such a model and an implementation thereof
Coupling an annotated corpus and a lexicon for state-of-the-art POS tagging
International audienceThis paper investigates how to best couple hand-annotated data with information extracted from an external lexical resource to improve POS tagging performance. Focusing on French tagging, we introduce a maximum entropy conditional sequence tagging system that is enriched with information extracted from a morphological resource. This system gives a 97.7% accuracy on the French Treebank, an error reduction of 23% (28% on unknown words) over the same tagger without lexical information. We also conduct experiments on datasets and lexicons of varying sizes in order to assess the best trade-off between annotating data vs. developing a lexicon. We find that the use of a lexicon improves the quality of the tagger at any stage of development of either resource, and that for fixed performance levels the availability of the full lexicon consistently reduces the need for supervised data by at least one half
Transformer-based Automatic Speech Recognition of Formal and Colloquial Czech in MALACH Project
Czech is a very specific language due to its large differences between the
formal and the colloquial form of speech. While the formal (written) form is
used mainly in official documents, literature, and public speeches, the
colloquial (spoken) form is used widely among people in casual speeches. This
gap introduces serious problems for ASR systems, especially when training or
evaluating ASR models on datasets containing a lot of colloquial speech, such
as the MALACH project. In this paper, we are addressing this problem in the
light of a new paradigm in end-to-end ASR systems -- recently introduced
self-supervised audio Transformers. Specifically, we are investigating the
influence of colloquial speech on the performance of Wav2Vec 2.0 models and
their ability to transcribe colloquial speech directly into formal transcripts.
We are presenting results with both formal and colloquial forms in the training
transcripts, language models, and evaluation transcripts.Comment: to be published in Proceedings of TSD 202
The Lefff 2 syntactic lexicon for French: architecture, acquisition, use
International audienceIn this paper, we introduce a new lexical resource for French which is freely available as the second version of the Lefff (Lexique des formes fl ́echies du franc ̧ ais – Lexicon of French inflected forms). It is a wide-coverage morphosyntactic and syntactic lexicon, whose architecture relies on properties inheritance, which makes it more compact and more easily maintainable and allows to describe lexical entries independantly from the formalisms it is used for. For these two reasons, we define it as a meta-lexicon. We describe its architecture, several automatic or semi-automatic approaches we use to acquire, correct and/or enrich such a lexicon, as well as the way it is used both with an LFG parser and with a TAG parser based on a meta-grammar, so as to build two large-coverage parsers for French
Crowdsourcing for Language Resource Development: Criticisms About Amazon Mechanical Turk Overpowering Use
International audienceThis article is a position paper about Amazon Mechanical Turk, the use of which has been steadily growing in language processing in the past few years. According to the mainstream opinion expressed in articles of the domain, this type of on-line working platforms allows to develop quickly all sorts of quality language resources, at a very low price, by people doing that as a hobby. We shall demonstrate here that the situation is far from being that ideal. Our goal here is manifold: 1- to inform researchers, so that they can make their own choices, 2- to develop alternatives with the help of funding agencies and scientific associations, 3- to propose practical and organizational solutions in order to improve language resources development, while limiting the risks of ethical and legal issues without letting go price or quality, 4- to introduce an Ethics and Big Data Charter for the documentation of language resourc
- …