7 research outputs found

    DeLex, a freely available, large-scale and linguistically grounded morphological lexicon for German

    We introduce DeLex, a freely available, large-scale and linguistically grounded morphological lexicon for German developed within the Alexina framework. We extracted lexical information from the German Wiktionary and developed a morphological inflection grammar for German, based on a linguistically sound model of inflectional morphology. Although the development of DeLex involved some manual work, we show that it represents a good trade-off between development cost, lexical coverage and resource accuracy.
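
    A minimal Python sketch of how a suffix-based inflection class can expand a lemma into its inflected forms, in the spirit of Alexina-style lexicons; the class name and endings below are illustrative, not DeLex's actual inflection scheme.

        # Hypothetical weak-noun class: keep the stem, append case/number endings.
        # Invented for illustration; not DeLex's actual inflection classes.
        INFLECTION_CLASSES = {
            "n-weak": {"nom.sg": "", "gen.sg": "en", "dat.sg": "en",
                       "acc.sg": "en", "nom.pl": "en"},
        }

        def inflect(lemma: str, infl_class: str) -> dict:
            """Generate all inflected forms of a lemma from its inflection class."""
            endings = INFLECTION_CLASSES[infl_class]
            return {feats: lemma + suffix for feats, suffix in endings.items()}

        print(inflect("Student", "n-weak"))
        # {'nom.sg': 'Student', 'gen.sg': 'Studenten', 'dat.sg': 'Studenten', ...}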

    External Lexical Information for Multilingual Part-of-Speech Tagging

    Morphosyntactic lexicons and word vector representations have both proven useful for improving the accuracy of statistical part-of-speech taggers. Here we compare the performance of four systems on datasets covering 16 languages, two of them feature-based (MEMMs and CRFs) and two neural (bi-LSTMs). We show that, on average, all four approaches perform similarly and reach state-of-the-art results. Yet better performance is obtained with our feature-based models on lexically richer datasets (e.g. for morphologically rich languages), whereas neural results are higher on datasets with less lexical variability (e.g. for English). These conclusions hold in particular for the MEMM models relying on our system MElt, which benefited from newly designed features. This shows that, under certain conditions, feature-based approaches enriched with morphosyntactic lexicons remain competitive with neural methods.
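
    To illustrate how a feature-based tagger can exploit an external lexicon, here is a hedged Python sketch in which the tags the lexicon licenses for a word become additional binary features; the lexicon layout and feature names are assumptions, not MElt's actual feature set.

        def token_features(sentence, i, lexicon):
            """Build a CRF/MEMM-style feature dict for the i-th token."""
            word = sentence[i]
            feats = {
                "word.lower": word.lower(),
                "suffix3": word[-3:],
                "is_capitalized": word[0].isupper(),
            }
            # One binary feature per tag the external lexicon allows for this word.
            for tag in lexicon.get(word.lower(), ()):
                feats[f"lex.tag={tag}"] = True
            return feats

        lexicon = {"walks": {"VERB", "NOUN"}}
        print(token_features(["She", "walks"], 1, lexicon))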

    A Language-Independent Approach to Extracting Derivational Relations from an Inflectional Lexicon

    In this paper, we describe and evaluate an unsupervised method for acquiring pairs of lexical entries belonging to the same morphological family, i.e., derivationally related words, starting from a purely inflectional lexicon. Our approach relies on transformation rules that relate lexical entries to one another, and which are automatically extracted from the inflectional lexicon based on surface form analogies and on part-of-speech information. It is generic enough to be applied to any language with a mainly concatenative derivational morphology. Results were obtained and evaluated on English, French, German and Spanish. Precision results are satisfactory, and our French results compare favorably with those of an existing resource, although its construction relied on manually developed lexicographic information whereas our approach only requires an inflectional lexicon.
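
    A rough Python sketch of deriving a suffix-transformation rule from a pair of entries via their longest common prefix, in the spirit of the surface-analogy approach described above; the rule format (strip suffix, add suffix, POS pair) is a simplification of the paper's actual method.

        def suffix_rule(word_a, pos_a, word_b, pos_b):
            """Derive a (strip, add, pos_a, pos_b) rule from a candidate pair."""
            # Length of the longest common prefix of the two surface forms.
            k = 0
            while k < min(len(word_a), len(word_b)) and word_a[k] == word_b[k]:
                k += 1
            return (word_a[k:], word_b[k:], pos_a, pos_b)

        # A hypothesized derivational pair:
        print(suffix_rule("national", "ADJ", "nationalize", "VERB"))
        # ('', 'ize', 'ADJ', 'VERB')

        # Rules sharing the same signature across many pairs can then be kept
        # and applied to propose new derivationally related entries.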

    Improving neural tagging with lexical information

    Neural part-of-speech tagging has achieved competitive results with the incorporation of character-based and pre-trained word embeddings. In this paper, we show that a state-of-the-art bi-LSTM tagger can benefit from using information from morphosyntactic lexicons as additional input. The tagger, trained on several dozen languages, shows a consistent improvement on average when using lexical information, even when character-based embeddings are also used, which demonstrates the complementarity of the different sources of lexical information. The improvements are particularly pronounced on the smaller datasets.
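
    A minimal PyTorch sketch of one plausible way to inject lexicon information into a bi-LSTM tagger: a multi-hot vector of lexicon-licensed tags is concatenated to each word embedding before the recurrent layer. This illustrates the general idea only; it is an assumption, not the paper's exact architecture.

        import torch
        import torch.nn as nn

        class LexBiLSTMTagger(nn.Module):
            def __init__(self, vocab_size, emb_dim, n_lex_tags, hidden, n_tags):
                super().__init__()
                self.emb = nn.Embedding(vocab_size, emb_dim)
                # Lexicon vector widens the LSTM input alongside the embedding.
                self.lstm = nn.LSTM(emb_dim + n_lex_tags, hidden,
                                    bidirectional=True, batch_first=True)
                self.out = nn.Linear(2 * hidden, n_tags)

            def forward(self, word_ids, lex_vecs):
                # lex_vecs: (batch, seq_len, n_lex_tags) multi-hot lexicon vector
                x = torch.cat([self.emb(word_ids), lex_vecs], dim=-1)
                h, _ = self.lstm(x)
                return self.out(h)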

    A multilingual collection of CoNLL-U-compatible morphological lexicons

    We introduce UDLexicons, a multilingual collection of morphological lexicons that follow the guidelines and format of the Universal Dependencies initiative. We describe the three approaches we use to create 53 morphological lexicons covering 38 languages, based on existing resources. These lexicons, which are freely available, have already proven useful for improving part-of-speech tagging accuracy in state-of-the-art architectures.
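
    A small Python sketch of reading one entry of a UD-style lexicon in which each line carries a form, a lemma, a UPOS tag and a morphological feature bundle, tab-separated; the exact column layout is an assumption for illustration, not the published UDLexicons specification.

        def parse_entry(line):
            """Parse a tab-separated lexicon line into a structured entry."""
            form, lemma, upos, feats = line.rstrip("\n").split("\t")
            # Feature bundle uses the UD "Key=Value|Key=Value" convention.
            features = (dict(f.split("=") for f in feats.split("|"))
                        if feats != "_" else {})
            return {"form": form, "lemma": lemma, "upos": upos, "feats": features}

        print(parse_entry("chantaient\tchanter\tVERB\tMood=Ind|Number=Plur|Tense=Imp"))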

    Développement d'un lexique morphologique et syntaxique de l'ancien français (Development of a morphological and syntactic lexicon of Old French)

    In this paper we describe our work on the development of a large-scale morphological and syntactic lexicon of Old French for natural language processing. We rely on dictionaries and lexical resources, from which extracting structured and exploitable information required specific development work. In addition, matching information across these different sources raised difficulties. We provide quantitative information on the resulting lexicon, and discuss its reliability in its current version as well as the prospects for improvement opened up by the existence of a first version, in particular through the automatic analysis of textual data.

    Normalisation orthographique de corpus bruités (Orthographic normalisation of noisy corpora)

    The information contained in messages posted on the Internet (forums, social networks, review sites...) is of strategic importance for many companies. However, few tools have been designed for analysing such messages, whose spelling, typography and syntax are often noisy. This industrial PhD thesis was carried out within the viavoo company with the aim of improving the results of a lemma-based information retrieval tool. We have developed a processing pipeline for the normalisation of noisy texts, whose aim is to ensure that each word is assigned the standard spelling corresponding to one of its lemma's inflected forms. First, among all tokens of the corpus that are unknown to a reference lexicon, we automatically determine which ones result from alterations, and should therefore be normalised, as opposed to those that should not (neologisms, loanwords...). Normalisation candidates are then generated for these tokens using weighted rules obtained with analogy-based machine learning techniques. Next, we identify tokens that are known to the reference lexicon but nevertheless result from an alteration (grammatical errors), and generate normalisation candidates for each of them. Finally, language models allow us to perform a context-sensitive disambiguation of the normalisation candidates generated for all types of alterations. Numerous experiments and evaluations are carried out on French data, both for each module and for the overall pipeline. Special attention has been paid to keeping all modules as language-independent as possible, which paves the way for future adaptations of the pipeline to other European languages.
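
    A toy Python sketch of the pipeline's last two steps: generating normalisation candidates for an unknown token with weighted rewrite rules, then ranking them in context with a language-model score. The rules, the lexicon and the scoring function are illustrative placeholders, not the thesis's actual components.

        # (pattern, replacement, weight) rules; placeholders, not learned rules.
        RULES = [("oo", "ou", 0.9), ("é", "er", 0.6)]
        LEXICON = {"toujours", "manger"}

        def candidates(token):
            """Apply each weighted rule; keep results found in the lexicon."""
            for pat, rep, weight in RULES:
                cand = token.replace(pat, rep)
                if cand != token and cand in LEXICON:
                    yield cand, weight

        def best_normalisation(left_context, token, lm_score):
            """Rank candidates by rule weight times a language-model score."""
            scored = [(weight * lm_score(left_context, cand), cand)
                      for cand, weight in candidates(token)]
            return max(scored)[1] if scored else token

        # lm_score stands in for any model returning P(candidate | context).
        print(best_normalisation(["je", "vais"], "toojours", lambda c, w: 1.0))
        # toujours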