
409 research outputs found

External Lexical Information for Multilingual Part-of-Speech Tagging

Author: Sagot Benoît
Publication date: 01/06/2016

Morphosyntactic lexicons and word vector representations have both proven useful for improving the accuracy of statistical part-of-speech taggers. Here we compare the performance of four systems on datasets covering 16 languages, two of these systems being feature-based (MEMMs and CRFs) and two of them being neural-based (bi-LSTMs). We show that, on average, all four approaches perform similarly and reach state-of-the-art results. Yet our feature-based models perform better on lexically richer datasets (e.g. for morphologically rich languages), whereas neural-based results are higher on datasets with less lexical variability (e.g. for English). These conclusions hold in particular for the MEMM models relying on our system MElt, which benefited from newly designed features. This shows that, under certain conditions, feature-based approaches enriched with morphosyntactic lexicons are competitive with neural methods.

arXiv.org e-Print Archive

INRIA a CCSD electronic archive server

Hal-Diderot

Comparing Complexity Measures

Author: Sagot Benoît
Publication venue: HAL CCSD
Publication date: 22/02/2013


INRIA a CCSD electronic archive server

Hal-Diderot

DeLex, a freely-available, large-scale and linguistically grounded morphological lexicon for German

Author: Sagot Benoît
Publication venue: HAL CCSD
Publication date: 26/05/2014

We introduce DeLex, a freely available, large-scale and linguistically grounded morphological lexicon for German developed within the Alexina framework. We extracted lexical information from the German Wiktionary and developed a morphological inflection grammar for German, based on a linguistically sound model of inflectional morphology. Although the development of DeLex involved some manual work, we show that it represents a good tradeoff between development cost, lexical coverage and resource accuracy.

INRIA a CCSD electronic archive server

Hal-Diderot

Étiquetage multilingue en parties du discours avec MElt

Author: Sagot Benoît
Publication venue: HAL CCSD
Publication date: 04/07/2016

We describe recent evolutions of MElt, a discriminative part-of-speech tagging system. MElt is targeted at the optimal exploitation of information provided by external lexicons for improving its performance over models trained solely on annotated corpora. We have trained MElt on more than 40 datasets covering over 30 languages. Compared with the state-of-the-art system MarMoT, MElt's results are slightly worse on average when no external lexicon is used, but slightly better when such resources are available, resulting in state-of-the-art taggers for a number of languages.

INRIA a CCSD electronic archive server

Hal-Diderot

Building a free French wordnet from multilingual resources

Author: Fišer Darja
Sagot Benoît
Publication venue: HAL CCSD
Publication date: 31/05/2008

This paper describes the automatic construction of a freely available wordnet for French (WOLF) based on Princeton WordNet (PWN), using various multilingual resources. Polysemous words were handled with an approach in which a parallel corpus for five languages was word-aligned and the extracted multilingual lexicon was disambiguated with the existing wordnets for these languages. For monosemous words, on the other hand, a bilingual approach sufficed to acquire equivalents. Bilingual lexicons were extracted from Wikipedia and thesauri. The results obtained from each resource were merged and ranked according to the number of resources yielding the same literal. Automatic evaluation of the merged wordnet was performed against the French WordNet (FREWN). Manual evaluation was also carried out on a sample of the generated synsets. The precision obtained shows that the approach is very promising, and applications using the created wordnet are already planned.

CiteSeerX

INRIA a CCSD electronic archive server

Hal-Diderot

Verbes de citation et Tables du Lexique-Grammaire

Author: Danlos Laurence
Sagot Benoît
Publication venue: HAL CCSD
Publication date: 01/09/2010

This article sets out to study systematically how and where the verbs that can head a quotation parenthetical are distributed across the simple-verb tables of the Lexicon-Grammar (LG). At present, only Table 9 encodes this property (column 'P', V N0 à N2).

INRIA a CCSD electronic archive server

Hal-Diderot

Could Greek and Italic share a same Indo-European substratum?

Author: Garnier Romain
Sagot Benoît
Publication venue: HAL CCSD
Publication date: 27/07/2015

Greek and Latin have developed from their common Proto-Indo-European (PIE) ancestor in distinct ways, resulting in two languages that exhibit very different features, in particular regarding phonology and Wortbildung. Moreover, the Greek lexicon has long been recognised for its huge proportion of non-inherited words, among which it is difficult to draw a clear distinction between substrata and loan words. Several of the languages that contributed to shaping the Greek lexicon are Indo-European. Among the Indo-European contributors to the non-inherited Greek lexicon, we tentatively identify a language that shares phonetic and morphological features with substratic elements attested in Italic, and possibly articulatory properties of Latin itself. We shall review five phonetic features of this language: (i) voiceless reflexes of PIE voiced aspirated stops; (ii) the anticipation of nasals resembling lex-unda in Latin but generalised to labial stops, such that VCnV > VnGV with lenition of the consonant; (iii) a velarised /ł/ (viz. l pinguis) which can trigger an anaptyctic -ŏ- or -ŭ-; (iv) apparent voice alternations that follow similar patterns to the Verner law in Germanic; (v) the metathesis of -r-, such that CVrC > CrVC. Our study also unveils morphological peculiarities of this language: (a) the frequent use of elsewhere poorly attested labial morphs, leading to nouns of the form *CóC-Po- and adjectives of the form *CoC-Pó-; (b) the frequent use of a prefix *eǵhs- (cf. Lat. ex-, Gr. ἐξ-) reflected as a simple *s-; (c) the frequent occurrence of action nouns built with the well-known *CóC-no- pattern.

HAL-UNILIM

HAL Clermont Université

INRIA a CCSD electronic archive server

Hal-Diderot

Intégrer les tables du Lexique-Grammaire à un analyseur syntaxique robuste à grande échelle

Author: Sagot Benoît
Tolone Elsa
Publication venue: HAL CCSD
Publication date: 24/06/2009

In this paper, we describe how we converted the lexicon-grammar tables into an NLP format, that of the Lefff lexicon, which allowed us to integrate them into the FRMG parser. We describe the linguistic basis of this conversion process and the resulting lexicon. We validate the resulting lexicon by evaluating the FRMG parser on the EASy reference corpus depending on the set of verbal entries it relies on, namely those of the Lefff or those of the converted lexicon-grammar verb tables.

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot

HAL-Ecole des Ponts ParisTech

HAL - UPEC / UPEM

Merging syntactic lexica: the case for French verbs

Author: Danlos Laurence
Sagot Benoît
Publication venue: HAL CCSD
Publication date: 22/05/2012

Syntactic lexicons, which associate each lexical entry with information such as valency, are crucial for several natural language processing tasks, such as parsing. However, because they contain rich and complex information, they are very costly to develop. In this paper, we show how syntactic lexical resources can be merged in order to benefit from their respective strong points, despite the disparities in the way they represent syntactic lexical information. We illustrate our methodology with the example of French verbs. We describe four large-coverage syntactic lexicons for this language, among which the Lefff, and show how we were able, using our merging algorithm, to extend and improve the Lefff.

INRIA a CCSD electronic archive server

Hal-Diderot

Normalisation de textes par analogie: le cas des mots inconnus

Author: Baranes Marion
Sagot Benoît
Publication venue: HAL CCSD
Publication date: 01/07/2014

Analogy-based text normalization: the case of unknown words. In this paper, we describe and evaluate a system for improving the quality of noisy texts containing non-word errors. It is meant to be integrated into a full information extraction architecture, and aims at improving its results. For each word unknown to a reference lexicon which is neither a named entity nor a neologism, our system suggests one or several normalization candidates (any known word which has the same lemma as the spell-corrected form is a valid candidate). For this purpose, we use an analogy-based approach for acquiring normalization rules and use them in the same way as lexical spelling correction rules.

INRIA a CCSD electronic archive server

Hal-Diderot