7 research outputs found

    DeLex, a freely available, large-scale and linguistically grounded morphological lexicon for German

    We introduce DeLex, a freely available, large-scale and linguistically grounded morphological lexicon for German developed within the Alexina framework. We extracted lexical information from the German Wiktionary and developed a morphological inflection grammar for German, based on a linguistically sound model of inflectional morphology. Although the development of DeLex involved some manual work, we show that it represents a good trade-off between development cost, lexical coverage and resource accuracy.
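
    A minimal Python sketch of how a suffix-based inflection class can expand a lemma into its inflected forms, in the spirit of Alexina-style lexicons; the class name and endings below are illustrative, not DeLex's actual inflection scheme.

        # Hypothetical weak-noun class: keep the stem, append case/number endings.
        # Invented for illustration; not DeLex's actual inflection classes.
        INFLECTION_CLASSES = {
            "n-weak": {"nom.sg": "", "gen.sg": "en", "dat.sg": "en",
                       "acc.sg": "en", "nom.pl": "en"},
        }

        def inflect(lemma: str, infl_class: str) -> dict:
            """Generate all inflected forms of a lemma from its inflection class."""
            endings = INFLECTION_CLASSES[infl_class]
            return {feats: lemma + suffix for feats, suffix in endings.items()}

        print(inflect("Student", "n-weak"))
        # {'nom.sg': 'Student', 'gen.sg': 'Studenten', 'dat.sg': 'Studenten', ...}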

    External Lexical Information for Multilingual Part-of-Speech Tagging

    Morphosyntactic lexicons and word vector representations have both proven useful for improving the accuracy of statistical part-of-speech taggers. Here we compare the performance of four systems on datasets covering 16 languages, two of them feature-based (MEMMs and CRFs) and two neural (bi-LSTMs). We show that, on average, all four approaches perform similarly and reach state-of-the-art results. Yet better performance is obtained with our feature-based models on lexically richer datasets (e.g. for morphologically rich languages), whereas neural results are higher on datasets with less lexical variability (e.g. for English). These conclusions hold in particular for the MEMM models relying on our system MElt, which benefited from newly designed features. This shows that, under certain conditions, feature-based approaches enriched with morphosyntactic lexicons remain competitive with neural methods.
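
    To illustrate how a feature-based tagger can exploit an external lexicon, here is a hedged Python sketch in which the tags the lexicon licenses for a word become additional binary features; the lexicon layout and feature names are assumptions, not MElt's actual feature set.

        def token_features(sentence, i, lexicon):
            """Build a CRF/MEMM-style feature dict for the i-th token."""
            word = sentence[i]
            feats = {
                "word.lower": word.lower(),
                "suffix3": word[-3:],
                "is_capitalized": word[0].isupper(),
            }
            # One binary feature per tag the external lexicon allows for this word.
            for tag in lexicon.get(word.lower(), ()):
                feats[f"lex.tag={tag}"] = True
            return feats

        lexicon = {"walks": {"VERB", "NOUN"}}
        print(token_features(["She", "walks"], 1, lexicon))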

    A Language-Independent Approach to Extracting Derivational Relations from an Inflectional Lexicon

    In this paper, we describe and evaluate an unsupervised method for acquiring pairs of lexical entries belonging to the same morphological family, i.e., derivationally related words, starting from a purely inflectional lexicon. Our approach relies on transformation rules that relate lexical entries to one another, and which are automatically extracted from the inflectional lexicon based on surface form analogies and on part-of-speech information. It is generic enough to be applied to any language with a mainly concatenative derivational morphology. Results were obtained and evaluated on English, French, German and Spanish. Precision results are satisfactory, and our French results compare favorably with those of an existing resource, although its construction relied on manually developed lexicographic information whereas our approach only requires an inflectional lexicon.
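
    A rough Python sketch of deriving a suffix-transformation rule from a pair of entries via their longest common prefix, in the spirit of the surface-analogy approach described above; the rule format (strip suffix, add suffix, POS pair) is a simplification of the paper's actual method.

        def suffix_rule(word_a, pos_a, word_b, pos_b):
            """Derive a (strip, add, pos_a, pos_b) rule from a candidate pair."""
            # Length of the longest common prefix of the two surface forms.
            k = 0
            while k < min(len(word_a), len(word_b)) and word_a[k] == word_b[k]:
                k += 1
            return (word_a[k:], word_b[k:], pos_a, pos_b)

        # A hypothesized derivational pair:
        print(suffix_rule("national", "ADJ", "nationalize", "VERB"))
        # ('', 'ize', 'ADJ', 'VERB')

        # Rules sharing the same signature across many pairs can then be kept
        # and applied to propose new derivationally related entries.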

    Improving neural tagging with lexical information

    Neural part-of-speech tagging has achieved competitive results with the incorporation of character-based and pre-trained word embeddings. In this paper, we show that a state-of-the-art bi-LSTM tagger can benefit from using information from morphosyntactic lexicons as additional input. The tagger, trained on several dozen languages, shows a consistent improvement on average when using lexical information, even when character-based embeddings are also used, which demonstrates the complementarity of the different sources of lexical information. The improvements are particularly pronounced on the smaller datasets.
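
    A minimal PyTorch sketch of one plausible way to inject lexicon information into a bi-LSTM tagger: a multi-hot vector of lexicon-licensed tags is concatenated to each word embedding before the recurrent layer. This illustrates the general idea only; it is an assumption, not the paper's exact architecture.

        import torch
        import torch.nn as nn

        class LexBiLSTMTagger(nn.Module):
            def __init__(self, vocab_size, emb_dim, n_lex_tags, hidden, n_tags):
                super().__init__()
                self.emb = nn.Embedding(vocab_size, emb_dim)
                # Lexicon vector widens the LSTM input alongside the embedding.
                self.lstm = nn.LSTM(emb_dim + n_lex_tags, hidden,
                                    bidirectional=True, batch_first=True)
                self.out = nn.Linear(2 * hidden, n_tags)

            def forward(self, word_ids, lex_vecs):
                # lex_vecs: (batch, seq_len, n_lex_tags) multi-hot lexicon vector
                x = torch.cat([self.emb(word_ids), lex_vecs], dim=-1)
                h, _ = self.lstm(x)
                return self.out(h)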

    A multilingual collection of CoNLL-U-compatible morphological lexicons

    We introduce UDLexicons, a multilingual collection of morphological lexicons that follow the guidelines and format of the Universal Dependencies initiative. We describe the three approaches we use to create 53 morphological lexicons covering 38 languages, based on existing resources. These lexicons, which are freely available, have already proven useful for improving part-of-speech tagging accuracy in state-of-the-art architectures.
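
    A small Python sketch of reading one entry of a UD-style lexicon in which each line carries a form, a lemma, a UPOS tag and a morphological feature bundle, tab-separated; the exact column layout is an assumption for illustration, not the published UDLexicons specification.

        def parse_entry(line):
            """Parse a tab-separated lexicon line into a structured entry."""
            form, lemma, upos, feats = line.rstrip("\n").split("\t")
            # Feature bundle uses the UD "Key=Value|Key=Value" convention.
            features = (dict(f.split("=") for f in feats.split("|"))
                        if feats != "_" else {})
            return {"form": form, "lemma": lemma, "upos": upos, "feats": features}

        print(parse_entry("chantaient\tchanter\tVERB\tMood=Ind|Number=Plur|Tense=Imp"))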

    Développement d'un lexique morphologique et syntaxique de l'ancien français (Development of a morphological and syntactic lexicon of Old French)

    In this paper we describe our work on the development of a large-scale morphological and syntactic lexicon of Old French for natural language processing. We rely on dictionaries and lexical resources, from which extracting structured and exploitable information required specific development work. In addition, matching information across these different sources raised difficulties. We provide quantitative information on the resulting lexicon, and discuss its reliability in its current version as well as the prospects for improvement opened up by the existence of a first version, in particular through the automatic analysis of textual data.

    Normalisation orthographique de corpus bruités (Orthographic normalisation of noisy corpora)

    The information contained in messages posted on the Internet (forums, social networks, review sites...) is of strategic importance for many companies. However, few tools have been designed for analysing such messages, whose spelling, typography and syntax are often noisy. This industrial PhD thesis was carried out within the viavoo company with the aim of improving the results of a lemma-based information retrieval tool. We have developed a processing pipeline for the normalisation of noisy texts, whose aim is to ensure that each word is assigned the standard spelling corresponding to one of its lemma's inflected forms. First, among all tokens of the corpus that are unknown to a reference lexicon, we automatically determine which ones result from alterations, and should therefore be normalised, as opposed to those that should not (neologisms, loanwords...). Normalisation candidates are then generated for these tokens using weighted rules obtained with analogy-based machine learning techniques. Next, we identify tokens that are known to the reference lexicon but nevertheless result from an alteration (grammatical errors), and generate normalisation candidates for each of them. Finally, language models allow us to perform a context-sensitive disambiguation of the normalisation candidates generated for all types of alterations. Numerous experiments and evaluations are carried out on French data, both for each module and for the overall pipeline. Special attention has been paid to keeping all modules as language-independent as possible, which paves the way for future adaptations of the pipeline to other European languages.
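
    A toy Python sketch of the pipeline's last two steps: generating normalisation candidates for an unknown token with weighted rewrite rules, then ranking them in context with a language-model score. The rules, the lexicon and the scoring function are illustrative placeholders, not the thesis's actual components.

        # (pattern, replacement, weight) rules; placeholders, not learned rules.
        RULES = [("oo", "ou", 0.9), ("é", "er", 0.6)]
        LEXICON = {"toujours", "manger"}

        def candidates(token):
            """Apply each weighted rule; keep results found in the lexicon."""
            for pat, rep, weight in RULES:
                cand = token.replace(pat, rep)
                if cand != token and cand in LEXICON:
                    yield cand, weight

        def best_normalisation(left_context, token, lm_score):
            """Rank candidates by rule weight times a language-model score."""
            scored = [(weight * lm_score(left_context, cand), cand)
                      for cand, weight in candidates(token)]
            return max(scored)[1] if scored else token

        # lm_score stands in for any model returning P(candidate | context).
        print(best_normalisation(["je", "vais"], "toojours", lambda c, w: 1.0))
        # toujours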