22 research outputs found

    External Lexical Information for Multilingual Part-of-Speech Tagging

    Morphosyntactic lexicons and word vector representations have both proven useful for improving the accuracy of statistical part-of-speech taggers. Here we compare the performance of four systems on datasets covering 16 languages: two are feature-based (MEMMs and CRFs) and two are neural (bi-LSTMs). We show that, on average, all four approaches perform similarly and reach state-of-the-art results. However, our feature-based models perform better on lexically richer datasets (e.g. for morphologically rich languages), whereas the neural models score higher on datasets with less lexical variability (e.g. for English). These conclusions hold in particular for the MEMM models relying on our system MElt, which benefited from newly designed features. This shows that, under certain conditions, feature-based approaches enriched with morphosyntactic lexicons are competitive with neural methods.

    Étiquetage multilingue en parties du discours avec MElt

    We describe recent evolutions of MElt, a discriminative part-of-speech tagging system. MElt is targeted at the optimal exploitation of information provided by external lexicons for improving its performance over models trained solely on annotated corpora. We have trained MElt on more than 40 datasets covering over 30 languages. Compared with the state-of-the-art system MarMoT, MElt's results are slightly worse on average when no external lexicon is used, but slightly better when such resources are available, resulting in state-of-the-art taggers for a number of languages.

    Empirical studies on word representations

    One of the most fundamental tasks in natural language processing is representing words with mathematical objects (such as vectors). Word representations, which are most often estimated from data, capture the meaning of words. They make it possible to compare words according to their semantic similarity, and have been shown to work extremely well when included in complex real-world applications. A large part of our work deals with ways of estimating word representations directly from large quantities of text. Our methods exploit the idea that words which occur in similar contexts have similar meanings. How we define the context is an important focus of our thesis. The context can consist of a number of words to the left and to the right of the word in question, but, as we show, obtaining context words via syntactic links (such as the link between a verb and its subject) often works better. We furthermore investigate word representations that accurately capture multiple meanings of a single word. We show that the translation of a word in context contains information that can be used to disambiguate the meaning of that word.

    Compounding in Namagowab and English: (exploring meaning creation in compounds)

    This essay investigates compounding in Namagowab and English, which belong to two widely divergent groups of languages, Khoesan and Indo-European, respectively. The first motive is to investigate how and why new words are created from existing ones. The reading and data interpretation seek an understanding of word formation and an overview of semantic compositionality, structure and productivity, within the broad context of the cognitive, lexicalist and distributed morphology paradigms. This, coupled with historical reading about the languages and their speakers, is used to speculate about why compounds feature in lexical creation. Compounding is prevalent in both languages, and their distance in terms of phylogenetic relationships should allow limited generalizing about these processes of formation. Word lists taken from dictionaries in both languages were analyzed by entering the words in Excel spreadsheets so that various attributes of these words, such as word type, compound class (Noun, Verb, Preposition, Adjective and Adverb) and constituent class, could be counted and described with formulae, and compound and constituent meaning analyzed. The conclusion was that sociohistorical factors such as language contact, and aspects of cognition such as memory and transparency, account for compounding in a language, in addition to typology.

    Max Planck Institute for Psycholinguistics: Annual report 1996


    Proceedings of the VIIth GSCP International Conference

    The 7th International Conference of the Gruppo di Studi sulla Comunicazione Parlata, dedicated to the memory of Claire Blanche-Benveniste, chose Speech and Corpora as its main theme. The wide international origin of the 235 authors, from 21 countries and 95 institutions, led to papers on many different languages. The 89 papers of this volume reflect the themes of the conference: spoken corpora compilation and annotation, along with the related technological fields; the relation between prosody and pragmatics; speech pathologies; and various papers on phonetics, speech and linguistic analysis, pragmatics and sociolinguistics. Many papers are also dedicated to speech and second language studies. The online publication with FUP allows direct access to sound and video linked to papers (when downloaded).