
    Étiquetage morphosyntaxique de langues non dotées à partir de ressources pour une langue étymologiquement proche (Part-of-speech tagging of unresourced languages using resources from an etymologically close language)

    We introduce a generic approach for transferring part-of-speech annotations from a resourced language to a non-resourced but etymologically close language. We do not rely on the existence of any parallel corpora or on any linguistic knowledge of the non-resourced language (no lexicons, no annotated corpora). Our approach only makes use of cognate pairs that are automatically induced in an unsupervised way, based on character-based statistical machine translation, together with a morphosyntactic lexicon for the resourced language. Frequent and short words are treated differently: we tag them directly, based on a cross-language similarity assessment of their immediate morphosyntactic contexts. Using German as the resourced language, we evaluate our approach on Dutch (in fact a resourced language) and on Palatine German, reaching tagging accuracies of 67.2% on Dutch and 60.7% on Palatine German.
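    Once cognate pairs are available, the projection step itself is straightforward. The following is a minimal Python sketch of that one step, assuming the cognates have already been induced (the abstract's character-based statistical MT stage is omitted); the function names and the toy German/Dutch data are hypothetical illustrations, not the authors' implementation.

```python
# Hypothetical sketch: project POS tags through induced cognate pairs.
# Cognate induction (character-based statistical MT) is assumed done.
from collections import Counter, defaultdict

def project_tags(cognate_pairs, source_lexicon):
    """Build a tagged lexicon for the non-resourced language.

    cognate_pairs  -- iterable of (source_word, target_word) tuples
    source_lexicon -- dict mapping source_word -> set of POS tags
    """
    target_counts = defaultdict(Counter)
    for src, tgt in cognate_pairs:
        for tag in source_lexicon.get(src, ()):
            target_counts[tgt][tag] += 1
    # Keep the most frequently projected tag per target word.
    return {w: c.most_common(1)[0][0] for w, c in target_counts.items()}

# Toy example (illustrative only):
lexicon = {"Haus": {"NOUN"}, "trinken": {"VERB"}}
pairs = [("Haus", "huis"), ("trinken", "drinken")]
print(project_tags(pairs, lexicon))  # {'huis': 'NOUN', 'drinken': 'VERB'}
```

    Note that the sketch covers only the cognate-projection route; the paper's separate treatment of frequent and short words, tagged directly via cross-language context similarity, is not shown here.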

    Bilingual lexicon extraction from comparable corpora for closely related languages

    In this paper we present a knowledge-light approach to extracting a bilingual lexicon for closely related languages from comparable corpora. While most related work uses an existing dictionary to translate context vectors, we instead take advantage of the similarities between the languages: we build a seed lexicon from words that are identical in both languages and then extend it with context-based cognates and with translations of the most frequent words. We also use cognates to rerank translation candidates obtained via context similarity, and we extract translation equivalents for all content words, not just nouns as in most related work. By enlarging the seed lexicon with cognates and frequent-word translations and by cognate-based reranking of translation candidates, we improve the mean reciprocal rank over the ten top-ranking translation candidates from a baseline of 0.592 to 0.797 for nouns, verbs and adjectives, with 46% recall on a gold standard of 1,000 random entries from a traditional dictionary. These results are very encouraging and suggest that other pairs of similar languages could benefit from the same approach.
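    To make the context-similarity step concrete, here is a hedged Python sketch: a seed lexicon is initialised from identically spelled words, context vectors are translated through it, and a candidate translation is scored by cosine similarity. All names and the toy data are hypothetical assumptions; the paper's seed-lexicon extension with cognates and frequent-word translations, and its cognate-based reranking, are omitted.

```python
# Hypothetical sketch of context-vector comparison over comparable
# corpora, with a seed lexicon of identically spelled words.
import math
from collections import Counter

def context_vector(word, corpus, window=3):
    """Count words co-occurring within `window` tokens of `word`."""
    vec = Counter()
    for sent in corpus:
        for i, tok in enumerate(sent):
            if tok == word:
                left = sent[max(0, i - window):i]
                right = sent[i + 1:i + 1 + window]
                for ctx in left + right:
                    vec[ctx] += 1
    return vec

def translate_vector(vec, seed_lexicon):
    """Map context words through the seed lexicon, dropping unknowns."""
    return Counter({seed_lexicon[w]: c for w, c in vec.items() if w in seed_lexicon})

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Toy comparable corpora for two invented closely related languages.
src_corpus = [["das", "alte", "haus"], ["das", "haus", "brennt"]]
tgt_corpus = [["das", "oude", "huis"], ["das", "huis", "brandt"]]

# Seed lexicon from words spelled identically in both corpora.
src_vocab = {w for s in src_corpus for w in s}
tgt_vocab = {w for s in tgt_corpus for w in s}
seed = {w: w for w in src_vocab & tgt_vocab}

# Score a candidate translation pair ("haus", "huis").
src_vec = translate_vector(context_vector("haus", src_corpus), seed)
tgt_vec = context_vector("huis", tgt_corpus)
print(cosine(src_vec, tgt_vec))  # positive when translated contexts overlap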