Developing a Comprehensive Standard Persian Positional Tagset
One of the primary tools used in text processing tasks such as information retrieval, text extraction, and text mining is a corpus that is enhanced with linguistic tags. In a corpus development effort, the role of a POS tagger is to assign a linguistic tag to every textual token. POS annotation relies heavily on a tagset based on a linguistic theory. Text processing in Persian follows this common practice. Several tagsets have been introduced so far to annotate Persian corpora. However, each tagset has followed a specific standard and linguistic theory. The resulting tagsets contain a limited number of tags, which renders them inadequate for a larger scope of research. This study is inspired by EAGLES, MULTEXT-East, and positional tagset standards to produce a comprehensive standard positional tagset for Persian. The proposed tagset is also informed by the existing Persian tagsets. The proposed Persian Positional Tagset (PPT) is designed to be used for morphological, lexical, and syntactic annotation of Persian corpora. DOR: 98.1000/1726-8125.2018.16.165.0.1.68.11
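As a rough illustration of what "positional" means here, the sketch below decodes a tag in which the first character names the category and each later character fills one fixed attribute slot. The `LAYOUT` table, its attribute names, and its value codes are hypothetical and chosen only for the example; they are not the actual PPT, EAGLES, or MULTEXT-East inventories.

```python
# Hypothetical layout for illustration only: NOT the real PPT, EAGLES,
# or MULTEXT-East specification.
LAYOUT = {
    "N": ("noun", [
        ("type", {"c": "common", "p": "proper"}),
        ("number", {"s": "singular", "p": "plural"}),
    ]),
    "V": ("verb", [
        ("tense", {"p": "present", "s": "past"}),
        ("person", {"1": "first", "2": "second", "3": "third"}),
        ("number", {"s": "singular", "p": "plural"}),
    ]),
}


def decode_tag(tag):
    """Expand a positional tag into a dict of attribute names and values."""
    category, slots = LAYOUT[tag[0]]
    features = {"category": category}
    # Each character after the first fills the slot at the same position.
    for (name, values), code in zip(slots, tag[1:]):
        if code != "-":  # "-" marks a slot left unspecified
            features[name] = values[code]
    return features


print(decode_tag("Ncp"))
print(decode_tag("Vp-s"))
```

Because every attribute has a fixed position, a tag can leave one slot unspecified without changing how the remaining slots are read, which is what lets a single scheme serve both coarse and fine-grained annotation.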
External Lexical Information for Multilingual Part-of-Speech Tagging
Morphosyntactic lexicons and word vector representations have both proven useful for improving the accuracy of statistical part-of-speech taggers. Here we compare the performances of four systems on datasets covering 16 languages, two of these systems being feature-based (MEMMs and CRFs) and two of them being neural-based (bi-LSTMs). We show that, on average, all four approaches perform similarly and reach state-of-the-art results. Yet better performances are obtained with our feature-based models on lexically richer datasets (e.g. for morphologically rich languages), whereas neural-based results are higher on datasets with less lexical variability (e.g. for English). These conclusions hold in particular for the MEMM models relying on our system MElt, which benefited from newly designed features. This shows that, under certain conditions, feature-based approaches enriched with morphosyntactic lexicons are competitive with respect to neural methods.
A new morphological lexicon and a POS tagger for the Persian Language
In (Sagot and Walther, 2010), the authors introduce an advanced tokenizer and a morphological lexicon for the Persian language named PerLex. In this paper, we describe experiments dedicated to enriching this lexicon and using it for building a POS tagger for Persian.
Developing resources for Persian: the new morphological lexicon PerLex 2 and the morphosyntactic tagger MElt-fa
In this article we present a new version of PerLex, a morphological lexicon of Persian; a corrected and partially re-annotated version of the BijanKhan tagged corpus (BijanKhan, 2004); and MEltfa, a new freely available morphosyntactic tagger for Persian. Having developed a first version of PerLex (Sagot & Walther, 2010), we propose an improved version here. In addition to partial manual validation, PerLex 2 now relies on a linguistically motivated inventory of categories. We have also developed a new version of the BijanKhan corpus, which contains significant corrections to the tokenization as well as a re-tagging with the new categories. Finally, this new version of the corpus was used to train MEltfa, our freely available morphosyntactic tagger for Persian, which builds on this new inventory of categories, on PerLex 2, and on the MElt tagging system (Denis & Sagot, 2009).
A multilingual collection of CoNLL-U-compatible morphological lexicons
We introduce UDLexicons, a multilingual collection of morphological lexicons that follow the guidelines and format of the Universal Dependencies initiative. We describe the three approaches we use to create 53 morphological lexicons covering 38 languages, based on existing resources. These lexicons, which are freely available, have already proven useful for improving part-of-speech tagging accuracy in state-of-the-art architectures.
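To give a sense of what a CoNLL-U-compatible lexicon entry might look like in practice, the sketch below parses one tab-separated line into a form, lemma, universal POS tag, and feature dictionary. The four-column layout and the sample entry are assumptions made for illustration and may not match the actual UDLexicons files; only the feature syntax (`Name=Value` pairs joined by `|`, with `_` for an empty field) follows the CoNLL-U convention.

```python
def parse_feats(feats):
    """Parse a CoNLL-U style FEATS string such as 'Number=Plur|Case=Nom'."""
    if feats == "_":  # "_" is the CoNLL-U placeholder for an empty field
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def parse_entry(line):
    """Parse one lexicon line; the four-column layout is assumed here."""
    form, lemma, upos, feats = line.rstrip("\n").split("\t")
    return {
        "form": form,
        "lemma": lemma,
        "upos": upos,
        "feats": parse_feats(feats),
    }


# Hypothetical sample entry, not taken from any distributed lexicon.
entry = parse_entry("houses\thouse\tNOUN\tNumber=Plur")
print(entry)
```

A tagger could then look up a token's form in a table built from such entries and use the candidate tags and features as additional input.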
Universal Dependencies and Morphology for Hungarian - and on the Price of Universality
In this paper, we present how the principles of universal dependencies and morphology have been adapted to Hungarian. We report the most challenging grammatical phenomena and our solutions to them. On the basis of the adapted guidelines, we have converted and manually corrected 1,800 sentences from the Szeged Treebank to universal dependency format. We also introduce experiments on this manually annotated corpus for evaluating automatic conversion and the added value of language-specific, i.e. non-universal, annotations. Our results reveal that converting to universal dependencies is not necessarily trivial; moreover, using language-specific morphological features may have an impact on overall performance.
Main results of MONDILEX project
The paper presents the results and recommendations of MONDILEX, a 7FP project that covered six Slavic languages: Bulgarian, Polish, Russian, Slovak, Slovene, and Ukrainian. The paper summarizes the research undertaken on standardisation and integration of Slavic language resources and on the establishment of a virtual organisation supporting research infrastructure for Slavic lexicography. The results should be useful for an implementation of a research infrastructure in the coming years.