114 research outputs found

    Latent-Variable PCFGs: Background and Applications

    Get PDF
    Latent-variable probabilistic context-free grammars are latent-variable models that are based on context-free grammars. Nonterminals are associated with latent states that provide contextual information during the top-down rewriting process of the grammar. We survey a few of the techniques used to estimate such grammars and to parse text with them. We also give an overview of what the latent states represent for English Penn treebank parsing, and provide an overview of extensions and related models to these grammars

    Nabra: Syrian Arabic Dialects with Morphological Annotations

    Full text link
    This paper presents Nabra, a corpora of Syrian Arabic dialects with morphological annotations. A team of Syrian natives collected more than 6K sentences containing about 60K words from several sources including social media posts, scripts of movies and series, lyrics of songs and local proverbs to build Nabra. Nabra covers several local Syrian dialects including those of Aleppo, Damascus, Deir-ezzur, Hama, Homs, Huran, Latakia, Mardin, Raqqah, and Suwayda. A team of nine annotators annotated the 60K tokens with full morphological annotations across sentence contexts. We trained the annotators to follow methodological annotation guidelines to ensure unique morpheme annotations, and normalized the annotations. F1 and kappa agreement scores ranged between 74% and 98% across features, showing the excellent quality of Nabra annotations. Our corpora are open-source and publicly available as part of the Currasat portal https://sina.birzeit.edu/currasat

    Discourse Structure in Machine Translation Evaluation

    Full text link
    In this article, we explore the potential of using sentence-level discourse structure for machine translation evaluation. We first design discourse-aware similarity measures, which use all-subtree kernels to compare discourse parse trees in accordance with the Rhetorical Structure Theory (RST). Then, we show that a simple linear combination with these measures can help improve various existing machine translation evaluation metrics regarding correlation with human judgments both at the segment- and at the system-level. This suggests that discourse information is complementary to the information used by many of the existing evaluation metrics, and thus it could be taken into account when developing richer evaluation metrics, such as the WMT-14 winning combined metric DiscoTKparty. We also provide a detailed analysis of the relevance of various discourse elements and relations from the RST parse trees for machine translation evaluation. In particular we show that: (i) all aspects of the RST tree are relevant, (ii) nuclearity is more useful than relation type, and (iii) the similarity of the translation RST tree to the reference tree is positively correlated with translation quality.Comment: machine translation, machine translation evaluation, discourse analysis. Computational Linguistics, 201

    Developing a Comprehensive Standard Persian Positional Tagset

    Get PDF
    One of the primary tools used in text processing tasks such as information retrieval, text extraction, and text mining, is a corpus that is enhnaced by linguistic tags.  In a corpus development effort, the role of a POS-tagger is to assign a linguistic tag to every textual token.  POS annotation relies heavily on a tagset based on a linguistic theory.  Text processing in Persian, too, follows this common practice.  Several tagsets have been introduced, so far, to annotate Persian corpora.  However, each tagset has followed a specific standard and linguistic theory.  The resulting tagsets contain a limited number of tags, which renders them inadequate for a larger scope of research.  This study is inspired by EAGLES, MULTEXT-East, positional tagset standards to produce a comprehensive standard positional tagset for Persian.  The proposed tagset is also informed by the existing Persian tagsets.  The proposed Persian Positional Tagset (PPT) is designed to be used for morphological, lexical, and syntactic annotations of Persian corpora.DOR: 98.1000/1726-8125.2018.16.165.0.1.68.11
    • …
    corecore