Latent-Variable PCFGs: Background and Applications
Latent-variable probabilistic context-free grammars are
latent-variable models that are based on context-free grammars.
Nonterminals are associated with latent states that provide
contextual information during the top-down rewriting process of
the grammar.
We survey a few of the techniques used to estimate such grammars
and to parse text with them. We also give an overview of what the latent
states represent for English Penn treebank parsing, and provide
an overview of extensions and related models to these grammars.
Nabra: Syrian Arabic Dialects with Morphological Annotations
This paper presents Nabra, a corpus of Syrian Arabic dialects with
morphological annotations. A team of Syrian natives collected more than 6K
sentences containing about 60K words from several sources including social
media posts, scripts of movies and series, lyrics of songs and local proverbs
to build Nabra. Nabra covers several local Syrian dialects including those of
Aleppo, Damascus, Deir-ezzur, Hama, Homs, Huran, Latakia, Mardin, Raqqah, and
Suwayda. A team of nine annotators annotated the 60K tokens with full
morphological annotations across sentence contexts. We trained the annotators
to follow methodological annotation guidelines to ensure unique morpheme
annotations, and normalized the annotations. F1 and kappa agreement scores
ranged between 74% and 98% across features, showing the excellent quality of
Nabra annotations. Our corpora are open-source and publicly available as part
of the Currasat portal (https://sina.birzeit.edu/currasat).
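The kappa agreement mentioned above can be illustrated with a minimal sketch. The abstract does not say which kappa variant was used, so this assumes Cohen's kappa for two annotators; the function name and the sample tags are hypothetical and are not taken from Nabra.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same tokens.

    Assumption: two annotators and nominal labels. The Nabra abstract
    does not state which kappa variant its authors computed.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of tokens given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label proportions, summed.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical part-of-speech tags from two annotators for four tokens.
annotator_1 = ["NOUN", "VERB", "NOUN", "NOUN"]
annotator_2 = ["NOUN", "VERB", "VERB", "NOUN"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

With these sample tags the observed agreement is 0.75 and the chance agreement is 0.5, so kappa comes out at 0.5.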
Discourse Structure in Machine Translation Evaluation
In this article, we explore the potential of using sentence-level discourse
structure for machine translation evaluation. We first design discourse-aware
similarity measures, which use all-subtree kernels to compare discourse parse
trees in accordance with the Rhetorical Structure Theory (RST). Then, we show
that a simple linear combination with these measures can help improve various
existing machine translation evaluation metrics regarding correlation with
human judgments both at the segment- and at the system-level. This suggests
that discourse information is complementary to the information used by many of
the existing evaluation metrics, and thus it could be taken into account when
developing richer evaluation metrics, such as the WMT-14 winning combined
metric DiscoTKparty. We also provide a detailed analysis of the relevance of
various discourse elements and relations from the RST parse trees for machine
translation evaluation. In particular we show that: (i) all aspects of the RST
tree are relevant, (ii) nuclearity is more useful than relation type, and (iii)
the similarity of the translation RST tree to the reference tree is positively
correlated with translation quality.
Comment: machine translation, machine translation evaluation, discourse
analysis. Computational Linguistics, 201
Developing a Comprehensive Standard Persian Positional Tagset
One of the primary tools used in text processing tasks such as information retrieval, text extraction, and text mining is a corpus that is enhanced by linguistic tags. In a corpus development effort, the role of a POS tagger is to assign a linguistic tag to every textual token. POS annotation relies heavily on a tagset based on a linguistic theory. Text processing in Persian, too, follows this common practice. Several tagsets have been introduced so far to annotate Persian corpora. However, each tagset has followed a specific standard and linguistic theory. The resulting tagsets contain a limited number of tags, which renders them inadequate for a larger scope of research. This study is inspired by the EAGLES and MULTEXT-East positional tagset standards to produce a comprehensive standard positional tagset for Persian. The proposed tagset is also informed by the existing Persian tagsets. The proposed Persian Positional Tagset (PPT) is designed to be used for morphological, lexical, and syntactic annotations of Persian corpora.
DOR: 98.1000/1726-8125.2018.16.165.0.1.68.11