MS-TR: A Morphologically Enriched Sentiment Treebank and Recursive Deep Models for Compositional Semantics in Turkish
Recursive Deep Models have been used as powerful models to learn
compositional representations of text for many natural language processing tasks.
However, they require structured input (i.e., a sentiment treebank) that encodes
sentences according to their tree-based structure, enabling the models to learn the
latent semantics of words using recursive composition functions. In this paper, we present our
contributions and efforts for the Turkish Sentiment Treebank construction. We
introduce MS-TR, a Morphologically Enriched Sentiment Treebank, which was
implemented for training Recursive Deep Models to address compositional sentiment
analysis for Turkish, one of the well-known Morphologically Rich Languages
(MRLs). We propose a semi-supervised automatic annotation scheme, as a
distant-supervision approach, that uses morphological features of words to infer
the polarity of the inner nodes of MS-TR as positive or negative. The proposed annotation model
has four different annotation levels: morph-level, stem-level, token-level, and
review-level. Each annotation level’s contribution was tested using three different
domain datasets, including product reviews, movie reviews, and the Turkish Natural
Corpus essays. Comparative results were obtained with the Recursive Neural Tensor Networks (RNTN) model which is operated over MS-TR, and conventional machine learning methods. Experiments proved that RNTN outperformed the baseline methods and achieved much better accuracy results compared to the baseline methods, which cannot accurately capture the aggregated sentiment information
Statistical Parsing by Machine Learning from a Classical Arabic Treebank
Research into statistical parsing for English has enjoyed over a decade of successful results. However, adapting these models to other languages has met with difficulties. Previous comparative work has shown that Modern Arabic is one of the most difficult languages to parse due to rich morphology and free word order. Classical Arabic is the ancient form of Arabic, and is understudied in computational linguistics, relative to its worldwide reach as the language of the Quran. The thesis is based on seven publications that make significant contributions to knowledge relating to annotating and parsing Classical Arabic.
Classical Arabic has been studied in depth by grammarians for over a thousand years using a traditional grammar known as i’rāb (إعراب). Using this grammar to develop a representation for parsing is challenging, as it describes syntax using a hybrid of phrase-structure and dependency relations. This work aims to advance the state-of-the-art for hybrid parsing by introducing a formal representation for annotation and a resource for machine learning. The main contributions are the first treebank for Classical Arabic and the first statistical dependency-based parser in any language for ellipsis, dropped pronouns and hybrid representations.
A central argument of this thesis is that using a hybrid representation closely aligned to traditional grammar leads to improved parsing for Arabic. To test this hypothesis, two approaches are compared. As a reference, a pure dependency parser is adapted using graph transformations, resulting in an 87.47% F1-score. This is compared to an integrated parsing model with an F1-score of 89.03%, demonstrating that joint dependency-constituency parsing is better suited to Classical Arabic.
The Quran was chosen for annotation as a large body of work exists providing detailed syntactic analysis. Volunteer crowdsourcing is used for annotation in combination with expert supervision. A practical result of the annotation effort is the corpus website, http://corpus.quran.com, an educational resource with over two million users per year.
Parsing Arabic Dialects
The Arabic language is a collection of spoken dialects with important phonological, morphological, lexical, and syntactic differences, along with a standard written language, Modern Standard Arabic (MSA). Since the spoken dialects are not officially written, it is very costly to obtain adequate corpora to use for training dialect NLP tools such as parsers. In this paper, we address the problem of parsing transcribed spoken Levantine Arabic (LA). We do not assume the existence of any annotated LA corpus (except for development and testing), nor of a parallel LA-MSA corpus. Instead, we use explicit knowledge about the relation between LA and MSA.
ORTHOGRAPHIC ENRICHMENT FOR ARABIC GRAMMATICAL ANALYSIS
Thesis (Ph.D.) - Indiana University, Linguistics, 2010. The Arabic orthography is problematic in two ways: (1) it lacks the short vowels, and this leads to ambiguity, as the same orthographic form can be pronounced in many different ways, each of which can have its own grammatical category, and (2) the Arabic word may contain several units like pronouns, conjunctions, articles and prepositions without an intervening white space. These two problems lead to difficulties in the automatic processing of Arabic. The thesis proposes a pre-processing scheme that applies word segmentation and word vocalization for the purpose of grammatical analysis: part-of-speech tagging and parsing. The thesis examines the impact of human-produced vocalization and segmentation on the grammatical analysis of Arabic, then applies a pipeline of automatic vocalization and segmentation for the purpose of Arabic part-of-speech tagging. The pipeline is then used, along with the POS tags produced, for the purpose of dependency parsing, which produces grammatical relations between the words in a sentence. The study uses the memory-based algorithm for vocalization, segmentation, and part-of-speech tagging, and the natural language parser MaltParser for dependency parsing. The thesis represents the first approach to the processing of real-world Arabic, and finds that, through the correct choice of features and algorithms, the need for pre-processing for grammatical analysis can be minimized.
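The pre-processing pipeline described above chains vocalization, clitic segmentation, and POS tagging before parsing. The sketch below only illustrates that staging with invented toy lookups and an invented "wa-" clitic rule; the thesis's actual components are memory-based learners and MaltParser, not dictionary lookups:

```python
# Illustrative pipeline stages (vocalize -> segment -> POS-tag); the
# lookups and the "wa-" rule are invented examples for exposition only.

def vocalize(token):
    # Restore short vowels (toy lookup; a real system predicts these)
    return {"ktb": "kataba"}.get(token, token)

def segment(token):
    # Split off proclitics such as the conjunction "wa-" ("and")
    if token.startswith("wa"):
        return ["wa", token[2:]]
    return [token]

def pos_tag(units):
    # Toy lexicon standing in for a trained memory-based tagger
    toy_lexicon = {"wa": "CONJ", "kataba": "VERB"}
    return [(u, toy_lexicon.get(u, "NOUN")) for u in units]

def pipeline(raw_token):
    """Run one unvocalized token through all pre-processing stages."""
    units = segment(vocalize(raw_token))
    return pos_tag(units)

tagged = pipeline("wakataba")  # [("wa", "CONJ"), ("kataba", "VERB")]
```

The tagged, segmented units would then be handed to a dependency parser such as MaltParser, which is the final stage the thesis evaluates.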
Zero-shot Dependency Parsing with Pre-trained Multilingual Sentence Representations
We investigate whether off-the-shelf deep bidirectional sentence
representations trained on a massively multilingual corpus (multilingual BERT)
enable the development of an unsupervised universal dependency parser. This
approach only leverages a mix of monolingual corpora in many languages and does
not require any translation data making it applicable to low-resource
languages. In our experiments we outperform the best CoNLL 2018
language-specific systems in all of the shared task's six truly low-resource
languages while using a single system. However, we also find that (i) parsing
accuracy still varies dramatically when changing the training languages and
(ii) in some target languages zero-shot transfer fails under all tested
conditions, raising concerns about the 'universality' of the whole approach.
Comment: DeepLo workshop, EMNLP 201