2,520 research outputs found
Proceedings of the Seventh International Conference Formal Approaches to South Slavic and Balkan languages
Proceedings of the Seventh International Conference Formal Approaches to South Slavic and Balkan Languages publishes 17 papers that were presented at the conference organised in Dubrovnik, Croatia, 4-6 Octobre 2010
Recommended from our members
Sentiment Analysis for the Low-Resourced Latinised Arabic "Arabizi"
The expansion of digital communication mediums from private mobile messaging into the public through social media presented an opportunity for the data science research and industry to mine the generated big data for artificial information extraction. A popular information extraction task is sentiment analysis, which aims at extracting polarity opinions, positive, negative, or neutral, from the written natural language. This science helped organisations better understand the public’s opinion towards events, news, public figures, and products.
However, sentiment analysis has advanced for the English language ahead of Arabic. While sentiment analysis for Arabic is developing in the literature of Natural Language Processing (NLP), a popular variety of Arabic, Arabizi, has been overlooked for sentiment analysis advancements.
Arabizi is an informal transcription of the spoken dialectal Arabic in Latin script used for social texting. It is known to be common among the Arab youth, yet it is overlooked in efforts on Arabic sentiment analysis for its linguistic complexities.
As to Arabic, Arabizi is rich in inflectional morphology, but also codeswitched with English or French, and distinctively transcribed without adhering to a standard orthography. The rich morphology, inconsistent orthography, and codeswitching challenges are compounded together to have a multiplied effect on the lexical sparsity of the language, where each Arabizi word becomes eligible to be spelled in many ways, that, in addition to the mixing of other languages within the same textual context. The resulting high degree of lexical sparsity defies the very basics of sentiment analysis, classification of positive and negative words. Arabizi is even faced with a severe shortage of data resources that are required to set out any sentiment analysis approach.
In this thesis, we tackle this gap by conducting research on sentiment analysis for Arabizi. We addressed the sparsity challenge by harvesting Arabizi data from multi-lingual social media text using deep learning to build Arabizi resources for sentiment analysis. We developed six new morphologically and orthographically rich Arabizi sentiment lexicons and set the baseline for Arabizi sentiment analysis on social media
Morphological awareness in readers of IsiXhosa
This study focuses particularly on the development of four Morphological Awareness reading tests in isiXhosa and on the relationship of Morphological Awareness to reading success among 74 Grade 3 isiXhosa-speaking foundation-phase learners from three peri-urban schools. It explores in-depth why not all previously established Morphological Awareness tests for other languages suit the morphology of isiXhosa and how these tests have been revised in order to do so. Conventionally, the focus of Morphological Awareness literature has been on derivational morphology and reading comprehension. This study did not find significant correlations with comprehension, but rather with the children's ability to decode. Fluency and Morphological Awareness have not been given as much attention in the literature, but Morphological Awareness could be important for processing the agglutinating structure of the language in reading. This study also argues that it is not a specific awareness of derivational morphology over inflectional morphology, but rather a general awareness of one's language structure that is more important at this stage in their literacy development; specifically a general awareness of prefixes and suffixes. In addition, it was found that an explicit awareness of the morphological structure of the language related more to fluency and tests that accessed an innate and implicit Morphological Awareness had the strongest correlations overall with comprehension. The findings from this report have implications regarding how future curriculum developments for morphologically rich languages like isiXhosa should be approached. The positive and practical implications of including different types of Morphological Awareness tutoring in curricula is argued for, especially when teaching younger readers how to approach morphologically complex words in texts
A Bigger Fish to Fry:Scaling up the Automatic Understanding of Idiomatic Expressions
In this thesis, we are concerned with idiomatic expressions and how to handle them within NLP. Idiomatic expressions are a type of multiword phrase which have a meaning that is not a direct combination of the meaning of its parts, e.g. 'at a crossroads' and 'move the goalposts'.In Part I, we provide a general introduction to idiomatic expressions and an overview of observations regarding idioms based on corpus data. In addition, we discuss existing research on idioms from an NLP perspective, providing an overview of existing tasks, approaches, and datasets. In Part II, we focus on the building of a large idiom corpus, consisting of developing a system for the automatic extraction of potentially idiom expressions and building a large corpus of idiom using crowdsourced annotation. Finally, in Part III, we improve an existing unsupervised classifier and compare it to other existing classifiers. Given the relatively poor performance of this unsupervised classifier, we also develop a supervised deep neural network-based system and find that a model involving two separate modules looking at different information sources yields the best performance, surpassing previous state-of-the-art approaches.In conclusion, this work shows the feasibility of building a large corpus of sense-annotated potentially idiomatic expressions, and the benefits such a corpus provides for further research. It provides the possibility for quick testing of hypotheses about the distribution and usage of idioms, it enables the training of data-hungry machine learning methods for PIE disambiguation systems, and it permits fine-grained, reliable evaluation of such systems
- …