
    The GW/LT3 VarDial 2016 shared task system for dialects and similar languages detection

    This paper describes the GW/LT3 contribution to the 2016 VarDial shared task on the identification of similar languages (task 1) and Arabic dialects (task 2). For both tasks, we experimented with Logistic Regression and Neural Network classifiers in isolation. Additionally, we implemented a cascaded classifier consisting of coarse and fine-grained classifiers for task 1 and a classifier ensemble with majority voting for task 2. The submitted systems obtained state-of-the-art performance and ranked first in the evaluation on social media data (test sets B1 and B2 for task 1), with a maximum weighted F1 score of 91.94%.
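    The abstract mentions two system designs: a cascaded coarse-to-fine classifier for task 1 and a majority-voting ensemble for task 2. The sketch below illustrates both ideas with scikit-learn; the character n-gram features, hyperparameters, and variable names are assumptions for illustration, not the submitted systems' actual configuration.

```python
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

def char_ngram_features():
    # Character n-grams are a common feature choice for dialect identification
    # (an assumption here; the paper's exact feature set is not given above).
    return TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 5))

# Task 2 style: classifier ensemble combined by hard (majority) voting.
# A real majority vote would use an odd number of members to avoid ties.
ensemble = make_pipeline(
    char_ngram_features(),
    VotingClassifier(
        estimators=[
            ("logreg", LogisticRegression(max_iter=1000)),
            ("mlp", MLPClassifier(hidden_layer_sizes=(100,), max_iter=300)),
        ],
        voting="hard",
    ),
)

# Task 1 style: a cascade in which a coarse classifier predicts the language
# group and a per-group fine-grained classifier predicts the exact variety.
coarse = make_pipeline(char_ngram_features(), LogisticRegression(max_iter=1000))
fine_grained = {}  # one classifier per coarse group, trained on that group's data

# Usage with hypothetical data: ensemble.fit(train_texts, train_labels)
```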

    Exploring Twitter as a Source of an Arabic Dialect Corpus

    Given the lack of Arabic dialect text corpora in comparison with what is available for dialects of English and other languages, there is a need to create dialect text corpora for use in Arabic natural language processing. Moreover, Arabic dialects are increasingly used in social media, so such text is now considered an appropriate source for a corpus. We collected 210,915K tweets from five groups of Arabic dialects: Gulf, Iraqi, Egyptian, Levantine, and North African. This paper explores Twitter as a source and describes the methods that we used to extract tweets and classify them according to the geographic location of the sender. We classified Arabic dialects using the Waikato Environment for Knowledge Analysis (WEKA) data analytics tool, which contains many alternative filters and classifiers for machine learning. Our approach to classifying tweets achieved an accuracy of 79%.
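    The pipeline summarised above labels tweets by the sender's geographic location and then trains a classifier (in the paper's case with WEKA). A rough Python equivalent is sketched below with scikit-learn; the country-to-dialect mapping, field names, and model choice are illustrative assumptions, not the paper's actual setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Assumed mapping from sender country code to dialect group (illustrative only).
COUNTRY_TO_DIALECT = {
    "SA": "Gulf", "KW": "Gulf", "IQ": "Iraqi", "EG": "Egyptian",
    "LB": "Levantine", "SY": "Levantine", "MA": "North African", "DZ": "North African",
}

def label_tweet(tweet):
    """Assign a dialect-group label from the sender's country code, if known."""
    return COUNTRY_TO_DIALECT.get(tweet.get("country_code"))

# Bag-of-words features plus Naive Bayes stand in here for the WEKA filters
# and classifiers used in the paper.
model = make_pipeline(CountVectorizer(), MultinomialNB())
# model.fit(tweet_texts, dialect_labels); model.score(test_texts, test_labels)
```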

    Developing resources for sentiment analysis of informal Arabic text in social media

    Natural Language Processing (NLP) applications such as text categorization, machine translation, and sentiment analysis need annotated corpora and lexicons to check quality and performance. This paper describes the development of resources for sentiment analysis, specifically for Arabic text in social media. A distinctive feature of the corpora and lexicons developed is that they are derived from informal Arabic that does not conform to grammatical or spelling standards. We refer to Arabic social media content of this sort as Dialectal Arabic (DA): informal Arabic originating from, and potentially mixing, a range of different individual dialects. The paper describes the process adopted for developing corpora and sentiment lexicons for sentiment analysis within different social media and their resulting characteristics. In addition to providing useful NLP data sets for Dialectal Arabic, the work also contributes to understanding the approach to developing corpora and lexicons.
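    As a rough illustration of how a sentiment lexicon of this kind would be applied to informal Arabic text, the snippet below sums word polarities over a tweet; the example entries and whitespace tokenization are placeholders, not the resources developed in the paper.

```python
# Placeholder polarity lexicon; the paper's lexicons are far larger and
# cover informal, dialect-mixed spellings.
SENTIMENT_LEXICON = {
    "حلو": 1.0,   # "nice" (informal)    -> positive
    "زين": 1.0,   # "good" (Gulf usage)  -> positive
    "زفت": -1.0,  # "awful" (informal)   -> negative
}

def score(text):
    """Sum lexicon polarities over whitespace tokens; the sign gives the label."""
    total = sum(SENTIMENT_LEXICON.get(token, 0.0) for token in text.split())
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"
```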

    Using Deep Learning Networks to Predict Telecom Company Customer Satisfaction Based on Arabic Tweets

    Information systems are transforming businesses, which are adopting modern technologies to build new business models based on digital solutions, ultimately leading to the design of novel socio-economic systems. Sentiment analysis is, in this context, a thriving research area. This paper is a case study of Saudi telecommunications (telecom) companies, using sentiment analysis to assess customer satisfaction based on a corpus of Arabic tweets. It compares, for the first time for Saudi social media in telecommunications, the most popular machine learning approach, support vector machines (SVM), with two deep learning approaches: long short-term memory (LSTM) and gated recurrent units (GRU). The study used LSTM and GRU with two different extensions: an attention mechanism and character encoding. It concluded that the bidirectional GRU with an attention mechanism achieved the best performance among the compared models and allowed detection of customer satisfaction in the telecommunication domain with high accuracy.
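    The best-performing model reported above is a bidirectional GRU with an attention mechanism. A minimal Keras sketch of that kind of architecture for binary satisfaction classification is given below; the vocabulary size, sequence length, layer sizes, and attention formulation are assumptions, not the study's actual implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

VOCAB_SIZE = 20000  # assumed vocabulary size
MAX_LEN = 64        # assumed maximum tweet length in tokens

inputs = layers.Input(shape=(MAX_LEN,), dtype="int32")
x = layers.Embedding(VOCAB_SIZE, 128)(inputs)
x = layers.Bidirectional(layers.GRU(64, return_sequences=True))(x)

# Simple additive attention pooling: score each time step, softmax over time,
# then take the weighted sum of the GRU outputs as a fixed-size context vector.
scores = layers.Dense(1, activation="tanh")(x)
weights = layers.Softmax(axis=1)(scores)
context = layers.Lambda(lambda t: tf.reduce_sum(t[0] * t[1], axis=1))([x, weights])

outputs = layers.Dense(1, activation="sigmoid")(context)  # satisfied vs. not satisfied
model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(padded_token_ids, satisfaction_labels, validation_split=0.1)
```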