
    Arabic Text Mining

    The rapid growth of the internet has increased the number of online texts, including texts in Arabic. This enormous volume of text must be organized into classes to make analysis and retrieval easier, so text classification is a key component of text mining. Numerous systems and approaches exist for categorizing texts in English, other European languages (French, German, Spanish), and Asian languages (Chinese, Japanese). In contrast, there are relatively few studies on categorizing Arabic texts, owing to the difficulty of the Arabic language. This work briefly explains key ideas relevant to Arabic text mining, then presents a new classification system for Arabic that uses light stemming and a Naïve Bayesian classifier (CNB). The corpus contains texts from two classes, politics and sports. New texts added to the system were classified correctly, demonstrating the effectiveness of the system
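The combination of light stemming and a Naïve Bayes classifier described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' system: the affix lists, the toy training sentences, and the names `light_stem`, `tokenize`, and `NaiveBayes` are all assumptions, and the classifier is a plain multinomial Naïve Bayes with add-one smoothing.

```python
import math
from collections import Counter, defaultdict

# Illustrative affix lists (an assumption, not the paper's stemmer).
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل")
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "ية", "ه", "ة", "ي")


def light_stem(word):
    """Strip at most one prefix and one suffix, keeping at least 3 letters."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= 3:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[:-len(suffix)]
            break
    return word


def tokenize(text):
    """Split on whitespace and light-stem every token."""
    return [light_stem(w) for w in text.split()]


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for doc, label in zip(docs, labels):
            self.word_counts[label].update(tokenize(doc))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for w in tokenize(doc):
                if w in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy corpus with the two classes named in the abstract.
train_docs = [
    "الحكومة تناقش قانون الانتخابات",
    "الرئيس يلتقي الوزراء في البرلمان",
    "الفريق يفوز في المباراة",
    "اللاعب يسجل هدف في الملعب",
]
train_labels = ["politics", "politics", "sports", "sports"]

model = NaiveBayes().fit(train_docs, train_labels)
sports_guess = model.predict("المباراة في الملعب")
politics_guess = model.predict("قانون البرلمان")
```

Light stemming removes only a few frequent affixes rather than reducing words to roots, so surface variants such as a noun with and without the definite article map to the same feature while distinct words mostly stay distinct.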

    An Ontology based Text-to-Picture Multimedia m-Learning System

    Multimedia Text-to-Picture is the process of building mental representation from words associated with images. From the research aspect, multimedia instructional message items are illustrations of material using words and pictures that are designed to promote user realization. Illustrations can be presented in a static form such as images, symbols, icons, figures, tables, charts, and maps, or in a dynamic form such as animation or video clips. Due to the intuitiveness and vividness of visual illustration, many text-to-picture systems have been proposed in the literature, such as Word2Image, Chat with Illustrations, and many others discussed in the literature review chapter of this thesis. However, we found that these systems share some common limitations, especially in the images they present. In fact, the retrieved materials are not fully suitable for educational purposes. Many of them are not context-based and do not take into consideration the needs of learners (i.e., they are general-purpose images). Manually finding the pedagogic images required to illustrate educational content for learners is inefficient and requires huge effort, which makes it a very challenging task. In addition, the available learning systems that mine text based on keyword or sentence selection provide incomplete pedagogic illustrations, because words and their semantically related terms are not considered during the process of finding illustrations. In this dissertation, we propose new approaches based on the semantic conceptual graph and semantically distributed weights to mine optimal illustrations that match Arabic text in the children’s story domain. We combine these approaches with the best keyword and sentence selection algorithms in order to improve the retrieval of images matching the Arabic text. Our findings show significant improvements in modelling Arabic vocabulary with the most meaningful images and the best coverage of the domain in discourse.
We also develop a mobile Text-to-Picture System that has two novel features: (1) a conceptual graph visualization (CGV) and (2) a visual illustrative assessment. The CGV shows the relationship between terms associated with a picture. It enables learners to discover the semantic links between Arabic terms and improve their understanding of Arabic vocabulary. The assessment component allows the instructor to automatically follow up on the performance of learners. Our experiments demonstrate the efficiency of our multimedia text-to-picture system in enhancing learners’ knowledge and boosting their comprehension of Arabic vocabulary

    Towards building a standard dataset for Arabic keyphrase extraction evaluation

    Keyphrases are short phrases that best represent a document's content. They can be useful in a variety of applications, including document summarization and retrieval models. In this paper, we introduce the first dataset of keyphrases for an Arabic document collection, obtained by means of crowdsourcing. We experimentally evaluate different strategies for aggregating crowdsourced answers and validate their performance against expert annotations to evaluate the quality of our dataset. We report about our experimental results, the dataset features

    Ontological Approach for Semantic Modelling of Malay Translated Qur’an

    This thesis contributes to the areas of ontology development and analysis, natural language processing (NLP), Information Retrieval (IR), and Language Resource and Corpus Development. Research in Natural Language Processing and semantic search for English has shown successful results for more than a decade. However, it is difficult to adapt those techniques to the Malay language, because its complex morphology and orthographic forms are very different from English. Moreover, limited resources and tools for computational linguistic analysis are available for Malay. In this thesis, we address those issues and challenges by proposing MyQOS, the Malay Qur’an Ontology System, a prototype ontology-based IR with semantics for representing and accessing a Malay translation of the Qur’an. This supports the development of a semantic search engine and a question answering system, and provides a framework for storing and accessing a Malay language corpus and providing computational linguistics resources. The primary use of MyQOS in the current research is for creating and improving the quality and accuracy of the query mechanism to retrieve information embedded in the Malay text of the Qur’an translation. To demonstrate the feasibility of this approach, we describe a new architecture of morphological analysis for MyQOS and query algorithms based on MyQOS. Evaluation used two measures, precision and recall, with data obtained from the MyQOS Corpus across three search engines. The precision and recall for semantic search are 0.8409 (84%) and 0.8043 (80%), well above the results of the question-answer search, which are 0.4971 (50%) for precision and 0.6027 (60%) for recall. The semantic search gives high precision and high recall compared with the other two methods. This indicates that semantic search returns more relevant results than irrelevant ones.
To conclude, this research is among the studies on retrieval of Qur’an texts in the Malay language that have managed to outline state-of-the-art information retrieval system models. Thus, the use of MyQOS will help Malay readers understand the Qur’an better. Furthermore, the creation of a Malay language corpus and computational linguistics resources will benefit other researchers, especially in religious texts, morphological analysis, and semantic modelling
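For readers unfamiliar with the two measures reported in this abstract, the sketch below shows how precision and recall are conventionally computed for a single query. The function name `precision_recall` and the document identifiers are hypothetical, and the numbers are unrelated to the MyQOS results.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |retrieved ∩ relevant| / |retrieved|
    recall    = |retrieved ∩ relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: the engine returns four passages, and three passages
# in the collection are actually relevant.
retrieved_ids = ["v1", "v2", "v3", "v4"]
relevant_ids = ["v2", "v3", "v5"]

precision, recall = precision_recall(retrieved_ids, relevant_ids)
```

A precision above 0.5 means that more than half of the returned items are relevant, which is the sense in which a high-precision engine returns more relevant results than irrelevant ones.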

    Sentiment analysis of Arabic tweets in e-learning

    In this study, we present the design and implementation of Arabic text classification of university students' opinions using different algorithms such as Support Vector Machine (SVM) and Naive Bayes (NB). The aim of the study is to develop a framework for analysing Twitter "tweets" as having negative, positive or neutral sentiments in education or, in other words, to illustrate the relationship between the sentiments conveyed in Arabic tweets and the students' learning experiences at universities. Two experiments were carried out, one using negative and positive classes only and the other adding a neutral class. The results show that, for Arabic sentiment, an SVM with n-gram features achieved higher accuracy than NB both when using negative and positive classes only and when including the neutral class

    Semantic Systems. The Power of AI and Knowledge Graphs

    This open access book constitutes the refereed proceedings of the 15th International Conference on Semantic Systems, SEMANTiCS 2019, held in Karlsruhe, Germany, in September 2019. The 20 full papers and 8 short papers presented in this volume were carefully reviewed and selected from 88 submissions. They cover topics such as: web semantics and linked (open) data; machine learning and deep learning techniques; semantic information management and knowledge integration; terminology, thesaurus and ontology management; data mining and knowledge discovery; semantics in blockchain and distributed ledger technologies

    Machine Translation Vs. Multilingual Dictionaries: Assessing Two Strategies for the Topic Modeling of Multilingual Text Collections

    The goal of this paper is to evaluate two methods for the topic modeling of multilingual document collections: (1) machine translation (MT), and (2) the coding of semantic concepts using a multilingual dictionary (MD) prior to topic modeling. We empirically assess the consequences of these approaches based on both a quantitative comparison of models and a qualitative validation of each method’s potentials and weaknesses. Our case study uses two text collections (of tweets and news articles) in three languages (English, Hebrew, Arabic), covering the ongoing local conflicts between Israeli authorities, settlers, and Palestinian Bedouins in the West Bank. We find that both methods produce a large share of equivalent topics, especially in the context of fairly homogenous news discourse, yet show limited but systematic differences when applied to highly heterogenous social media discourse. While the MD model delivers a more nuanced picture of conflict-related topics, it misses several more peripheral topics, especially those unrelated to the dictionary’s focus, which are picked up by the MT model. Our study is a first step toward instrument validation, indicating that both methods yield valid, comparable results, while method-specific differences remain

    Advances in Automatic Keyphrase Extraction

    The main purpose of this thesis is to analyze and propose new improvements in the field of Automatic Keyphrase Extraction, i.e., the field of automatically detecting the key concepts in a document. We will discuss, in particular, supervised machine learning algorithms for keyphrase extraction, by first identifying their shortcomings and then proposing new techniques which exploit contextual information to overcome them. Keyphrase extraction requires that the key concepts, or keyphrases, appear verbatim in the body of the document. We will identify the fact that current algorithms do not use contextual information when detecting keyphrases as one of the main shortcomings of supervised keyphrase extraction. Instead, statistical and positional cues, like the frequency of the candidate keyphrase or its first appearance in the document, are mainly used to determine if a phrase appearing in a document is a keyphrase or not. For this reason, we will prove that a supervised keyphrase extraction algorithm, by using only statistical and positional features, is actually able to extract good keyphrases from documents written in languages that it has never seen. The algorithm will be trained over a common dataset for the English language, a purpose-collected dataset for the Arabic language, and evaluated on the Italian, Romanian and Portuguese languages as well. This result is then used as a starting point to develop new algorithms that use contextual information to increase the performance in automatic keyphrase extraction. The first algorithm that we present uses new linguistic features based on anaphora resolution, which is a field of natural language processing that exploits the relations between elements of the discourse such as pronouns.
We evaluate several supervised AKE pipelines based on these features on the well-known SEMEVAL 2010 dataset, and we show that the performance increases when we add such features to a model that employs statistical and positional knowledge only. Finally, we investigate the possibilities offered by the field of Deep Learning, by proposing six different deep neural networks that perform automatic keyphrase extraction. Such networks are based on bidirectional long short-term memory networks, or on convolutional neural networks, or on a combination of both of them, and on a neural language model which creates a vector representation of each word of the document. These networks are able to learn new features using the whole document when extracting keyphrases, and they have the advantage of not needing a corpus after being trained to extract keyphrases from new documents. We show that with deep learning based architectures we are able to outperform several other keyphrase extraction algorithms, both supervised and unsupervised, used in the literature, and that the best performances are obtained when we build an additional neural representation of the input document and append it to the neural language model. Both the anaphora-based and the deep-learning based approaches show that using contextual information improves the performance of supervised algorithms for automatic keyphrase extraction. In fact, in the methods presented in this thesis, the algorithms which obtained the best performance are the ones receiving more contextual information, both about the relations of the potential keyphrase with other parts of the document, as in the anaphora based approach, and in the shape of a neural representation of the input document, as in the deep learning approach. In contrast, the approach of using statistical and positional knowledge only allows the building of language agnostic keyphrase extraction algorithms, at the cost of decreased precision and recall

    Inferring Student's Chat Topic in Colloquial Arabic Text using Semantic Representation

    Because colloquial Arabic is now widespread, it is necessary to describe the collection and classification of a multi-dialectal Arabic corpus. Colloquial Arabic now comes in largely country-based forms such as Egyptian, Iraqi, Levantine, and Tunisian. This paper discusses a new method for analyzing conversations in an educational chat room using a corpus for Palestinian Arabic and the Stanford Tagger. The method represents the keywords using a semantic-net-like representation to obtain the main subjects of the conversation. The main subject of the chat is obtained using the proposed method, which shows high accuracy. Using the Arabic corpus, the Stanford Tagger, and the percentage of words adds further accuracy. The study also examines the effect of pivot distribution based on the occurrences and betweenness values of the pivots over the text. It examines some characteristics of texts written in colloquial Arabic dialect and analyzes free expressive Arabic statements. The results show that the core can be determined by combining both the occurrences and the distribution of the word over the conversation

    Classifying the suras by their lexical semantics: an exploratory multivariate analysis approach to understanding the Qur'an

    PhD Thesis. The Qur'an is at the heart of Islamic culture. Careful, well-informed interpretation of it is fundamental both to the faith of millions of Muslims throughout the world, and also to the non-Islamic world's understanding of their religion. There is a long and venerable tradition of Qur'anic interpretation, and it has necessarily been based on literary-historical methods for exegesis of hand-written and printed text. Developments in electronic text representation and analysis since the second half of the twentieth century now offer the opportunity to supplement traditional techniques by applying the newly-emergent computational technology of exploratory multivariate analysis to interpretation of the Qur'an. The general aim of the present discussion is to take up that opportunity. Specifically, the discussion develops and applies a methodology for discovering the thematic structure of the Qur'an based on a fundamental idea in a range of computationally oriented disciplines: that, with respect to some collection of texts, the lexical frequency profiles of the individual texts are a good indicator of their semantic content, and thus provide a reliable criterion for their conceptual categorization relative to one another. This idea is applied to the discovery of thematic interrelationships among the suras that constitute the Qur'an by abstracting lexical frequency data from them and then analyzing that data using exploratory multivariate methods in the hope that this will generate hypotheses about the thematic structure of the Qur'an. The discussion is in eight main parts. The first part introduces the discussion. The second gives an overview of the structure and thematic content of the Qur'an and of the tradition of Qur'anic scholarship devoted to its interpretation. The third part defines the research question to be addressed together with a methodology for doing so. The fourth reviews the existing literature on the research question.
The fifth outlines general principles of data creation and applies them to creation of the data on which the analysis of the Qur'an in this study is based. The sixth outlines general principles of exploratory multivariate analysis, describes in detail the analytical methods selected for use, and applies them to the data created in part five. The seventh part interprets the results of the analyses conducted in part six with reference to the existing results in Qur'anic interpretation described in part two. And, finally, the eighth part draws conclusions relative to the research question and identifies directions along which the work presented in this study can be developed
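The idea that lexical frequency profiles indicate how texts relate to one another can be illustrated with a small sketch. It is not the thesis's method or data: the placeholder tokens, the relative-frequency weighting, the cosine measure, and the names `frequency_profile` and `cosine_similarity` are assumptions chosen for brevity.

```python
import math
from collections import Counter


def frequency_profile(text, vocabulary):
    """Relative frequency of each vocabulary word in the text."""
    counts = Counter(text.split())
    total = sum(counts.values()) or 1
    return [counts[w] / total for w in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two frequency vectors (0.0 if either is empty)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Placeholder texts made of arbitrary tokens, not real sura content.
texts = {
    "A": "w1 w2 w3 w1",
    "B": "w3 w1 w2 w2",
    "C": "w4 w5 w6 w4",
}

vocabulary = sorted({w for t in texts.values() for w in t.split()})
profiles = {name: frequency_profile(t, vocabulary) for name, t in texts.items()}

sim_ab = cosine_similarity(profiles["A"], profiles["B"])  # shared vocabulary
sim_ac = cosine_similarity(profiles["A"], profiles["C"])  # no shared vocabulary
```

Texts that draw on the same words in similar proportions end up close together under such a measure, and a matrix of pairwise similarities is the kind of input that exploratory methods such as hierarchical clustering operate on.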