
    Topic Detection on Twitter Using Deep Learning Method with Feature Expansion GloVe

    Twitter is a medium for communication, information sharing, and the exchange of opinions on topics with an extensive reach. A tweet is limited to a text message of 280 characters, so tweets are written briefly, often use slang, and may not follow structured grammar. This diverse vocabulary leads to word discrepancies that make tweets difficult to understand, and a problem often found when classifying topics in tweets is the resulting lack of accuracy. Therefore, the authors used GloVe feature expansion to reduce vocabulary discrepancies by building a corpus from Twitter and IndoNews. Topic classification of tweets has previously been studied extensively with various Machine Learning and Deep Learning methods using feature expansion; however, to the best of our knowledge, Hybrid Deep Learning has not previously been used for topic classification on Twitter. This study therefore conducted experiments to analyze the impact of Hybrid Deep Learning and GloVe feature expansion on topic classification. The data used in this study comprised 55,411 Indonesian-language tweets. The methods used were Convolutional Neural Network (CNN), Recurrent Neural Network (RNN), and Hybrid CNN-RNN. The results show that, with GloVe feature expansion, the CNN achieved an accuracy of 92.80%, an increase of 0.40% over the baseline, and the RNN achieved 93.72%, an improvement of 0.23%. The CNN-RNN Hybrid Deep Learning model achieved the highest accuracy, 94.56%, a significant increase of 2.30%, while the RNN-CNN model also performed well, reaching 94.39% with a 0.95% increase. Based on these accuracy results, the Hybrid Deep Learning models with feature expansion significantly improved the system's performance
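The feature-expansion step described above can be sketched as mapping rare or slang tokens onto nearby words in a GloVe embedding space built from the similarity corpus. The vectors, words, and replace-with-nearest-neighbour policy below are illustrative assumptions for a minimal sketch, not the paper's exact procedure.

```python
import math

# Toy GloVe-style vectors; in the paper these would be trained on a
# Twitter + IndoNews corpus (all words and values here are illustrative).
embeddings = {
    "gd":    [0.88, 0.12, 0.02],   # slang spelling of "good"
    "good":  [0.90, 0.10, 0.00],
    "movie": [0.10, 0.90, 0.10],
    "film":  [0.12, 0.88, 0.08],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def expand(tokens, vocab, top_k=1):
    """Replace out-of-vocabulary tokens with their nearest in-vocabulary
    neighbours in the embedding space (the 'feature expansion' step)."""
    out = []
    for tok in tokens:
        if tok in vocab or tok not in embeddings:
            out.append(tok)          # known word, or no vector: keep as-is
            continue
        ranked = sorted(
            (w for w in vocab if w in embeddings),
            key=lambda w: cosine(embeddings[tok], embeddings[w]),
            reverse=True,
        )
        out.extend(ranked[:top_k] or [tok])
    return out

print(expand(["gd", "film"], vocab={"good", "movie", "film"}))
# ['good', 'film'] -- the slang "gd" is mapped to its nearest neighbour
```

The expanded token sequences would then be fed to the CNN, RNN, or hybrid classifier in place of the raw tweets.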

    Applications of Mining Arabic Text: A Review

    Since the emergence of text mining, there has been growing interest in applying text mining tasks to text written in Arabic, and researchers face several challenges in doing so. These tasks include Arabic text summarization, which is one of the challenging open research areas in natural language processing (NLP) and text mining; Arabic text categorization; and Arabic sentiment analysis. This chapter reviews past and current research and trends in these areas, along with future challenges that need to be tackled. It also presents case studies for two of the reviewed approaches

    Developing a Methodology Based on Soft Frequent Pattern Mining to Detect Important Events in Arabic Microblogs

    Recently, microblogs have become a new communication medium, allowing millions of users to post and share content about their own activities and opinions on different topics. Posting about occurring real-world events has attracted people to follow events through microblogs instead of mainstream media. As a result, there is an urgent need to detect events from microblogs so that users can identify events quickly and, more importantly, so that authorities can respond faster to occurring events by taking proper action. While considerable research has been conducted on event detection in English, the Arabic context has received little attention even though there are millions of Arabic-speaking users. Existing approaches also rely on platform-dependent features such as hashtags, mentions, and retweets, which causes them to fail when these features are not present. In addition, approaches that depend only on the presence of frequently used words do not always detect real events, because they cannot differentiate events from general viral topics. In this thesis, we propose an approach for Arabic event detection from microblogs. We first collect the data, then apply a preprocessing step to enhance data quality and reduce noise. The sentence text is analyzed and part-of-speech tags are identified. A set of rules is then used to extract event-indicating keywords called event triggers. The frequency of each event trigger is calculated; event triggers with frequencies higher than the average are kept, and the rest are removed. We detect events by clustering similar event triggers together, applying an adapted soft frequent pattern mining algorithm to the remaining event triggers. To evaluate the proposed approach, we used a dataset called Evetar, which contains tweets covering different types of Arabic events that occurred over a one-month period. We split the dataset into subsets using different time intervals so that we could mimic the streaming behavior of microblogs, and we used precision, recall, and F-measure as evaluation metrics. The highest average F-measure achieved was 0.717. Our results were acceptable compared to three popular approaches applied to the same dataset
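The trigger-filtering and clustering steps described in the abstract can be sketched minimally as follows. The tweets and triggers are toy data, and the naive co-occurrence grouping stands in for the adapted soft frequent pattern mining algorithm the thesis actually uses.

```python
from collections import Counter

# Hypothetical event-trigger keywords extracted per tweet by the
# POS-rule step (tweets and trigger words here are illustrative).
tweets_triggers = [
    ["earthquake", "collapse"],
    ["earthquake", "rescue"],
    ["match", "goal"],
    ["match"],
    ["weather"],
]

# 1. Keep only triggers whose frequency exceeds the average frequency.
freq = Counter(t for triggers in tweets_triggers for t in triggers)
avg = sum(freq.values()) / len(freq)
kept = {t for t, c in freq.items() if c > avg}

# 2. Cluster the surviving triggers: a simple stand-in that merges
# triggers co-occurring in the same tweet into one group.
clusters = []
for triggers in tweets_triggers:
    group = kept.intersection(triggers)
    if not group:
        continue
    for c in clusters:
        if c & group:        # overlaps an existing cluster: merge
            c |= group
            break
    else:
        clusters.append(set(group))

print(sorted(kept), clusters)
```

Each resulting cluster of triggers would correspond to one detected event candidate.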

    Mixed-Language Arabic-English Information Retrieval

    This thesis attempts to address the problem of mixed querying in cross-language information retrieval (CLIR). It proposes mixed-language (language-aware) approaches in which mixed queries are used to retrieve the most relevant documents, regardless of their languages. To achieve this goal, however, it is first essential to suppress the impact of the problems caused by the mixed-language nature of both queries and documents, which bias the final ranked list. Therefore, a cross-lingual re-weighting model was developed. In this cross-lingual model, the term frequency, document frequency, and document length components of mixed queries are estimated and adjusted regardless of language, while the model also considers uniquely mixed-language features of queries and documents, such as terms co-occurring in two different languages. Furthermore, in mixed queries, non-technical terms (mostly those in the non-English language) are likely to be overweighted and to skew the impact of technical terms (mostly those in English), because the latter have high document frequencies (and thus low weights) in their corresponding, mostly English, collection. This phenomenon is caused by the dominance of the English language in scientific domains. Accordingly, this thesis also proposes a re-weighted Inverse Document Frequency (IDF) to moderate the effect of overweighted terms in mixed queries
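The abstract does not give the re-weighting formula, so the following is only a plausible sketch of the idea: blend each term's IDF with a collection-neutral value so that technical English terms with high document frequency are not drowned out by high-IDF terms from the other language. The `alpha` parameter and the blending scheme are assumptions, not the thesis's actual model.

```python
import math

def idf(df, n_docs):
    """Standard smoothed inverse document frequency."""
    return math.log((n_docs + 1) / (df + 1)) + 1

def moderated_idf(df, n_docs, alpha=0.5):
    """Illustrative moderation: pull a term's IDF toward the IDF of a
    mid-frequency term in the same collection, so common (low-IDF)
    technical terms are lifted and rare terms are damped."""
    base = idf(df, n_docs)
    neutral = idf(n_docs // 2, n_docs)   # IDF of a mid-frequency term
    return alpha * base + (1 - alpha) * neutral

# A very common English technical term (df=9000 of 10000 docs) gains
# weight relative to plain IDF; a very rare term loses some.
print(idf(9000, 10000), moderated_idf(9000, 10000))
print(idf(10, 10000), moderated_idf(10, 10000))
```

In a real mixed-query ranker this moderated IDF would replace plain IDF only for terms identified as technical terms in the dominant collection.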

    Automatic summarization of real world events using Twitter

    Microblogging sites, such as Twitter, have become increasingly popular in recent years for reporting details of real-world events via the Web. Smartphone apps enable people to communicate with a global audience to express their opinions and commentate on ongoing situations, often while geographically proximal to the event. Due to the heterogeneity and scale of the data, and the fact that some messages are more salient than others for understanding any risk to human safety and managing any disruption caused by events, automatic summarization of event-related microblogs is a non-trivial and important problem. In this paper we tackle the task of automatic summarization of Twitter posts, and present three methods that produce summaries by selecting the most representative posts from real-world tweet-event clusters. To evaluate our approaches, we compare them to state-of-the-art summarization systems and human-generated summaries. Our results show that our proposed methods outperform all the other summarization systems for English and non-English corpora
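One plausible reading of "selecting the most representative post" is centroid-based selection. The sketch below, with a toy cluster and plain term-frequency vectors, is an assumption for illustration; the paper's three concrete methods are not detailed in the abstract.

```python
import math
from collections import Counter

def tf_vector(text):
    """Plain term-frequency vector over whitespace tokens."""
    return Counter(text.lower().split())

def cosine(a, b):
    shared = set(a) & set(b)
    num = sum(a[t] * b[t] for t in shared)
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def most_representative(cluster):
    """Return the post closest (by cosine) to the cluster's term centroid."""
    centroid = Counter()
    for post in cluster:
        centroid.update(tf_vector(post))
    return max(cluster, key=lambda p: cosine(tf_vector(p), centroid))

cluster = [
    "train delayed at the station",
    "huge delays on trains this morning",
    "train delayed this morning at the station",
]
print(most_representative(cluster))
# train delayed this morning at the station
```

The selected post then serves as the one-line summary of its event cluster.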

    New techniques and framework for sentiment analysis and tuning of CRM structure in the context of Arabic language

    A thesis submitted to the University of Bedfordshire in partial fulfilment of the requirements for the degree of Doctor of Philosophy.
    Knowing customers' opinions regarding services received has always been important for businesses. It has been acknowledged that both Customer Experience Management (CEM) and Customer Relationship Management (CRM) can help companies make informed decisions to improve their performance in the decision-making process. However, real-world applications are not so straightforward. A company may face hard decisions over the differences between the opinions predicted by CRM and the actual opinions collected in CEM via social media platforms. Until recently, how to integrate the unstructured feedback from CEM directly into CRM, especially for the Arabic language, was still an open question. Furthermore, accurate labelling of unstructured feedback is essential for the quality of CEM. Finally, CRM needs to be tuned and revised based on the feedback from social media to realise its full potential, yet the tuning mechanism for CEM at different levels has not yet been clarified. Facing these challenges, this thesis presents key techniques and a framework to integrate Arabic sentiment analysis into CRM. First, as text pre-processing and classification are considered crucial to sentiment classification, an investigation is carried out to find the optimal techniques for the pre-processing and classification of Arabic sentiment analysis, and recommendations for sentiment classification in Modern Standard Arabic (MSA) as well as Saudi dialects are proposed. Second, to deal with the complexities of the Arabic language and to help operators identify possible conflicts in their original labelling, this study proposes techniques to improve the labelling process of Arabic sentiment analysis with the introduction of neutral classes and relabelling.
    Finally, a framework for adjusting CRM via CEM is proposed, addressing both the structure of the CRM system (at the sentence level) and the inaccuracy of the criteria or weights employed in the CRM system (at the aspect level). To ensure the robustness and repeatability of the proposed techniques and framework, the results of the study are further validated with real-world applications from different domains

    Knowledge-Based Techniques for Scholarly Data Access: Towards Automatic Curation

    Accessing up-to-date, quality scientific literature is a critical preliminary step in any research activity. Identifying relevant scholarly literature for a given task or application is, however, a complex and time-consuming activity. Despite the large number of tools developed over the years to support scholars in surveying the literature, such as Google Scholar, Microsoft Academic Search, and others, the best way to access quality papers remains asking a domain expert who is actively involved in the field and knows its research trends and directions. State-of-the-art systems, in fact, either do not allow exploratory search, such as identifying the active research directions within a given topic, or do not offer proactive features, such as content recommendation, both of which are critical to researchers. To overcome these limitations, we strongly advocate a paradigm shift in the development of scholarly data access tools: moving from traditional information retrieval and filtering tools towards automated agents able to make sense of the textual content of published papers and thereby monitor the state of the art. Building such a system is, however, a complex task that implies tackling non-trivial problems in the fields of Natural Language Processing, Big Data Analysis, User Modelling, and Information Filtering. In this work, we introduce the concept of an Automatic Curator System and present its fundamental components.
    Dottorato di ricerca in Informatica. De Nart, Dari

    Can we predict a riot? Disruptive event detection using Twitter

    In recent years, there has been increased interest in real-world event detection using publicly accessible data made available through Internet technology such as Twitter, Facebook, and YouTube. In these highly interactive systems, the general public are able to post real-time reactions to “real world” events, thereby acting as social sensors of terrestrial activity. Automatically detecting and categorizing events, particularly small-scale incidents, using streamed data is a non-trivial task but would be of high value to public safety organisations such as local police, who need to respond accordingly. To address this challenge, we present an end-to-end integrated event detection framework that comprises five main components: data collection, pre-processing, classification, online clustering, and summarization. The integration between classification and clustering enables events to be detected, as well as related smaller-scale “disruptive events,” smaller incidents that threaten social safety and security or could disrupt social order. We present an evaluation of the effectiveness of detecting events using a variety of features derived from Twitter posts, namely temporal, spatial, and textual content. We evaluate our framework on a large-scale, real-world dataset from Twitter. Furthermore, we apply our event detection system to a large corpus of tweets posted during the August 2011 riots in England. We use ground-truth data based on intelligence gathered by the London Metropolitan Police Service, which provides a record of actual terrestrial events and incidents during the riots, and show that our system can perform as well as terrestrial sources, and even better in some cases
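The online-clustering component of such a framework might look like the following single-pass sketch. The threshold value, the Jaccard similarity measure, and the use of textual content only are illustrative assumptions; the paper's framework also exploits temporal and spatial features.

```python
def jaccard(a, b):
    """Jaccard similarity between two token sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def online_cluster(stream, threshold=0.3):
    """Single-pass clustering over a tweet stream: each incoming post
    joins the most similar existing cluster (token-set Jaccard against
    the cluster's accumulated vocabulary) or starts a new one."""
    clusters = []                       # each item: [vocabulary, posts]
    for post in stream:
        tokens = set(post.lower().split())
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = jaccard(tokens, cluster[0])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best:
            best[0] |= tokens           # grow the cluster vocabulary
            best[1].append(post)
        else:
            clusters.append([set(tokens), [post]])
    return [posts for _, posts in clusters]

stream = [
    "fire in camden street",
    "big fire camden",
    "football tonight",
    "football match tonight",
]
print(online_cluster(stream))
```

In the full pipeline, each cluster would then be passed to the classification and summarization stages to decide whether it represents a disruptive event.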