1,089 research outputs found

    Context Modeling for Ranking and Tagging Bursty Features in Text Streams

    Bursty features in text streams are very useful in many text mining applications. Most existing studies detect bursty features based purely on term frequency changes without taking into account the semantic contexts of terms, and as a result the detected bursty features may not always be interesting or easy to interpret. In this paper we propose to model the contexts of bursty features using a language modeling approach. We then propose a novel topic diversity-based metric that uses the context models to find newsworthy bursty features. We also propose to use the context models to automatically assign meaningful tags to bursty features. Using a large corpus drawn from a stream of news articles, we quantitatively show that the proposed context language models can effectively help rank bursty features by newsworthiness and assign meaningful tags to annotate them. © 2010 ACM.

    Engineering Crowdsourced Stream Processing Systems

    A crowdsourced stream processing system (CSP) is a system that incorporates crowdsourced tasks in the processing of a data stream. This can be seen as enabling crowdsourcing work to be applied to a sample of large-scale data at high speed, or equivalently, enabling stream processing to employ human intelligence. It also leads to a substantial expansion of the capabilities of data processing systems. Engineering a CSP system requires the combination of human and machine computation elements. From a general systems theory perspective, this means taking into account inherited as well as emerging properties from both these elements. In this paper, we position CSP systems within a broader taxonomy, outline a series of design principles and evaluation metrics, present an extensible framework for their design, and describe several design patterns. We showcase the capabilities of CSP systems by performing a case study that applies our proposed framework to the design and analysis of a real system (AIDR) that classifies social media messages during time-critical crisis events. Results show that, compared to a pure stream processing system, AIDR can achieve a higher data classification accuracy, while compared to a pure crowdsourcing solution, the system makes better use of human workers by requiring much less manual work effort.

    Developing a Methodology Based on Soft Frequent Pattern Mining for Detecting Significant Events in Arabic Microblogs

    Microblogs have recently become a new communication medium between users. They allow millions of users to post and share content about their own activities and their opinions on different topics. Posts about ongoing real-world events have led people to follow events through microblogs instead of mainstream media. As a result, there is an urgent need to detect events from microblogs so that users can identify events quickly and, more importantly, so that authorities can respond faster to ongoing events by taking proper action. While considerable research has been conducted on event detection for the English language, Arabic content has not received much attention, even though there are millions of Arabic-speaking users. In addition, existing approaches rely on platform-dependent features such as hashtags, mentions, and retweets, so they fail when these features are not present. Furthermore, approaches that depend only on the presence of frequently used words do not always detect real events, because they cannot differentiate events from general viral topics. In this thesis, we propose an approach for Arabic event detection from microblogs. We first collect the data, then apply a preprocessing step to enhance the data quality and reduce noise. The sentence text is analyzed and the part-of-speech tags are identified. Then a set of rules is used to extract event indicator keywords, called event triggers. The frequency of each event trigger is calculated; event triggers with frequencies higher than the average are kept, and the rest are removed. We detect events by clustering similar event triggers together: an adapted soft frequent pattern mining algorithm is applied to the remaining event triggers. We used a dataset called Evetar to evaluate the proposed approach. The dataset contains tweets that cover different types of Arabic events that occurred in a one-month period. We split the dataset into different subsets using different time intervals, so that we can mimic the streaming behavior of microblogs. We used precision, recall, and F-measure as evaluation metrics. The highest average F-measure achieved was 0.717. Our results were acceptable compared to three popular approaches applied to the same dataset.
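
The frequency-filtering step described in this abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function name `filter_triggers`, the sample trigger words, and the choice to take the average over distinct triggers with a strict "greater than" comparison are all assumptions made for illustration.

```python
from collections import Counter


def filter_triggers(triggers):
    """Keep event triggers whose frequency is above the average frequency.

    Assumption: the average is the mean count over distinct triggers,
    and "higher than the average" is read as a strict comparison.
    """
    counts = Counter(triggers)
    if not counts:
        return {}
    mean = sum(counts.values()) / len(counts)
    return {trigger: n for trigger, n in counts.items() if n > mean}


# Hypothetical trigger words extracted from a batch of posts.
sample = ["explosion", "explosion", "explosion", "election", "election", "match"]
kept = filter_triggers(sample)
print(kept)
```

The surviving triggers would then be passed to the clustering stage, which this sketch does not cover.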

    Image Understanding by Socializing the Semantic Gap

    Several technological developments like the Internet, mobile devices and social networks have spurred the sharing of images in unprecedented volumes, making tagging and commenting a common habit. Despite the recent progress in image analysis, the problem of the Semantic Gap still prevents machines from fully understanding the rich semantics of a shared photo. In this book, we tackle this problem by exploiting social network contributions. A comprehensive treatise of three linked problems on image annotation is presented, with a novel experimental protocol used to test eleven state-of-the-art methods. Three novel approaches to annotate an image, understand its sentiment and predict its popularity are presented. We conclude with the many challenges and opportunities ahead for the multimedia community.

    Detecting and Tracking the Spread of Astroturf Memes in Microblog Streams

    Online social media are complementing and in some cases replacing person-to-person social interaction and redefining the diffusion of information. In particular, microblogs have become crucial grounds on which public relations, marketing, and political battles are fought. We introduce an extensible framework that will enable the real-time analysis of meme diffusion in social media by mining, visualizing, mapping, classifying, and modeling massive streams of public microblogging events. We describe a Web service that leverages this framework to track political memes in Twitter and help detect astroturfing, smear campaigns, and other misinformation in the context of U.S. political elections. We present some cases of abusive behaviors uncovered by our service. Finally, we discuss promising preliminary results on the detection of suspicious memes via supervised learning based on features extracted from the topology of the diffusion networks, sentiment analysis, and crowdsourced annotations.