8 research outputs found

Exploring Online Novelty Detection Using First Story Detection Models

Author: Kelleher John D.
Ross Robert J.
Wang Fei
Publication venue: Dublin Institute of Technology
Publication date: 09/11/2018
Online novelty detection is an important technology in understanding and exploiting streaming data. One application of online novelty detection is First Story Detection (FSD) which attempts to find the very first story about a new topic, e.g. the first news report discussing the “Beast from the East” hitting Ireland. Although hundreds of FSD models have been developed, the vast majority of these only aim at improving the performance of the detection for some specific dataset, and very few focus on the insight of novelty itself. We believe that online novelty detection, framed as an unsupervised learning problem, always requires a clear definition of novelty. Indeed, we argue the definition of novelty is the key issue in designing a good detection model. Within the context of FSD, we first categorise online novelty detection models into three main categories, based on different definitions of novelty scores, and then compare the performances of these model categories in different feature spaces. Our experimental results show that the challenge of FSD varies across novelty scores (and corresponding model categories); and, furthermore, that the detection of novelty in the very popular Word2Vec feature space is more difficult than in a normal frequency-based feature space because of a loss of word specificity.

Crossref

Arrow@TUDublin

Event Detection from Social Media Stream: Methods, Datasets and Opportunities

Author: Chao Yang
Li Dong
Li Quanzhi
Lu Yao
Zhang Chi
Publication date: 28/06/2023
Social media streams contain a large and diverse amount of information, ranging from daily-life stories to the latest global and local events and news. Twitter, especially, allows a fast spread of events happening in real time, and enables individuals and organizations to stay informed of the events happening now. Event detection from social media data poses different challenges from traditional text and is a research area that has attracted much attention in recent years. In this paper, we survey a wide range of event detection methods for Twitter data streams, helping readers understand the recent development in this area. We present the datasets available to the public. Furthermore, a few research opportunities

Comment: 8 pages

arXiv.org e-Print Archive

Sub-story detection in Twitter with hierarchical Dirichlet processes

Author: Bontcheva K.
Hepple M.
Preotiuc-Pietro D.
Srijith P.K.
Publication venue: Elsevier BV
Publication date: 11/06/2016
Social media has now become the de facto information source on real world events. The challenge, however, due to the high volume and velocity nature of social media streams, is in how to follow all posts pertaining to a given event over time – a task referred to as story detection. Moreover, there are often several different stories pertaining to a given event, which we refer to as sub-stories and the corresponding task of their automatic detection – as sub-story detection. This paper proposes hierarchical Dirichlet processes (HDP), a probabilistic topic model, as an effective method for automatic sub-story detection. HDP can learn sub-topics associated with sub-stories which enables it to handle subtle variations in sub-stories. It is compared with state-of-the-art story detection approaches based on locality sensitive hashing and spectral clustering. We demonstrate the superior performance of HDP for sub-story detection on real world Twitter data sets using various evaluation measures. The ability of HDP to learn sub-topics helps it to recall the sub-stories with high precision. This has resulted in an improvement of up to 60% in the F-score performance of HDP based sub-story detection approach compared to standard story detection approaches. A similar performance improvement is also seen using an information theoretic evaluation measure proposed for the sub-story detection task. Another contribution of this paper is in demonstrating that considering the conversational structures within the Twitter stream can bring up to 200% improvement in sub-story detection performance

arXiv.org e-Print Archive

Elsevier - Publisher Connector

Crossref

White Rose Research Online

Twitter-scale New Event Detection via K-term Hashing

Author: Dominik Wurzer
Miles Osborne
Victor Lavrenko
Publication venue: Association for Computational Linguistics (ACL)
Publication date: 01/01/2015
CiteSeerX

Crossref

Distance, Time and Terms in First Story Detection

Author: Wang Fei
Publication venue: Dublin Institute of Technology
Publication date: 01/01/2019
First Story Detection (FSD) is an important application of online novelty detection within Natural Language Processing (NLP). Given a stream of documents, or stories, about news events in a chronological order, the goal of FSD is to identify the very first story for each event. While a variety of NLP techniques have been applied to the task, FSD remains challenging because it is still not clear what is the most crucial factor in defining the “story novelty”. Given these challenges, the thesis addressed in this dissertation is that the notion of novelty in FSD is multi-dimensional. To address this, the work presented has adopted a three-dimensional analysis of the relative qualities of FSD systems and gone on to propose a specific method that we argue significantly improves understanding and performance of FSD. FSD is of course not a new problem type; therefore, our first dimension of analysis consists of a systematic study of detection models for first story detection and the distances that are used in the detection models for defining novelty. This analysis presents a tripartite categorisation of the detection models based on the end points of the distance calculation. The study also considers issues of document representation explicitly, and shows that even in a world driven by distributed representations, the nearest neighbour detection model with TF-IDF document representations still achieves the state-of-the-art performance for FSD. We provide analysis of this important result and suggest potential causes and consequences. Events are introduced and change at a relatively slow rate relative to the frequency at which words come in and out of usage on a document by document basis. Therefore we argue that the second dimension of analysis should focus on the temporal aspects of FSD. Here we are concerned with not only the temporal nature of the detection process, e.g., the time/history window over the stories in the data stream, but also the processes that underpin the representational updates that underpin FSD. Through a systematic investigation of static representations, and also dynamic representations with both low and high update frequencies, we show that while a dynamic model unsurprisingly outperforms static models, the dynamic model in fact stops improving but stays steady when the update frequency gets higher than a threshold. Our third dimension of analysis moves across to the particulars of lexical content, and critically the effect of terms in the definition of story novelty. We provide a specific analysis of how terms are represented for FSD, including the distinction between static and dynamic document representations, and the effect of out-of-vocabulary terms and the specificity of a word in the calculation of the distance. Our investigation showed that term distributional similarity rather than scale of common terms across the background and target corpora is the most important factor in selecting background corpora for document representations in FSD. More crucially, in this work the simple idea of the new terms emerged as a vital factor in defining novelty for the first story.
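
The nearest neighbour detection model with TF-IDF representations mentioned in the abstract above can be illustrated with a minimal sketch. This is not the thesis's implementation: the whitespace tokenizer, the smoothed IDF formula, the use of cosine distance, and the function names (`tokenize`, `cosine_distance`, `first_story_scores`) are all assumptions chosen for illustration. Each incoming story is scored by its distance to the closest earlier story, with document frequencies updated as the stream is read.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: plain lowercase whitespace tokenization, for illustration only.
    return text.lower().split()


def cosine_distance(a, b):
    # a and b are sparse vectors stored as {term: weight} dictionaries.
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def first_story_scores(stream):
    """Return one novelty score per story: the cosine distance to its
    nearest earlier story (1.0 for the very first story)."""
    doc_freq = Counter()   # number of stories seen so far that contain each term
    history = []           # token lists of all earlier stories
    scores = []
    for n, text in enumerate(stream, start=1):
        tokens = tokenize(text)
        doc_freq.update(set(tokens))

        def tfidf(toks):
            # Assumed smoothed IDF, recomputed from the stories seen so far.
            tf = Counter(toks)
            return {t: c * math.log(1.0 + n / doc_freq[t]) for t, c in tf.items()}

        current = tfidf(tokens)
        if history:
            score = min(cosine_distance(current, tfidf(old)) for old in history)
        else:
            score = 1.0
        scores.append(score)
        history.append(tokens)
    return scores
```

A story would then be flagged as a first story when its score exceeds a chosen threshold; the threshold is a tuning parameter and no value is given in the abstract. Re-weighting every earlier story at each step keeps the sketch short but is quadratic in the stream length, so it is only suitable for small examples.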

Arrow@TUDublin