118 research outputs found

Exploring Online Novelty Detection Using First Story Detection Models

Author: Kelleher John D.
Ross Robert J.
Wang Fei
Publication venue: Dublin Institute of Technology
Publication date: 09/11/2018

Online novelty detection is an important technology in understanding and exploiting streaming data. One application of online novelty detection is First Story Detection (FSD), which attempts to find the very first story about a new topic, e.g. the first news report discussing the “Beast from the East” hitting Ireland. Although hundreds of FSD models have been developed, the vast majority of these only aim at improving detection performance on some specific dataset, and very few focus on the insight of novelty itself. We believe that online novelty detection, framed as an unsupervised learning problem, always requires a clear definition of novelty. Indeed, we argue the definition of novelty is the key issue in designing a good detection model. Within the context of FSD, we first categorise online novelty detection models into three main categories, based on different definitions of novelty scores, and then compare the performance of these model categories in different feature spaces. Our experimental results show that the challenge of FSD varies across novelty scores (and corresponding model categories); and, furthermore, that the detection of novelty in the very popular Word2Vec feature space is more difficult than in a normal frequency-based feature space because of a loss of word specificity.

Arrow@TUDublin

Online Novelty Detection System: One-Class Classification of Systemic Operation

Author: Bowen Ryan M
Publication venue: RIT Scholar Works
Publication date: 04/01/2016

Presented is an Online Novelty Detection System (ONDS) that uses Gaussian Mixture Models (GMMs) and one-class classification techniques to identify novel information from multivariate time-series data. Multiple data preprocessing methods are explored, and feature vectors are formed from frequency components obtained by the Fast Fourier Transform (FFT) and Welch's method of estimating Power Spectral Density (PSD). The number of features is reduced by using bandpower schemes and Principal Component Analysis (PCA). The Expectation Maximization (EM) algorithm is used to learn parameters for GMMs on feature vectors collected from only normal operational conditions. One-class classification is achieved by thresholding likelihood values relative to statistical limits. The ONDS is applied to two applications from different domains. The first application uses the ONDS to evaluate the systemic health of Radio Frequency (RF) power generators. Four different models of RF power generators and over 400 unique units are tested; an average robust true positive rate of 94.76% is achieved, and the best specificity is reported as 86.56%. The second application uses the ONDS to identify novel events from equine motion data and assess equine distress. The ONDS correctly identifies target behaviors as novel events with 97.5% accuracy. Algorithm implementation for both methods is evaluated within embedded systems and demonstrates execution times appropriate for online use.
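The one-class step described in this abstract (fit a density on normal-operation data only, then flag samples whose likelihood falls below a statistical limit) can be sketched roughly as follows. This is a simplified illustration, not the thesis's implementation: a single diagonal Gaussian stands in for the GMM fitted by EM, the feature vectors are synthetic rather than FFT/PSD features, and the function names and the cut-off `k=3.0` are assumptions made for the example.

```python
import math
import random


def fit_diag_gaussian(rows):
    """Estimate per-feature mean and variance from normal-operation vectors."""
    n = len(rows)
    dims = len(rows[0])
    means = [sum(r[d] for r in rows) / n for d in range(dims)]
    variances = [
        max(sum((r[d] - means[d]) ** 2 for r in rows) / n, 1e-9)
        for d in range(dims)
    ]
    return means, variances


def log_likelihood(x, means, variances):
    """Log-density of x under the fitted diagonal Gaussian."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, means, variances)
    )


def fit_threshold(rows, means, variances, k=3.0):
    """Limit = mean - k * std of training log-likelihoods (k is assumed)."""
    lls = [log_likelihood(r, means, variances) for r in rows]
    mu = sum(lls) / len(lls)
    sd = math.sqrt(sum((ll - mu) ** 2 for ll in lls) / len(lls))
    return mu - k * sd


def is_novel(x, means, variances, threshold):
    """A sample is novel when its log-likelihood falls below the limit."""
    return log_likelihood(x, means, variances) < threshold


# Synthetic "normal operation" data: two features with fixed means and spreads.
rng = random.Random(0)
normal = [[rng.gauss(0.0, 1.0), rng.gauss(5.0, 0.5)] for _ in range(500)]
means, variances = fit_diag_gaussian(normal)
threshold = fit_threshold(normal, means, variances)
```

Replacing the single Gaussian with a mixture would change only the density model; the thresholding of likelihood values stays the same.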

RIT Scholar Works

Distance, Time and Terms in First Story Detection

Author: Wang Fei
Publication venue: Dublin Institute of Technology
Publication date: 01/01/2019

First Story Detection (FSD) is an important application of online novelty detection within Natural Language Processing (NLP). Given a stream of documents, or stories, about news events in a chronological order, the goal of FSD is to identify the very first story for each event. While a variety of NLP techniques have been applied to the task, FSD remains challenging because it is still not clear what is the most crucial factor in defining the “story novelty”. Given these challenges, the thesis addressed in this dissertation is that the notion of novelty in FSD is multi-dimensional. To address this, the work presented has adopted a three dimensional analysis of the relative qualities of FSD systems and gone on to propose a specific method that we argue significantly improves understanding and performance of FSD. FSD is of course not a new problem type; therefore, our first dimension of analysis consists of a systematic study of detection models for first story detection and the distances that are used in the detection models for defining novelty. This analysis presents a tripartite categorisation of the detection models based on the end points of the distance calculation. The study also considers issues of document representation explicitly, and shows that even in a world driven by distributed representations, the nearest neighbour detection model with TF-IDF document representations still achieves the state-of-the-art performance for FSD. We provide analysis of this important result and suggest potential causes and consequences. Events are introduced and change at a relatively slow rate relative to the frequency at which words come in and out of usage on a document by document basis. Therefore we argue that the second dimension of analysis should focus on the temporal aspects of FSD. Here we are concerned with not only the temporal nature of the detection process, e.g., the time/history window over the stories in the data stream, but also the processes that underpin the representational updates that underpin FSD. Through a systematic investigation of static representations, and also dynamic representations with both low and high update frequencies, we show that while a dynamic model unsurprisingly outperforms static models, the dynamic model in fact stops improving but stays steady when the update frequency gets higher than a threshold. Our third dimension of analysis moves across to the particulars of lexical content, and critically the effect of terms in the definition of story novelty. We provide a specific analysis of how terms are represented for FSD, including the distinction between static and dynamic document representations, and the effect of out-of-vocabulary terms and the specificity of a word in the calculation of the distance. Our investigation showed that term distributional similarity rather than scale of common terms across the background and target corpora is the most important factor in selecting background corpora for document representations in FSD. More crucially, in this work the simple idea of the new terms emerged as a vital factor in defining novelty for the first story.

Arrow@TUDublin

Update Frequency and Background Corpus Selection in Dynamic TF-IDF Models for First Story Detection

Author: B Schölkopf
M Davies
MA Pimentel
Y Goldberg
Publication venue: Dublin Institute of Technology
Publication date: 11/10/2019

First Story Detection (FSD) requires a system to detect the very first story that mentions an event from a stream of stories. Nearest neighbour-based models, using traditional term vector document representations like TF-IDF, currently achieve the state of the art in FSD. Because of its online nature, a dynamic term vector model that is incrementally updated during the detection process is usually adopted for FSD instead of a static model. However, very little research has investigated the selection of hyper-parameters and the background corpora for a dynamic model. In this paper, we analyse how a dynamic term vector model works for FSD, and investigate the impact of different update frequencies and background corpora on FSD performance. Our results show that dynamic models with high update frequencies outperform static models and dynamic models with low update frequencies; and that the FSD performance of dynamic models does not always increase with higher update frequencies, but instead reaches a steady state after some update frequency threshold is reached. In addition, we demonstrate that different background corpora have very limited influence on the dynamic models with high update frequencies in terms of FSD performance.
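As a rough illustration of the kind of model this abstract discusses, the sketch below scores each incoming story by its distance to the nearest earlier story under TF-IDF weighting, and refreshes the IDF statistics only every `update_every` stories. It is a minimal sketch under stated assumptions, not the paper's implementation: the class name, the smoothed IDF formula, the cosine-based novelty score, and the `update_every` parameter are all choices made for the example, and no background corpus is used.

```python
import math
from collections import Counter


class NearestNeighbourFSD:
    """Toy nearest-neighbour first story detector with a dynamic TF-IDF model."""

    def __init__(self, update_every=1):
        self.update_every = update_every  # stories between IDF refreshes
        self.doc_freq = Counter()         # live document frequencies
        self.n_docs = 0
        self.idf_df = Counter()           # snapshot actually used for weighting
        self.idf_n = 0
        self.history = []                 # term counts of earlier stories

    def _vector(self, counts):
        """L2-normalised TF-IDF vector built from the current IDF snapshot."""
        vec = {
            term: tf * (math.log((self.idf_n + 1) / (self.idf_df[term] + 1)) + 1.0)
            for term, tf in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        return {term: w / norm for term, w in vec.items()}

    def process(self, tokens):
        """Return a novelty score in [0, 1], then add the story to the history."""
        counts = Counter(tokens)
        query = self._vector(counts)
        best = 0.0
        for past in self.history:
            other = self._vector(past)
            sim = sum(w * other.get(t, 0.0) for t, w in query.items())
            best = max(best, sim)
        self.history.append(counts)
        self.doc_freq.update(counts.keys())
        self.n_docs += 1
        if self.n_docs % self.update_every == 0:
            # Refresh the IDF snapshot; a larger update_every means rarer updates.
            self.idf_df = Counter(self.doc_freq)
            self.idf_n = self.n_docs
        return 1.0 - best


detector = NearestNeighbourFSD(update_every=1)
stream = [
    "storm hits ireland snow",
    "snow storm ireland closes schools",
    "election results announced tonight",
]
scores = [detector.process(story.split()) for story in stream]
```

A story would be declared a first story when its score exceeds a chosen threshold; setting `update_every` to a large value approximates a static model, while `update_every=1` updates after every story.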

Arrow@TUDublin

Time Series Data Mining Algorithms for Identifying Short RNA in Arabidopsis thaliana

Author: Bagnall Anthony
Moxon Simon
Studholme David
Publication venue: University of East Anglia
Publication date: 01/01/2007

The class of molecules called short RNAs (sRNAs) is known to play a key role in gene regulation. They are typically sequences of between 21 and 25 nucleotides in length. The identification, clustering and classification of sRNA has recently become the focus of much research activity. The basic problem involves detecting regions of interest on the chromosome where the pattern of candidate matches is somehow unusual. Currently, there are no published algorithms for detecting regions of interest, and the unpublished methods that we are aware of involve bespoke rule-based systems designed for a specific organism. Work in this very new field has understandably focused on the outcomes rather than the methods used to obtain the results. In this paper we propose two generic approaches that place the specific biological problem in the wider context of time series data mining problems. Both methods are based on treating the occurrences on a chromosome, or “hit count” data, as a time series, then running a sliding window along a chromosome and measuring unusualness. This formulation means we can treat finding unusual areas of candidate RNA activity as a variety of time series anomaly detection problem. The first set of approaches is model based. We specify a null hypothesis distribution for not being an sRNA, then estimate the p-values along the chromosome. The second approach is instance based. We identify some typical shapes from known sRNA, then use dynamic time warping and Fourier transform based distance to measure how closely the candidate series matches. We demonstrate that these methods can find known sRNA on Arabidopsis thaliana chromosomes and illustrate the benefits of the added information provided by these algorithms.
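The model-based idea in this abstract (slide a window along the hit-count series and compute a p-value under a null distribution) might look something like the sketch below. The abstract does not say which null distribution is used, so the Poisson null, the background `rate`, the window size, and the function names here are all assumptions made for illustration.

```python
import math


def poisson_pmf(i, lam):
    """P(X = i) for X ~ Poisson(lam), computed in log space for stability."""
    return math.exp(-lam + i * math.log(lam) - math.lgamma(i + 1))


def poisson_sf(k, lam):
    """Upper tail P(X >= k) for X ~ Poisson(lam), with lam > 0."""
    if k <= 0:
        return 1.0
    if k <= lam:
        # The tail is large here, so 1 - CDF loses no meaningful precision.
        return 1.0 - sum(poisson_pmf(i, lam) for i in range(k))
    # Above the mean the terms shrink monotonically; sum until negligible.
    total, i = 0.0, k
    term = poisson_pmf(i, lam)
    while term > total * 1e-17:
        total += term
        i += 1
        term = poisson_pmf(i, lam)
    return total


def window_p_values(hits, window, rate):
    """p-value of each window's hit total under a Poisson(window * rate) null."""
    lam = window * rate
    total = sum(hits[:window])
    p_values = [poisson_sf(total, lam)]
    for start in range(1, len(hits) - window + 1):
        # Slide by one position: add the entering count, drop the leaving one.
        total += hits[start + window - 1] - hits[start - 1]
        p_values.append(poisson_sf(total, lam))
    return p_values


# Made-up hit counts with a burst of activity in the middle.
hits = [0, 1, 0, 0, 1, 0, 6, 7, 5, 0, 1, 0]
p_values = window_p_values(hits, window=3, rate=0.3)
most_unusual = min(range(len(p_values)), key=p_values.__getitem__)
```

Windows with very small p-values are the candidate regions of interest; in practice the many overlapping windows would also call for some multiple-testing correction, which this sketch omits.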

University of East Anglia digital repository

Contextual Outlier Interpretation

Author: Hu Xia
Liu Ninghao
Shin Donghwa
Publication venue
Publication date: 04/05/2018

Outlier detection plays an essential role in many data-driven applications to identify isolated instances that are different from the majority. While many statistical learning and data mining techniques have been used for developing more effective outlier detection algorithms, the interpretation of detected outliers has not received much attention. Interpretation is becoming increasingly important to help people trust and evaluate the developed models by providing intrinsic reasons why certain outliers are chosen. It is difficult, if not impossible, to simply apply feature selection for explaining outliers due to the distinct characteristics of various detection models, complicated structures of data in certain applications, and imbalanced distribution of outliers and normal instances. In addition, the role of the contrastive contexts in which outliers are located, as well as the relation between outliers and contexts, is usually overlooked in interpretation. To tackle the issues above, in this paper, we propose a novel Contextual Outlier INterpretation (COIN) method to explain the abnormality of existing outliers spotted by detectors. The interpretability for an outlier is achieved from three aspects: outlierness score, attributes that contribute to the abnormality, and contextual description of its neighborhoods. Experimental results on various types of datasets demonstrate the flexibility and effectiveness of the proposed framework compared with existing interpretation approaches.

arXiv.org e-Print Archive
