7 research outputs found

    Real-Time Adaptive Event Detection in Astronomical Data Streams

    A new generation of observational science instruments is dramatically increasing collected data volumes in a range of fields. These instruments include the Square Kilometre Array (SKA), the Large Synoptic Survey Telescope (LSST), terrestrial sensor networks, and NASA satellites participating in "decadal survey" missions. Their unprecedented coverage and sensitivity will likely reveal wholly new categories of unexpected and transient events. Commensal methods passively analyze these data streams, recognizing anomalous events of scientific interest and reacting in real time. Here, the authors report on a case example: the Very Long Baseline Array Fast Transients Experiment (V-FASTR), an ongoing commensal experiment at the Very Long Baseline Array (VLBA) that uses online adaptive pattern recognition to search for anomalous fast radio transients. V-FASTR triages a millisecond-resolution stream of data and promotes candidate anomalies for further offline analysis. It tunes detection parameters in real time, injecting synthetic events to continually retrain itself for optimum performance. This self-tuning approach retains sensitivity to weak signals while adapting to changing instrument configurations and noise conditions. The system has operated since July 2011, making it the longest-running real-time commensal radio transient experiment to date.
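    The abstract's central mechanism (a detector that injects synthetic events into the live stream and retunes its own detection threshold from the recovery rate) can be sketched in a few lines. The Python fragment below is only an illustrative toy: the signal model, the threshold-update rule, the target recall, and all names are assumptions for illustration, not V-FASTR's actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

def detect(block, threshold):
    """Flag samples whose signal-to-noise ratio exceeds the current threshold (toy detector)."""
    snr = (block - block.mean()) / (block.std() + 1e-12)
    return snr > threshold

def inject_pulse(block, amplitude=8.0):
    """Add one synthetic pulse at a random position; return the modified block and the position."""
    idx = rng.integers(len(block))
    block = block.copy()
    block[idx] += amplitude
    return block, idx

threshold = 6.0           # starting S/N threshold (assumed value)
target_recall = 0.9       # desired fraction of injected pulses recovered (assumed value)
recovered = injected = 0

for _ in range(200):                      # stand-in for a stream of data blocks
    block = rng.normal(size=4096)         # noise-only block at millisecond resolution
    if rng.random() < 0.2:                # occasionally inject a synthetic event
        block, idx = inject_pulse(block)
        injected += 1
        recovered += bool(detect(block, threshold)[idx])
        # crude feedback rule: tighten the threshold while recall is healthy, loosen it otherwise
        threshold += 0.05 if recovered / injected > target_recall else -0.05
    candidates = np.flatnonzero(detect(block, threshold))
    # in a real commensal system, `candidates` would be promoted for offline analysis

print(f"final threshold: {threshold:.2f}, injection recall: {recovered}/{injected}")
```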

    An efficient framework for mining outlying aspects

    In the era of big data, an immense volume of information is being continuously generated, and it is common to encounter errors or anomalies within datasets. These anomalies can arise from system malfunctions or human errors, producing data points that deviate from expected patterns or values. Anomaly detection algorithms have been developed to identify such anomalies effectively. However, these algorithms often fall short of explaining why a particular data point is considered an anomaly: they cannot identify the specific feature subset(s) in which a data point significantly differs from the majority of the data. To address this limitation, researchers have recently turned their attention to a new research area called outlying aspect mining, which focuses on discovering the feature subset(s), known as aspects or subspaces, in which anomalous data points deviate significantly from the rest of the data. Outlying aspect mining algorithms aim to provide a more detailed understanding of the characteristics that make a data point anomalous.

    Although outlying aspect mining is an emerging area of research, only a few studies have been published so far. One of the key challenges in this field is scaling these algorithms to large datasets, characterised by either a large number of data points or high dimensionality. Many existing outlying aspect mining algorithms are not well suited to such datasets, as they exhaustively enumerate all possible subspaces and use density- or distance-based anomaly scores to rank them. As a result, most of these algorithms struggle to handle datasets with more than 20 dimensions. Addressing this scalability issue and developing efficient algorithms for outlying aspect mining in large datasets remain active areas of research. The ability to identify and understand the specific feature subsets contributing to anomalies in big data holds great potential for applications such as fraud detection, network intrusion detection, and anomaly-based decision support systems.

    Existing outlying aspect mining methods suffer from three main problems. First, their scoring measures often rely on distance- or density-based calculations, which are biased by subspace dimensionality: as the dimensionality of a subspace increases, density tends to decrease, making it difficult to assess the outlyingness of a data point within a specific subspace accurately. Second, distance- and density-based measures require pairwise distance computations, which makes them computationally expensive on large-scale datasets containing millions of data points; moreover, existing work applies Z-score normalisation to make density-based scores dimensionally unbiased, adding further computational overhead to already expensive measures. Third, existing methods use brute-force subspace search, and as data dimensionality grows the number of candidate subspaces increases exponentially, quickly exceeding available computational resources. This research project aims to solve these challenges by developing efficient and effective methods for mining outlying aspects in high-dimensional and large datasets. I have explored and designed different scoring measures to quantify the outlyingness of a given data point in each subspace.
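    To make the dimensionality bias and normalisation cost described above concrete, here is a small Python sketch (not code from the thesis): a naive kernel-density score shrinks as the subspace grows, and the Z-score normalisation that restores comparability requires scoring every point, which is exactly the overhead criticised above. The bandwidth, data, and function names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 10))        # synthetic data: 1000 points, 10 features
q = X[0]                               # query point whose outlyingness we want to score

def kde_density(X, q, subspace, bandwidth=1.0):
    """Naive Gaussian-kernel density of q in the chosen feature subset (lower = more outlying)."""
    d = np.linalg.norm(X[:, subspace] - q[subspace], axis=1)
    return np.mean(np.exp(-0.5 * (d / bandwidth) ** 2))

# Raw densities shrink as the subspace grows, so they are not comparable across dimensionalities.
for dims in ([0], [0, 1], [0, 1, 2, 3], list(range(8))):
    print(f"{len(dims)}-D subspace: raw density = {kde_density(X, q, dims):.5f}")

def z_normalised(X, q, subspace):
    """Z-score-normalised density: compares q's density against every other point's density."""
    scores = np.array([kde_density(X, X[i], subspace) for i in range(len(X))])
    return (kde_density(X, q, subspace) - scores.mean()) / scores.std()

# Normalisation restores comparability across subspaces, but it needs a density estimate for
# every point, i.e. distance computations over the whole dataset -- the extra cost noted above.
print("normalised:", z_normalised(X, q, [0, 1]), z_normalised(X, q, list(range(8))))
```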
The effectiveness and efficiency of the proposed measures have been verified through extensive experiments on synthetic and real-world datasets. To address the first problem, this thesis identifies and analyses the conditions under which Z-score-based normalisation fails to find the most outlying aspects, and proposes two approaches, HMass and sGrid++. Both measures are dimensionally unbiased in their raw form, meaning they require no additional normalisation. sGrid++ is a simpler version of sGrid that is efficient, effective, and dimensionally unbiased, and it does not require Z-score normalisation. HMass is a simple yet effective and efficient histogram-based solution for ranking the outlying aspects of a given query in each subspace; in addition to detecting anomalies, it explains why the points are anomalous. Neither sGrid++ nor HMass requires pairwise calculations of the kind used by distance- or density-based measures, so both are computationally faster, which addresses the second issue with existing work. The effectiveness and efficiency of sGrid++ and HMass are evaluated on synthetic and real-world datasets. In addition, I present an application of outlying aspect mining in the cybersecurity domain. To tackle the third problem, this thesis proposes an efficient and effective outlying aspect mining framework named OIMiner (Outlying-Inlying Aspect Miner). It introduces a new scoring measure for computing the outlying degree, the Simple Isolation score using Nearest Neighbor Ensemble (SiNNE), which not only detects outliers but also explains why the selected point is an outlier. SiNNE is dimensionally unbiased in its raw form, so its scores can be compared directly across subspaces of different dimensionality without any normalisation. Our experimental results on synthetic and publicly available real-world datasets show that (i) SiNNE produces results that are better than, or at least comparable to, existing scores, and (ii) it improves the run time of the existing beam-search-based outlying aspect mining algorithm by at least two orders of magnitude. SiNNE allows the existing outlying aspect mining algorithm to run on datasets with hundreds of thousands of instances and thousands of dimensions, which was not possible before.
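    The abstract names HMass, sGrid++, SiNNE, and a beam-search-based miner without defining them, so the sketch below is only a rough illustration of the general recipe: a cheap per-subspace outlyingness score driven by a beam search over feature subsets. The grid-cell score, its per-dimension correction, the search parameters, and the toy data are all assumptions, not the thesis's actual algorithms.

```python
import numpy as np

def cell_mass(X, q, subspace, bins=8):
    """Smoothed fraction of points sharing q's grid cell in the chosen subspace, taken to the
    power 1/|subspace| as a crude per-dimension correction (a stand-in for the 'dimensionally
    unbiased' property described above, not the actual HMass/sGrid++/SiNNE definitions)."""
    H, edges = np.histogramdd(X[:, list(subspace)], bins=bins)
    idx = tuple(
        int(np.clip(np.searchsorted(e, q[f], side="right") - 1, 0, bins - 1))
        for e, f in zip(edges, subspace)
    )
    mass = (H[idx] + 1) / (len(X) + bins ** len(subspace))
    return mass ** (1.0 / len(subspace))

def beam_search(X, q, max_dim=3, beam_width=5):
    """Greedy beam search for the feature subset in which q looks most outlying (lowest mass)."""
    n_features = X.shape[1]
    beam = [((f,), cell_mass(X, q, (f,))) for f in range(n_features)]
    beam = sorted(beam, key=lambda t: t[1])[:beam_width]
    best = beam[0]
    for _ in range(max_dim - 1):
        candidates = {}
        for subspace, _ in beam:
            for f in range(n_features):
                if f not in subspace:
                    grown = tuple(sorted(subspace + (f,)))
                    if grown not in candidates:
                        candidates[grown] = cell_mass(X, q, grown)
        beam = sorted(candidates.items(), key=lambda t: t[1])[:beam_width]
        if beam and beam[0][1] < best[1]:
            best = beam[0]
    return best

rng = np.random.default_rng(2)
X = rng.normal(size=(2000, 8))
X[:, 5] = X[:, 2] + 0.05 * rng.normal(size=2000)   # features 2 and 5 are strongly correlated
X[0, :] = 0.0                                      # point 0: unremarkable in every single feature
X[0, 2], X[0, 5] = 1.0, -1.0                       # ...but it breaks the 2-5 correlation
subspace, score = beam_search(X, X[0])
print("most outlying aspect of point 0:", subspace, "score:", score)
```

    On this toy data the pair of features (2, 5) should come out as the most outlying aspect, illustrating how an anomaly can be invisible in every individual feature yet obvious in a specific subspace.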

    On high-dimensional support recovery and signal detection

    Real Time Adaptive Event Detection in Astronomical Data Streams: Lessons from the Very Long Baseline Array

    A new generation of observational science instruments is dramatically increasing collected data volumes in a range of fields. These instruments include the Square Kilometre Array (SKA), Large Synoptic Survey Telescope (LSST), terrestrial sensor networks, and NASA satellites participating in "decadal survey" missions. Their unprecedented coverage and sensitivity will likely reveal wholly new categories of unexpected and transient events. Commensal methods passively analyze these data streams, recognizing anomalous events of scientific interest and reacting in real time. We report on a case example: V-FASTR, an ongoing commensal experiment at the Very Long Baseline Array (VLBA) that uses online adaptive pattern recognition to search for anomalous fast radio transients. V-FASTR triages a millisecond-resolution stream of data and promotes candidate anomalies for further offline analysis. It tunes detection parameters in real time, injecting synthetic events to continually retrain itself for optimum performance. This self-tuning approach retains sensitivity to weak signals while adapting to changing instrument configurations and noise conditions. The system has operated since July 2011, making it the longest-running real-time commensal radio transient experiment to date.