3,858 research outputs found
Recommended from our members
Real-time feature selection technique with concept drift detection using adaptive micro-clusters for data stream mining
Data streams are unbounded, sequential data instances that are generated with high velocity. Classifying sequential data instances is a very challenging problem in machine learning with applications in network intrusion detection, financial markets and applications requiring real-time sensor-networks-based situation assessment. Data stream classification is concerned with the automatic labelling of unseen instances from the stream in real-time. For this, the classifier needs to adapt to concept drifts and can only have a single pass through the data if the stream is fast moving. This research paper presents work on a real-time pre-processing technique, in particular feature tracking. The feature tracking technique is designed to improve Data Stream Mining (DSM) classification algorithms by enabling and optimising real-time feature selection. The technique is based on tracking adaptive statistical summaries of the data and class label distributions, known as Micro-Clusters. Currently the technique is able to detect concept drifts and identify which features have been influential in the drift.
Towards real-time feature tracking technique using adaptive micro-clusters
Data streams are unbounded, sequential data instances that are generated with high velocity. Classifying sequential data instances is a very challenging problem in machine learning with applications in network intrusion detection, financial markets and sensor networks. Data stream classification is concerned with the automatic labelling of unseen instances from the stream in real-time. For this, the classifier needs to adapt to concept drifts and can only have a single pass through the data if the stream is fast. This research paper presents our work on a real-time pre-processing technique, in particular a feature tracking technique that takes concept drift into consideration. The feature tracking technique is designed to improve Data Stream Mining (DSM) classification algorithms by enabling real-time feature selection. The technique is based on adaptive summaries of the data and class distributions, known as Micro-Clusters. Currently the technique is able to detect concept drift and identify which features have been involved.
Towards online concept drift detection with feature selection for data stream classification
Data Streams are unbounded, sequential data instances that are generated very rapidly. The storage, querying and mining of such rapid flows of data is computationally very challenging. Data Stream Mining (DSM) is concerned with the mining of such data streams in real-time using techniques that require only one pass through the data. DSM techniques need to be adaptive to reflect changes of the pattern encoded in the stream (concept drift). The relevance of features for a DSM classification task may change due to concept drifts and this paper describes the first step towards a concept drift detection method with online feature tracking capabilities
Data stream mining techniques: a review
A plethora of infinite data is generated from the Internet and other information sources. Analyzing this massive data in real-time and extracting valuable knowledge using different mining application platforms has been an area of interest for both research and industry. However, data stream mining faces challenges that distinguish it from traditional data mining. Recently, many studies have addressed the concerns of massive data mining problems and proposed several techniques that produce impressive results. In this paper, we review real-time clustering and classification mining techniques for data streams. We analyze the characteristics of data stream mining and discuss the challenges and research issues of data stream mining. Finally, we present some of the platforms for data stream mining.
Learning in Dynamic Data-Streams with a Scarcity of Labels
Analysing data in real-time is a natural and necessary progression from traditional data mining. However, real-time analysis presents additional challenges to batch-analysis; along with strict time and memory constraints, change is a major consideration. In a dynamic stream there is an assumption that the underlying process generating the stream is non-stationary and that concepts within the stream will drift and change over time. Adopting a false assumption that a stream is stationary will result in non-adaptive models degrading and eventually becoming obsolete. The challenge of recognising and reacting to change in a stream is compounded by the scarcity of labels problem. This refers to the very realistic situation in which the true class label of an incoming point is not immediately available (or will never be available) or in situations where manually labelling incoming points is prohibitively expensive. The goal of this thesis is to evaluate unsupervised learning as the basis for online classification in dynamic data-streams with a scarcity of labels. To realise this goal, a novel stream clustering algorithm based on the collective behaviour of ants (Ant Colony Stream Clustering (ACSC)) is proposed. This algorithm is shown to be faster and more accurate than comparative, peer stream-clustering algorithms while requiring fewer sensitive parameters. The principles of ACSC are extended in a second stream-clustering algorithm named Multi-Density Stream Clustering (MDSC). This algorithm has adaptive parameters and crucially, can track clusters and monitor their dynamic behaviour over time. A novel technique called a Dynamic Feature Mask (DFM) is proposed to "sit on top" of these stream-clustering algorithms and can be used to observe and track change at the feature level in a data stream. This Feature Mask acts as an unsupervised feature selection method allowing high-dimensional streams to be clustered.
Finally, data-stream clustering is evaluated as an approach to one-class classification and a novel framework (named COCEL: Clustering and One class Classification Ensemble Learning) for classification in dynamic streams with a scarcity of labels is described. The proposed framework can identify and react to change in a stream and hugely reduces the number of required labels (typically less than 0.05% of the entire stream)
Data Stream Clustering: A Review
The number of connected devices is steadily increasing and these devices
continuously generate data streams. Real-time processing of data streams is
arousing interest despite many challenges. Clustering is one of the most
suitable methods for real-time data stream processing, because it can be
applied with less prior information about the data and it does not need labeled
instances. However, data stream clustering differs from traditional clustering
in many aspects and it has several challenging issues. Here, we provide
information regarding the concepts and common characteristics of data streams,
such as concept drift, data structures for data streams, time window models and
outlier detection. We comprehensively review recent data stream clustering
algorithms and analyze them in terms of the base clustering technique,
computational complexity and clustering accuracy. A comparison of these
algorithms is given along with still open problems. We indicate popular data
stream repositories and datasets, stream processing tools and platforms. Open
problems about data stream clustering are also discussed. Comment: Has been accepted for publication in Artificial Intelligence Revie
Real-time pre-processing technique for drift detection, feature tracking, and feature selection using adaptive micro-clusters for data stream classification
Data streams are unbounded, sequential data instances that are generated with high velocity.
Data streams arrive online (i.e., instance by instance) and there is no control over the order
in which data instances arrive either within a data stream or across data streams. Classifying
sequential data instances is a challenging problem in machine learning with applications in
network intrusion detection, financial markets and sensor networks. The automatic labelling
of unseen instances from the stream in real-time is the main challenge that data stream classification
faces. For this, the classifier needs to adapt to concept drifts and can only have a
single pass through the data with a limited amount of memory if the stream is generating data
instances at a high velocity. Nowadays the focus of Data Stream Mining (DSM) lies in the
development of data mining algorithms rather than on pre-processing techniques. To the best
of the author's knowledge, at present, there are no developments for truly real-time feature selection
in a streaming setting. This research work presents a real-time pre-processing technique,
in particular, feature tracking in combination with concept drift detection. The feature tracking
is designed to improve DSM classification algorithms by enabling real-time feature selection.
The pre-processing technique is based on tracking adaptive statistical summaries of the data
and class label distributions, known as Micro-Clusters. Thus the three objectives of this research
were to develop a real-time pre-processing technique that can (1) detect a concept drift,
(2) identify features that were involved in concept drift and thus potentially change their relevance
and (3) build a real-time feature selection method based on the developments mentioned
above. The evaluation of the developed technique is based on artificial data streams with known
ground truth and real datasets with and without artificially induced concept drift (i.e., controlled
and uncontrolled real datasets). It was observed that the developed method for concept drift
detection did detect induced concept drifts very well compared with alternative concept drift
detection methods. Overall the research represents a first attempt to resolve real-time feature
selection for DSM tasks. It has been shown that the technique can indeed identify concept drift,
track features, and identify features that may have changed their relevance for the DSM task in
real-time. It has also been shown that the developed method for real-time feature selection can
improve the accuracy of data stream classification tasks
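
The abstract above describes Micro-Clusters only as adaptive statistical summaries, so the following is a minimal sketch of the general idea and not the author's method. It assumes a generic summary (count, per-feature linear sum and sum of squares) that is updated in a single pass, and a hypothetical rule that flags a feature when its recent mean moves more than `threshold` reference standard deviations. The names `MicroCluster`, `drifted_features` and `threshold` are illustrative assumptions.

```python
from dataclasses import dataclass, field
import math


@dataclass
class MicroCluster:
    """Running per-feature summary: count, linear sum, sum of squares.

    Hypothetical structure for illustration; not taken from the thesis.
    """
    dim: int
    n: int = 0
    ls: list = field(default_factory=list)  # linear sum per feature
    ss: list = field(default_factory=list)  # sum of squares per feature

    def __post_init__(self):
        self.ls = [0.0] * self.dim
        self.ss = [0.0] * self.dim

    def add(self, x):
        """Absorb one instance in O(dim) time without storing it."""
        self.n += 1
        for i, v in enumerate(x):
            self.ls[i] += v
            self.ss[i] += v * v

    def mean(self):
        return [s / self.n for s in self.ls]

    def variance(self):
        # Population variance E[x^2] - E[x]^2, clamped at zero against round-off.
        return [
            max(self.ss[i] / self.n - (self.ls[i] / self.n) ** 2, 0.0)
            for i in range(self.dim)
        ]


def drifted_features(reference, recent, threshold=3.0):
    """Return indices of features whose mean shifted by more than
    `threshold` reference standard deviations (an assumed, simple rule)."""
    ref_mean = reference.mean()
    ref_var = reference.variance()
    new_mean = recent.mean()
    flagged = []
    for i in range(reference.dim):
        std = math.sqrt(ref_var[i]) or 1e-12  # avoid division by zero
        if abs(new_mean[i] - ref_mean[i]) / std > threshold:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    reference, recent = MicroCluster(dim=2), MicroCluster(dim=2)
    for i in range(100):
        reference.add((i % 5, i % 5))
        recent.add((i % 5, i % 5 + 10))  # only feature 1 is shifted
    print(drifted_features(reference, recent))
```

Because each summary keeps only sums, memory stays constant regardless of stream length, which fits the single-pass, limited-memory setting the abstract describes. A real detector would also need a policy for ageing or resetting the reference summary.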
Process-Oriented Stream Classification Pipeline: A Literature Review
Featured Application: Nowadays, many applications and disciplines work on the basis of stream data. Common examples are the IoT sector (e.g., sensor data analysis), or video, image, and text analysis applications (e.g., in social media analytics or astronomy). With our work, we gather different approaches and terminology, and give a broad overview over the topic. Our main target groups are practitioners and newcomers to the field of data stream classification. Due to the rise of continuous data-generating applications, analyzing data streams has gained increasing attention over the past decades. A core research area in stream data is stream classification, which categorizes or detects data points within an evolving stream of observations. Areas of stream classification are diverse, ranging, e.g., from monitoring sensor data to analyzing a wide range of (social) media applications. Research in stream classification is related to developing methods that adapt to the changing and potentially volatile data stream. It focuses on individual aspects of the stream classification pipeline, e.g., designing suitable algorithm architectures, an efficient train and test procedure, or detecting so-called concept drifts. As a result of the many different research questions and strands, the field is challenging to grasp, especially for beginners. This survey explores, summarizes, and categorizes work within the domain of stream classification and identifies core research threads over the past few years. It is structured based on the stream classification process to facilitate coordination within this complex topic, including common application scenarios and benchmarking data sets. Thus, both newcomers to the field and experts who want to widen their scope can gain (additional) insight into this research area and find starting points and pointers to more in-depth literature on specific issues and research directions in the field.
Concept Drift Detection in Data Stream Mining: The Review of Contemporary Literature
Mining processes such as classification and clustering of progressive or dynamic data are a critical objective of information retrieval and knowledge discovery; they are particularly sensitive in data stream mining models because the type and dimensionality of the data can change significantly over a period. The influence of these changes on the mining process is termed concept drift. Concept drift, which occurs often in streaming data, causes unbalanced performance in the mining models adopted. Hence, mining models need to be strengthened to predict and analyse concept drift in order to achieve the best possible performance. The contemporary literature shows significant contributions to handling concept drift, which fall into supervised learning, unsupervised learning, and statistical assessment approaches. This manuscript contributes a detailed review of the contemporary concept-drift detection models described in recent literature. The contribution of the manuscript includes the nomenclature of the concept drift models and their impact of imbalanced data tuples