661 research outputs found
Data Stream Clustering: A Review
The number of connected devices is steadily increasing, and these devices
continuously generate data streams. Real-time processing of data streams is
arousing interest despite many challenges. Clustering is one of the most
suitable methods for real-time data stream processing, because it can be
applied with less prior information about the data and it does not need labeled
instances. However, data stream clustering differs from traditional clustering
in many aspects and it has several challenging issues. Here, we provide
information regarding the concepts and common characteristics of data streams,
such as concept drift, data structures for data streams, time window models and
outlier detection. We comprehensively review recent data stream clustering
algorithms and analyze them in terms of the base clustering technique,
computational complexity and clustering accuracy. A comparison of these
algorithms is given. We indicate popular data
stream repositories and datasets, as well as stream processing tools and platforms. Open
problems in data stream clustering are also discussed.
Comment: Has been accepted for publication in Artificial Intelligence Revie
Data Stream Clustering: Challenges and Issues
Very large databases are required to store massive amounts of data that are
continuously inserted and queried. Analyzing huge data sets and extracting
valuable patterns is of interest to researchers in many applications. We can
identify two main groups of techniques for mining huge databases. One group
treats the data as a stream and applies mining techniques to it, whereas the
second group attempts to solve the problem directly with efficient algorithms.
Recently, many researchers have focused on data streams as an efficient
strategy for mining huge databases, instead of mining the entire database. The
main problem in data stream mining is that evolving data is more difficult to
detect with these techniques; therefore, unsupervised methods should be
applied. Clustering techniques, in particular, can lead us to discover hidden
information. In this survey, we try to clarify: first, the different problem
definitions related to data stream clustering in general; second, the specific
difficulties encountered in this field of research; third, the varying
assumptions, heuristics, and intuitions forming the basis of different
approaches; and fourth, how several prominent solutions tackle different
problems.
Index Terms: Data Stream, Clustering, K-Means, Concept drift
Comment: IMECS201
rEMM: Extensible Markov Model for Data Stream Clustering in R
Clustering streams of continuously arriving data has become an important application of data mining in recent years and efficient algorithms have been proposed by several researchers. However, clustering alone neglects the fact that data in a data stream is not only characterized by the proximity of data points which is used by clustering, but also by a temporal component. The extensible Markov model (EMM) adds the temporal component to data stream clustering by superimposing a dynamically adapting Markov chain. In this paper we introduce the implementation of the R extension package rEMM which implements EMM and we discuss some examples and applications.
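The idea in this abstract, a clustering of the stream with a Markov chain laid over the clusters, can be illustrated with a minimal Python sketch. This is not the rEMM package or its API. The class name `TinyEMM`, the `threshold` parameter, the nearest-center assignment rule, and the running-mean center update are all assumptions made for illustration.

```python
import math
from collections import defaultdict


class TinyEMM:
    """Toy threshold clustering with a first-order Markov chain over clusters.

    Hypothetical illustration only; it does not reproduce the rEMM package.
    """

    def __init__(self, threshold):
        self.threshold = threshold            # assumed distance cutoff for joining a cluster
        self.centers = []                     # one center per cluster
        self.counts = []                      # points absorbed by each cluster
        self.transitions = defaultdict(int)   # (from_cluster, to_cluster) -> count
        self.current = None                   # cluster of the previous point

    def _nearest(self, point):
        """Return (index, distance) of the closest center, or (None, inf)."""
        best, best_dist = None, math.inf
        for i, center in enumerate(self.centers):
            d = math.dist(point, center)
            if d < best_dist:
                best, best_dist = i, d
        return best, best_dist

    def update(self, point):
        """Assign one point to a cluster and record the temporal transition."""
        idx, dist = self._nearest(point)
        if idx is None or dist > self.threshold:
            # No cluster is close enough: open a new one (a new chain state).
            self.centers.append(tuple(point))
            self.counts.append(1)
            idx = len(self.centers) - 1
        else:
            # Move the center toward the point with a running mean.
            n = self.counts[idx] + 1
            self.centers[idx] = tuple(
                c + (p - c) / n for c, p in zip(self.centers[idx], point)
            )
            self.counts[idx] = n
        if self.current is not None:
            self.transitions[(self.current, idx)] += 1
        self.current = idx
        return idx

    def transition_probability(self, src, dst):
        """Estimate P(next = dst | current = src) from the observed counts."""
        total = sum(n for (s, _), n in self.transitions.items() if s == src)
        return self.transitions.get((src, dst), 0) / total if total else 0.0


model = TinyEMM(threshold=1.0)
stream = [(0, 0), (0.1, 0), (5, 5), (5.1, 5), (0, 0.1), (5, 5.1)]
labels = [model.update(p) for p in stream]
print(labels)                               # [0, 0, 1, 1, 0, 1]
print(model.transition_probability(1, 0))   # 0.5
```

The transition counts capture the order in which clusters are visited, which is the temporal information that clustering alone would discard.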
Gustafson-Kessel Algorithm for Evolving Data Stream Clustering
A simplified clustering algorithm that enables on-line
partitioning of data streams is proposed. The algorithm applies an
adaptive-distance metric to identify clusters with different shapes and
orientations. It is applicable to a wide range of practical evolving-system
applications such as diagnostics and prognostics, system identification,
real-time classification, and process quality monitoring and control.
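As an illustration of what an adaptive-distance metric means here, the sketch below computes the squared distance used in the standard batch Gustafson-Kessel formulation, d² = (x − v)ᵀ A (x − v) with A = (ρ det F)^(1/n) F⁻¹, for two-dimensional points. The abstract does not state the on-line update rules, so this shows only the distance itself. The function name `gk_distance_sq` and the restriction to 2-D are assumptions.

```python
def gk_distance_sq(x, center, cov, rho=1.0):
    """Squared Gustafson-Kessel distance for 2-D points.

    cov is the cluster covariance F as ((a, b), (c, d)); it is expected to be
    symmetric positive definite, although only the determinant is checked.
    rho is the cluster volume parameter.
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance must have a positive determinant")
    # Closed-form inverse of a 2x2 matrix.
    inv = ((d / det, -b / det), (-c / det, a / det))
    # (rho * det F)^(1/n) with n = 2 dimensions.
    scale = (rho * det) ** 0.5
    dx = (x[0] - center[0], x[1] - center[1])
    # Quadratic form dx^T F^{-1} dx.
    quad = (dx[0] * (inv[0][0] * dx[0] + inv[0][1] * dx[1])
            + dx[1] * (inv[1][0] * dx[0] + inv[1][1] * dx[1]))
    return scale * quad


# A cluster stretched along the first axis: points along that axis count as closer.
stretched = ((4.0, 0.0), (0.0, 1.0))
print(gk_distance_sq((2, 0), (0, 0), stretched))  # 2.0
print(gk_distance_sq((0, 2), (0, 0), stretched))  # 8.0
```

With an identity covariance the result reduces to the squared Euclidean distance, so the covariance is what lets clusters take different shapes and orientations.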
An Improved Differential Evolution Algorithm for Data Stream Clustering
Several algorithms have been implemented by researchers for clustering data streams. Most of these algorithms require the number of clusters (K) to be fixed by the user based on the input data and kept fixed throughout the clustering process. Choosing K has therefore been a difficulty in stream clustering. In this paper, we propose an efficient approach for data stream clustering that adopts an Improved Differential Evolution (IDE) algorithm. The IDE algorithm is a fast, powerful and efficient global optimization approach for automatic clustering. In our proposed approach, we additionally apply an entropy-based method for detecting concept drift in the data stream and thereby updating the clustering procedure online. We compare the proposed method with a Genetic Algorithm and find it to be a proficient optimization algorithm. In the evaluation, the proposed technique achieves an accuracy of 92.29%, a precision of 86.96%, a recall of 90.30% and an F-measure of 88.60%.
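The reported figures can be checked against each other with a short calculation. The sketch assumes that the F-measure in the abstract is the balanced F1 score, that is, the harmonic mean of precision and recall. The abstract does not state this, and the function name `f_measure` is illustrative.

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
f1 = f_measure(0.8696, 0.9030)
print(f"{f1:.2%}")  # 88.60%
```

Under that assumption, the reported precision and recall reproduce the reported F-measure.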
Benne: A Modular and Self-Optimizing Algorithm for Data Stream Clustering
In various real-world applications, ranging from the Internet of Things (IoT)
to social media and financial systems, data stream clustering is a critical
operation. This paper introduces Benne, a modular and highly configurable data
stream clustering algorithm designed to offer a nuanced balance between
clustering accuracy and computational efficiency. Benne distinguishes itself by
clearly demarcating four pivotal design dimensions: the summarizing data
structure, the window model for handling data temporality, the outlier
detection mechanism, and the refinement strategy for improving cluster quality.
This clear separation not only facilitates a granular understanding of the
impact of each design choice on the algorithm's performance but also enhances
the algorithm's adaptability to a wide array of application contexts. We
provide a comprehensive analysis of these design dimensions, elucidating the
challenges and opportunities inherent to each. Furthermore, we conduct a
rigorous performance evaluation of Benne, employing diverse configurations and
benchmarking it against existing state-of-the-art data stream clustering
algorithms. Our empirical results substantiate that Benne either matches or
surpasses competing algorithms in terms of clustering accuracy, processing
throughput, and adaptability to varying data stream characteristics. This
establishes Benne as a valuable asset for both practitioners and researchers in
the field of data stream mining.