5,672 research outputs found

    Ant colony stream clustering: A fast density clustering algorithm for dynamic data streams

    Get PDF
    A data stream is a continuously arriving sequence of data and clustering data streams requires additional considerations to traditional clustering. A stream is potentially unbounded, data points arrive on-line and each data point can be examined only once. This imposes limitations on available memory and processing time. Furthermore, streams can be noisy and the number of clusters in the data and their statistical properties can change over time. This paper presents an on-line, bio-inspired approach to clustering dynamic data streams. The proposed Ant-Colony Stream Clustering (ACSC) algorithm is a density based clustering algorithm, whereby clusters are identified as high-density areas of the feature space separated by low-density areas. ACSC identifies clusters as groups of micro-clusters. The tumbling window model is used to read a stream and rough clusters are incrementally formed during a single pass of a window. A stochastic method is employed to find these rough clusters, this is shown to significantly speed the algorithm with only a minor cost to performance, as compared to a deterministic approach. The rough clusters are then refined using a method inspired by the observed sorting behaviour of ants. Ants pick-up and drop items based on the similarity with the surrounding items. Artificial ants sort clusters by probabilistically picking and dropping micro-clusters based on local density and local similarity. Clusters are summarised using their constituent micro-clusters and these summary statistics are stored offline. Experimental results show that the clustering quality of ACSC is scalable, robust to noise and favourable to leading ant-clustering and stream-clustering algorithms. It also requires fewer parameters and less computational time

    Finding and tracking multi-density clusters in an online dynamic data stream

    Get PDF
    The file attached to this record is the author's final peer reviewed version.Change is one of the biggest challenges in dynamic stream mining. From a data-mining perspective, adapting and tracking change is desirable in order to understand how and why change has occurred. Clustering, a form of unsupervised learning, can be used to identify the underlying patterns in a stream. Density-based clustering identifies clusters as areas of high density separated by areas of low density. This paper proposes a Multi-Density Stream Clustering (MDSC) algorithm to address these two problems; the multi-density problem and the problem of discovering and tracking changes in a dynamic stream. MDSC consists of two on-line components; discovered, labelled clusters and an outlier buffer. Incoming points are assigned to a live cluster or passed to the outlier buffer. New clusters are discovered in the buffer using an ant-inspired swarm intelligence approach. The newly discovered cluster is uniquely labelled and added to the set of live clusters. Processed data is subject to an ageing function and will disappear when it is no longer relevant. MDSC is shown to perform favourably to state-of-the-art peer stream-clustering algorithms on a range of real and synthetic data-streams. Experimental results suggest that MDSC can discover qualitatively useful patterns while being scalable and robust to noise

    Dynamic feature selection for clustering high dimensional data streams

    Get PDF
    open access articleChange in a data stream can occur at the concept level and at the feature level. Change at the feature level can occur if new, additional features appear in the stream or if the importance and relevance of a feature changes as the stream progresses. This type of change has not received as much attention as concept-level change. Furthermore, a lot of the methods proposed for clustering streams (density-based, graph-based, and grid-based) rely on some form of distance as a similarity metric and this is problematic in high-dimensional data where the curse of dimensionality renders distance measurements and any concept of “density” difficult. To address these two challenges we propose combining them and framing the problem as a feature selection problem, specifically a dynamic feature selection problem. We propose a dynamic feature mask for clustering high dimensional data streams. Redundant features are masked and clustering is performed along unmasked, relevant features. If a feature's perceived importance changes, the mask is updated accordingly; previously unimportant features are unmasked and features which lose relevance become masked. The proposed method is algorithm-independent and can be used with any of the existing density-based clustering algorithms which typically do not have a mechanism for dealing with feature drift and struggle with high-dimensional data. We evaluate the proposed method on four density-based clustering algorithms across four high-dimensional streams; two text streams and two image streams. In each case, the proposed dynamic feature mask improves clustering performance and reduces the processing time required by the underlying algorithm. Furthermore, change at the feature level can be observed and tracked

    Learning in Dynamic Data-Streams with a Scarcity of Labels

    Get PDF
    Analysing data in real-time is a natural and necessary progression from traditional data mining. However, real-time analysis presents additional challenges to batch-analysis; along with strict time and memory constraints, change is a major consideration. In a dynamic stream there is an assumption that the underlying process generating the stream is non-stationary and that concepts within the stream will drift and change over time. Adopting a false assumption that a stream is stationary will result in non-adaptive models degrading and eventually becoming obsolete. The challenge of recognising and reacting to change in a stream is compounded by the scarcity of labels problem. This refers to the very realistic situation in which the true class label of an incoming point is not immediately available (or will never be available) or in situations where manually labelling incoming points is prohibitively expensive. The goal of this thesis is to evaluate unsupervised learning as the basis for online classification in dynamic data-streams with a scarcity of labels. To realise this goal, a novel stream clustering algorithm based on the collective behaviour of ants (Ant Colony Stream Clustering (ACSC)) is proposed. This algorithm is shown to be faster and more accurate than comparative, peer stream-clustering algorithms while requiring fewer sensitive parameters. The principles of ACSC are extended in a second stream-clustering algorithm named Multi-Density Stream Clustering (MDSC). This algorithm has adaptive parameters and crucially, can track clusters and monitor their dynamic behaviour over time. A novel technique called a Dynamic Feature Mask (DFM) is proposed to ``sit on top’’ of these stream-clustering algorithms and can be used to observe and track change at the feature level in a data stream. This Feature Mask acts as an unsupervised feature selection method allowing high-dimensional streams to be clustered. Finally, data-stream clustering is evaluated as an approach to one-class classification and a novel framework (named COCEL: Clustering and One class Classification Ensemble Learning) for classification in dynamic streams with a scarcity of labels is described. The proposed framework can identify and react to change in a stream and hugely reduces the number of required labels (typically less than 0.05% of the entire stream)

    ANTIDS: Self-Organized Ant-based Clustering Model for Intrusion Detection System

    Full text link
    Security of computers and the networks that connect them is increasingly becoming of great significance. Computer security is defined as the protection of computing systems against threats to confidentiality, integrity, and availability. There are two types of intruders: the external intruders who are unauthorized users of the machines they attack, and internal intruders, who have permission to access the system with some restrictions. Due to the fact that it is more and more improbable to a system administrator to recognize and manually intervene to stop an attack, there is an increasing recognition that ID systems should have a lot to earn on following its basic principles on the behavior of complex natural systems, namely in what refers to self-organization, allowing for a real distributed and collective perception of this phenomena. With that aim in mind, the present work presents a self-organized ant colony based intrusion detection system (ANTIDS) to detect intrusions in a network infrastructure. The performance is compared among conventional soft computing paradigms like Decision Trees, Support Vector Machines and Linear Genetic Programming to model fast, online and efficient intrusion detection systems.Comment: 13 pages, 3 figures, Swarm Intelligence and Patterns (SIP)- special track at WSTST 2005, Muroran, JAPA

    Finding multi-density clusters in non-stationary data streams using an ant colony with adaptive parameters

    Get PDF
    The file attached to this record is the author's final peer reviewed version. The Publisher's final version can be found by following the DOI link.Density based methods have been shown to be an effective approach for clustering non-stationary data streams. The number of clusters does not need to be known a priori and density methods are robust to noise and changes in the statistical properties of the data. However, most density approaches require sensitive, data dependent parameters. These parameters greatly affect the clustering performance and in a dynamic stream a good set of parameters at time t are not necessarily the best at time t+1. Furthermore, these parameters are global and so restrict the algorithm to finding clusters of the same density. In this paper, we propose a density based algorithm with adaptive parameters which are local to each discovered cluster. The algorithm, denoted Ant Colony Multi-Density Clustering (ACMDC), uses artificial ants to form nests in dense areas of the data. As the ants move between nests, their collective memory is stored in the form of pheromone trails. Clusters are identified as groups of similar nests. The proposed algorithm is evaluated across a number of synthetic data streams containing overlapping and embedded multi-density clusters. The performance of the algorithm is shown to be favourable to a leading density based stream-clustering algorithm despite requiring no tunable parameters

    Swarm intelligence for clustering dynamic data sets for web usage mining and personalization.

    Get PDF
    Swarm Intelligence (SI) techniques were inspired by bee swarms, ant colonies, and most recently, bird flocks. Flock-based Swarm Intelligence (FSI) has several unique features, namely decentralized control, collaborative learning, high exploration ability, and inspiration from dynamic social behavior. Thus FSI offers a natural choice for modeling dynamic social data and solving problems in such domains. One particular case of dynamic social data is online/web usage data which is rich in information about user activities, interests and choices. This natural analogy between SI and social behavior is the main motivation for the topic of investigation in this dissertation, with a focus on Flock based systems which have not been well investigated for this purpose. More specifically, we investigate the use of flock-based SI to solve two related and challenging problems by developing algorithms that form critical building blocks of intelligent personalized websites, namely, (i) providing a better understanding of the online users and their activities or interests, for example using clustering techniques that can discover the groups that are hidden within the data; and (ii) reducing information overload by providing guidance to the users on websites and services, typically by using web personalization techniques, such as recommender systems. Recommender systems aim to recommend items that will be potentially liked by a user. To support a better understanding of the online user activities, we developed clustering algorithms that address two challenges of mining online usage data: the need for scalability to large data and the need to adapt cluster sing to dynamic data sets. To address the scalability challenge, we developed new clustering algorithms using a hybridization of traditional Flock-based clustering with faster K-Means based partitional clustering algorithms. We tested our algorithms on synthetic data, real VCI Machine Learning repository benchmark data, and a data set consisting of real Web user sessions. Having linear complexity with respect to the number of data records, the resulting algorithms are considerably faster than traditional Flock-based clustering (which has quadratic complexity). Moreover, our experiments demonstrate that scalability was gained without sacrificing quality. To address the challenge of adapting to dynamic data, we developed a dynamic clustering algorithm that can handle the following dynamic properties of online usage data: (1) New data records can be added at any time (example: a new user is added on the site); (2) Existing data records can be removed at any time. For example, an existing user of the site, who no longer subscribes to a service, or who is terminated because of violating policies; (3) New parts of existing records can arrive at any time or old parts of the existing data record can change. The user\u27s record can change as a result of additional activity such as purchasing new products, returning a product, rating new products, or modifying the existing rating of a product. We tested our dynamic clustering algorithm on synthetic dynamic data, and on a data set consisting of real online user ratings for movies. Our algorithm was shown to handle the dynamic nature of data without sacrificing quality compared to a traditional Flock-based clustering algorithm that is re-run from scratch with each change in the data. To support reducing online information overload, we developed a Flock-based recommender system to predict the interests of users, in particular focusing on collaborative filtering or social recommender systems. Our Flock-based recommender algorithm (FlockRecom) iteratively adjusts the position and speed of dynamic flocks of agents, such that each agent represents a user, on a visualization panel. Then it generates the top-n recommendations for a user based on the ratings of the users that are represented by its neighboring agents. Our recommendation system was tested on a real data set consisting of online user ratings for a set of jokes, and compared to traditional user-based Collaborative Filtering (CF). Our results demonstrated that our recommender system starts performing at the same level of quality as traditional CF, and then, with more iterations for exploration, surpasses CF\u27s recommendation quality, in terms of precision and recall. Another unique advantage of our recommendation system compared to traditional CF is its ability to generate more variety or diversity in the set of recommended items. Our contributions advance the state of the art in Flock-based 81 for clustering and making predictions in dynamic Web usage data, and therefore have an impact on improving the quality of online services
    • …
    corecore