Detecting change via competence model
In real-world applications, the concepts of interest are more likely to change over time than to remain stable, a phenomenon known as concept drift. This situation degrades the predictions of many learning algorithms, including case-based reasoning (CBR). When learning under concept drift, a critical issue is to identify "when" and "how" the concept changes. In this paper, we develop a competence-based empirical distance between case chunks and then propose a change detection method based on it. As the main contribution of our work, the change detection method provides a way to measure the distribution change of cases over an infinite domain through finite samples, and it requires no prior knowledge about the case distribution, which makes it more practical in real-world applications. Also, unlike many other change detection methods, ours not only detects changes in concepts but also quantifies and describes them. © 2010 Springer-Verlag
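The competence-based distance itself is developed in the paper; as a rough illustration of the general idea of detecting and quantifying distribution change from finite samples, a histogram-based empirical distance between two case chunks might look like the sketch below. The bin count, the threshold, and the histogram distance are illustrative choices, not the authors' competence-based measure.

```python
import random

def empirical_distance(chunk_a, chunk_b, bins=10):
    """Illustrative empirical distance between two one-dimensional case
    chunks: half the L1 distance between their normalized histograms over
    a shared bin grid (a total-variation-style estimate)."""
    lo = min(min(chunk_a), min(chunk_b))
    hi = max(max(chunk_a), max(chunk_b))
    width = (hi - lo) / bins or 1.0

    def hist(chunk):
        counts = [0] * bins
        for x in chunk:
            counts[min(int((x - lo) / width), bins - 1)] += 1
        return [c / len(chunk) for c in counts]

    pa, pb = hist(chunk_a), hist(chunk_b)
    return 0.5 * sum(abs(a - b) for a, b in zip(pa, pb))

def detect_change(chunk_a, chunk_b, threshold=0.2):
    """Flag a drift when the empirical distance exceeds a threshold; the
    distance itself quantifies how large the change is."""
    d = empirical_distance(chunk_a, chunk_b)
    return d > threshold, d

random.seed(0)
stable = detect_change([random.gauss(0, 1) for _ in range(500)],
                       [random.gauss(0, 1) for _ in range(500)])
drifted = detect_change([random.gauss(0, 1) for _ in range(500)],
                        [random.gauss(2, 1) for _ in range(500)])
```

Returning the distance alongside the boolean mirrors the abstract's point that the change is not only detected but also quantified.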
Neural visualization of network traffic data for intrusion detection
This study introduces and describes a novel intrusion detection system (IDS) called MOVCIDS (mobile visualization connectionist IDS). The system applies neural projection architectures to detect anomalous situations taking place in a computer network. Through its advanced visualization facilities, the proposed IDS provides an overview of the network traffic and identifies anomalous situations faced by computer networks, responding to the challenges presented by the volume, dynamics and diversity of the traffic, including novel (0-day) attacks. MOVCIDS offers a novel point of view in the field of IDSs by extracting the most interesting projections (based on fourth-order statistics; the kurtosis index) of a massive traffic dataset. These projections are then depicted through a functional and mobile visualization interface, providing visual information about the internal structure of the traffic data. The interface makes MOVCIDS accessible from any mobile device, giving network administrators greater accessibility and enabling continuous visualization, monitoring and supervision of computer networks. Additionally, a novel testing technique has been developed to evaluate MOVCIDS and other IDSs employing numerical datasets. To show the performance of the proposed IDS and to validate it, it has been tested in different real domains containing several attacks and anomalous situations. In addition, the importance of the temporal dimension in intrusion detection, and the ability of this IDS to process it, are emphasized in this work.
Acknowledgments: Junta de Castilla y Leon project BU006A08, Business intelligence for production within the framework of the Instituto Tecnologico de Castilla y Leon (ITCL) and the Agencia de Desarrollo Empresarial (ADE), and the Spanish Ministry of Education and Innovation project CIT-020000-2008-2. The authors would also like to thank the vehicle interior manufacturer, Grupo Antolin Ingenieria S.A., within the framework of the MAGNO2008-1028-CENIT Project funded by the Spanish Government.
Learning Concept Drift Using Adaptive Training Set Formation Strategy
We live in a dynamic world, where change is a part of everyday life. When there is a shift in the data, classification or prediction models need to adapt to the changes. In data mining, the phenomenon of change in the data distribution over time is known as concept drift. In this research, we propose an adaptive supervised learning methodology with delayed labeling. As part of this methodology, we introduce an adaptive training set formation algorithm called SFDL, which is based on selective training set formation. Our proposed solution is, to our knowledge, the first systematic training set formation approach that takes the delayed labeling problem into account. It can be used with any base classifier without changing the implementation or settings of that classifier. We test our implementation on synthetic and real datasets from various domains exhibiting different drift types (sudden, gradual, incremental, recurring) with different speeds of change. The experimental results confirm an improvement in classification accuracy over an ordinary classifier for all drift types. Our approach increases classification accuracy by 20% on average and by 56% in the best cases of our experiments, and it has never performed worse than the ordinary classifiers. Finally, a comparison study with four other related methods for dealing with changing user interest over time and handling recurring drift is performed. The results indicate the effectiveness of the proposed method over the other methods in terms of classification accuracy.
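The abstract does not spell out SFDL's selection rules, but the general shape of a selective training set formation step can be sketched as follows. The probe classifier, the consistency test, and the tiny nearest-centroid base learner are all hypothetical stand-ins; SFDL itself is classifier-agnostic.

```python
def nearest_centroid_fit(X, y):
    """Tiny stand-in base classifier (1-D nearest class mean); a selective
    formation strategy like SFDL can wrap any base learner unchanged."""
    groups = {}
    for xi, yi in zip(X, y):
        groups.setdefault(yi, []).append(xi)
    return {c: sum(v) / len(v) for c, v in groups.items()}

def nearest_centroid_predict(model, X):
    return [min(model, key=lambda c: abs(x - model[c])) for x in X]

def selective_training_set(history_X, history_y, new_X, new_y):
    """Illustrative selective formation step: train a probe model on the
    newest labeled chunk (labels arrive with delay), then keep only the
    historical examples it still classifies correctly, i.e. those that
    remain consistent with the current concept."""
    probe = nearest_centroid_fit(new_X, new_y)
    preds = nearest_centroid_predict(probe, history_X)
    kept = [(x, y) for x, y, p in zip(history_X, history_y, preds) if p == y]
    X = [x for x, _ in kept] + list(new_X)
    y = [y for _, y in kept] + list(new_y)
    return X, y

# Sudden drift that flips the labels: no historical example survives.
X1, y1 = selective_training_set([1, 2, -1, -2], [1, 1, 0, 0],
                                [1, 2, -1, -2], [0, 0, 1, 1])
# Stable concept: the whole history is retained alongside the new chunk.
X2, y2 = selective_training_set([1, 2, -1, -2], [1, 1, 0, 0],
                                [1, 2, -1, -2], [1, 1, 0, 0])
```

Under the flipped concept only the four new examples remain in the training set, while under the stable concept all eight examples are kept, which is the behavior a selective strategy aims for across sudden and recurring drift.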
COMPOSE: Compacted object sample extraction, a framework for semi-supervised learning in nonstationary environments
An increasing number of real-world applications are associated with streaming data drawn from drifting, nonstationary distributions. These applications demand new algorithms that can learn from and adapt to such changes, also known as concept drift. Properly characterizing such data with existing approaches typically requires a substantial number of labeled instances, which may be difficult, expensive, or even impractical to obtain. In this thesis, compacted object sample extraction (COMPOSE) is introduced, a computational geometry-based framework for learning from nonstationary streaming data where labels are unavailable (or presented only sporadically) after initialization. The feasibility and performance of the algorithm are evaluated on several synthetic and real-world datasets, which present various scenarios of initially labeled streaming environments. On carefully designed synthetic datasets, we also compare the performance of COMPOSE against the optimal Bayes classifier, as well as the arbitrary subpopulation tracker algorithm, which addresses a similar environment referred to as extreme verification latency. Furthermore, using the real-world National Oceanic and Atmospheric Administration weather dataset, we demonstrate that COMPOSE is competitive even with a well-established, fully supervised nonstationary learning algorithm that receives labeled data in every batch.
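COMPOSE's actual compaction uses computational geometry (shrinking each class's region) on streams where labels stop arriving after initialization. The sketch below substitutes a much simpler density-core rule, keeping a fraction of points nearest the class mean, purely to illustrate the label-propagate-then-compact loop; it is not the published algorithm.

```python
def compact_core(points, keep_frac=0.6):
    """Simplified stand-in for geometric compaction: keep the keep_frac of
    points closest to the class mean. COMPOSE instead shrinks a geometric
    boundary (e.g. an alpha-shape) around each class region."""
    mean = sum(points) / len(points)
    k = max(1, int(len(points) * keep_frac))
    return sorted(points, key=lambda p: abs(p - mean))[:k]

def compose_step(cores, unlabeled):
    """One semi-supervised step: label the incoming unlabeled batch with
    the current hypothesis (here, nearest core mean), then carry forward
    only each class's compacted core, so that under gradual drift the
    retained pseudo-labels stay in high-density class regions."""
    means = {c: sum(pts) / len(pts) for c, pts in cores.items()}
    assigned = {c: [] for c in cores}
    for x in unlabeled:
        c = min(means, key=lambda cls: abs(x - means[cls]))
        assigned[c].append(x)
    return {c: compact_core(pts) for c, pts in assigned.items() if pts}

# Initial labeled cores for two classes, then one slightly drifted batch.
cores = {0: [0.0, 1.0], 1: [5.0, 6.0]}
next_cores = compose_step(cores, [0.4, 0.6, 2.0, 5.4, 5.6, 7.0])
```

The boundary points (2.0 and 7.0) are labeled but then dropped by compaction, which is what keeps pseudo-labeling errors from accumulating as the distribution drifts.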
Dynamic Data Mining: Methodology and Algorithms
Supervised data stream mining has become an important and challenging data mining task in modern
organizations. The key challenges are threefold: (1) a possibly infinite number of streaming examples
and time-critical analysis constraints; (2) concept drift; and (3) skewed data distributions.
To address these three challenges, this thesis proposes the novel dynamic data mining (DDM)
methodology by effectively applying supervised ensemble models to data stream mining. DDM can be
loosely defined as categorization-organization-selection of supervised ensemble models. It is inspired
by the idea that although the underlying concepts in a data stream are time-varying, their distinctions
can be identified. Therefore, the models trained on the distinct concepts can be dynamically selected in
order to classify incoming examples of similar concepts.
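The categorize-organize-select idea can be sketched minimally as a pool of models keyed by a concept signature, with the nearest-signature model chosen for each incoming chunk. The signature used here (the chunk's feature mean) and the string stand-ins for models are illustrative assumptions, not the thesis's actual categorization scheme.

```python
class ConceptPool:
    """Sketch of dynamic model selection: models trained on distinct
    concepts are stored with a concept signature (here, simply the mean
    of the training chunk's feature values), and the model whose
    signature is nearest to an incoming chunk's signature is selected."""
    def __init__(self):
        self.pool = []  # (signature, model) pairs

    def add(self, chunk, model):
        self.pool.append((sum(chunk) / len(chunk), model))

    def select(self, chunk):
        sig = sum(chunk) / len(chunk)
        return min(self.pool, key=lambda entry: abs(entry[0] - sig))[1]

pool = ConceptPool()
pool.add([0.1, -0.1, 0.0], "model_A")  # trained while concept A held
pool.add([5.0, 5.2, 4.8], "model_B")   # trained after drift to concept B
```

A chunk resembling concept B's distribution is routed to `model_B`, so a recurring concept is served by the model originally trained on it rather than by a freshly retrained one.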
First, following the general paradigm of DDM, we examine the different concept-drifting stream
mining scenarios and propose corresponding effective and efficient data mining algorithms.
• To address concept drift caused merely by changes of variable distributions, which we term
pseudo concept drift, base models built on categorized streaming data are organized and
selected in line with their corresponding variable distribution characteristics.
• To address concept drift caused by changes of variable and class joint distributions, which we
term true concept drift, an effective data categorization scheme is introduced. A group of
working models is dynamically organized and selected for reacting to the drifting concept.
Secondly, we introduce an integration stream mining framework, enabling the paradigm advocated by
DDM to be widely applicable to other stream mining problems. As a result, we can easily
introduce six effective algorithms for mining data streams with skewed class distributions.
In addition, we introduce a new ensemble model approach for batch learning, following the same
methodology. Both theoretical and empirical studies demonstrate its effectiveness.
Future work will target improving the effectiveness and efficiency of the proposed
algorithms. Meanwhile, we will explore the possibility of using the integration framework to
solve other open stream mining research problems.
Expressive and modular rule-based classifier for data streams
The advances in computing software, hardware, connected devices and wireless
communication infrastructure in recent years have led to the desire to
work with streaming data sources. Yet the number of techniques, approaches
and algorithms which can work with data from a streaming source is still very
limited, compared with batched data. Although data mining techniques have
been a well-studied topic of knowledge discovery for decades, many unique
properties of, and challenges in, learning from a data stream have not been
properly considered, in spite of the growing presence of streaming data
sources and the real need to mine information from them. This thesis aims
to contribute to the field by developing a rule-based algorithm that learns
classification rules directly from data streams, with the learned rules
being expressive enough that a human user can easily interpret the concept
and rationale behind the predictions of the resulting model. There are two
main structures for representing a classification model: the 'tree-based'
structure and the 'rule-based' structure. Even though both forms of
representation are popular and well known in traditional data mining, they
differ in interpretability and in model quality under certain circumstances.
The first part of this thesis analyses background work and relevant topics
in learning classification rules from data streams. This study identifies
the essential requirements for producing high-quality classification rules
from data streams and shows why many systems, algorithms and techniques for
learning classifiers from a static dataset are not applicable in a
streaming environment.
The second part of the thesis investigates a new technique to improve the
efficiency and accuracy of learning heuristics over numeric features from
a streaming data source. Computational cost is one of the important factors
for an effective and practical learning algorithm/system, because of the
need to learn from continuously arriving data examples sequentially and to
discard each example once it has been seen. If the computational cost is
too high, one may not be able to keep pace with the arrival of high-velocity
and possibly unbounded data streams. The proposed technique is first
discussed in the context of using a Gaussian distribution as the heuristic
for building rule terms on numeric features. Secondly, an empirical
evaluation shows the successful integration of the proposed technique into
an existing rule-based algorithm for data streams, eRules.
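A Gaussian heuristic over a numeric feature is only stream-friendly if its parameters can be maintained without storing past examples. A standard way to do this is Welford's one-pass mean/variance update, sketched below; this shows the bookkeeping such a heuristic needs, not the eRules integration itself.

```python
import math

class RunningGaussian:
    """One-pass (Welford) mean/variance, so a per-class Gaussian over a
    numeric feature can be maintained while stream examples arrive and are
    immediately discarded; the resulting density can score candidate rule
    terms on that feature."""
    def __init__(self):
        self.n, self.mean, self._m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Sample variance; 0.0 until at least two examples have arrived.
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0

    def pdf(self, x):
        v = self.variance or 1e-12  # guard against zero variance
        return math.exp(-(x - self.mean) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

g = RunningGaussian()
for value in [1, 2, 3, 4, 5]:
    g.update(value)
```

Each class keeps one `RunningGaussian` per numeric feature; a rule term such as "feature near its class mean" can then be ranked by the density without revisiting discarded examples.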
Continuing the topic of rule-based algorithms for classifying data streams,
the use of Hoeffding's Inequality addresses another problem in learning
from a data stream, namely how much data should be seen before learning
starts and how to keep the model updated over time. By incorporating
Hoeffding's Inequality, this study presents the Hoeffding Rules algorithm,
which can induce modular rules directly from a streaming data source, with
dynamic window sizes throughout the learning period to ensure efficiency
and robustness towards concept drift. Concept drift is another unique
challenge in mining data streams, in which the underlying concept of the
data can change either gradually or abruptly over time, and the learner
should adapt to these changes as quickly as possible.
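The Hoeffding bound behind this family of algorithms is standard: after n independent observations of a variable with range R, the observed mean is within epsilon of the true mean with probability at least 1 - delta, where epsilon = sqrt(R^2 ln(1/delta) / (2n)). Inverting it answers "how much data before learning": the minimal n at which epsilon drops below a tolerance. How Hoeffding Rules maps this onto its dynamic windows is specific to the thesis; the bound itself is sketched here.

```python
import math

def hoeffding_bound(value_range, delta, n):
    """Hoeffding epsilon: with probability at least 1 - delta, the mean of
    n observations of a variable with range value_range lies within
    epsilon of the true mean."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))

def examples_needed(value_range, delta, epsilon):
    """Smallest n for which the bound drops to epsilon; a stream learner
    can wait for this many examples before committing to a decision, and
    re-derive it when the required confidence changes."""
    return math.ceil(value_range ** 2 * math.log(1.0 / delta)
                     / (2.0 * epsilon ** 2))
```

For example, with R = 1 and delta = 0.05, the bound shrinks as the square root of n, so halving epsilon requires four times as many examples.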
This research focuses on the development of a rule-based algorithm,
Hoeffding Rules, for data streams, which treats streaming environments as
primary data sources and addresses several unique challenges in learning
rules from data streams, such as concept drift and computational
efficiency. This work underlines the need for, and importance of,
interpretable machine learning models, applying new studies to improve the
ability to mine useful insights from potentially high-velocity, high-volume
and unbounded data streams. More broadly, this research complements the
study of learning classification rules from data streams, addressing some
of the unique challenges of data streams compared with conventional batch
data, with the knowledge necessary to systematically and effectively learn
expressive and modular classification rules from data streams.