Dynamic Data Mining: Methodology and Algorithms
Supervised data stream mining has become an important and challenging data mining task in modern
organizations. The key challenges are threefold: (1) a possibly infinite number of streaming examples
and time-critical analysis constraints; (2) concept drift; and (3) skewed data distributions.
To address these three challenges, this thesis proposes the novel dynamic data mining (DDM)
methodology by effectively applying supervised ensemble models to data stream mining. DDM can be
loosely defined as categorization-organization-selection of supervised ensemble models. It is inspired
by the idea that although the underlying concepts in a data stream are time-varying, their distinctions
can be identified. Therefore, the models trained on the distinct concepts can be dynamically selected in
order to classify incoming examples of similar concepts.
First, following the general paradigm of DDM, we examine the different concept-drifting stream
mining scenarios and propose corresponding effective and efficient data mining algorithms.
• To address concept drift caused merely by changes of variable distributions, which we term
pseudo concept drift, base models built on categorized streaming data are organized and
selected in line with their corresponding variable distribution characteristics.
• To address concept drift caused by changes of variable and class joint distributions, which we
term true concept drift, an effective data categorization scheme is introduced. A group of
working models is dynamically organized and selected for reacting to the drifting concept.
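The selection step in the bullets above can be illustrated with a minimal sketch. It assumes the pseudo concept drift case, where each stored model is matched to incoming data by comparing variable distributions. The names `ModelPool`, `centroid`, and `sq_dist` are hypothetical, and the per-feature mean is a stand-in for whatever distribution characteristics the thesis actually uses.

```python
from statistics import mean


def centroid(rows):
    """Per-feature mean of a batch of feature vectors."""
    return [mean(col) for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class ModelPool:
    """Hypothetical pool pairing each model with a summary of its training data."""

    def __init__(self):
        self.entries = []  # list of (training centroid, model) pairs

    def add(self, train_rows, model):
        # Assumption: the feature centroid summarizes the variable distribution.
        self.entries.append((centroid(train_rows), model))

    def select(self, batch_rows):
        """Return the model whose training centroid is closest to the batch."""
        target = centroid(batch_rows)
        return min(self.entries, key=lambda entry: sq_dist(entry[0], target))[1]


pool = ModelPool()
pool.add([[0, 0], [1, 1]], lambda x: "A")      # model trained on concept A
pool.add([[10, 10], [11, 11]], lambda x: "B")  # model trained on concept B
chosen = pool.select([[9, 9], [10, 12]])       # incoming batch resembles concept B
```

Here `chosen` is the model registered for the second concept, so incoming examples are classified by a model trained on similar data rather than by the most recent model.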
Secondly, we introduce an integration stream mining framework, enabling the paradigm advocated by
DDM to be widely applicable to other stream mining problems. As a result, we are able to easily
introduce six effective algorithms for mining data streams with skewed class distributions.
In addition, we also introduce a new ensemble model approach for batch learning, following the same
methodology. Both theoretical and empirical studies demonstrate its effectiveness.
Future work will target improving the effectiveness and efficiency of the proposed
algorithms. Meanwhile, we will explore the possibilities of using the integration framework to solve
other open stream mining research problems.
Adaptive Online Sequential ELM for Concept Drift Tackling
A machine learning method needs to adapt to changes in the environment over
time. Such changes are known as concept drift. In this paper, we propose a
concept drift tackling method as an enhancement of Online Sequential Extreme
Learning Machine (OS-ELM) and Constructive Enhancement OS-ELM (CEOS-ELM) by
adding adaptive capability for classification and regression problems. The
scheme is named adaptive OS-ELM (AOS-ELM). It is a single classifier scheme
that works well to handle real drift, virtual drift, and hybrid drift. The
AOS-ELM also works well for sudden drift and recurrent context change types. The
scheme is a simple unified method implemented in simple lines of code. We
evaluated AOS-ELM on regression and classification problems using public
concept drift data sets (SEA and STAGGER) and other public data sets such as
MNIST, USPS, and IDS. Experiments show that our method gives a higher kappa value
compared to the multiclassifier ELM ensemble. Even though AOS-ELM in practice
does not need an increase in hidden nodes, we address some issues related to
increasing the number of hidden nodes, such as error condition and rank values. We
propose taking the rank of the pseudoinverse matrix as an indicator parameter
to detect an underfitting condition.
Comment: Hindawi Publishing. Computational Intelligence and Neuroscience
Volume 2016 (2016), Article ID 8091267, 17 pages Received 29 January 2016,
Accepted 17 May 2016. Special Issue on "Advances in Neural Networks and
Hybrid-Metaheuristics: Theory, Algorithms, and Novel Engineering
Applications". Academic Editor: Stefan Hauf
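For readers unfamiliar with the base learner named in the abstract above, the following is a minimal sketch of a plain OS-ELM update, not the adaptive AOS-ELM scheme itself. It assumes a single output, one-sample updates, sigmoid hidden nodes, and a ridge-style initialization `P = I / reg` in place of the usual batch initialization phase. The class name `OSELM`, the helper `mse`, and the parameter `reg` are illustrative choices.

```python
import math
import random


class OSELM:
    """Single-output OS-ELM sketch using one-sample recursive least squares."""

    def __init__(self, n_inputs, n_hidden, reg=1e-2, seed=0):
        rng = random.Random(seed)
        # Random input weights and biases stay fixed after initialization.
        self.W = [[rng.uniform(-1.0, 1.0) for _ in range(n_inputs)]
                  for _ in range(n_hidden)]
        self.b = [rng.uniform(-1.0, 1.0) for _ in range(n_hidden)]
        self.beta = [0.0] * n_hidden  # output weights, learned online
        # Assumption: start from P = I / reg rather than a batch boot phase.
        self.P = [[1.0 / reg if i == j else 0.0 for j in range(n_hidden)]
                  for i in range(n_hidden)]

    def hidden(self, x):
        """Sigmoid hidden-layer activations for one input vector."""
        return [1.0 / (1.0 + math.exp(-(sum(w * xi for w, xi in zip(row, x)) + bias)))
                for row, bias in zip(self.W, self.b)]

    def predict(self, x):
        return sum(h * b for h, b in zip(self.hidden(x), self.beta))

    def update(self, x, target):
        """Fold one labeled example into P and beta without revisiting old data."""
        h = self.hidden(x)
        Ph = [sum(p * hj for p, hj in zip(row, h)) for row in self.P]
        denom = 1.0 + sum(hi * v for hi, v in zip(h, Ph))
        n = len(h)
        # P <- P - (P h)(P h)^T / (1 + h^T P h); P is symmetric.
        for i in range(n):
            for j in range(n):
                self.P[i][j] -= Ph[i] * Ph[j] / denom
        # beta <- beta + P_new h (target - h^T beta), where P_new h = P h / denom.
        error = target - sum(hi * bi for hi, bi in zip(h, self.beta))
        for i in range(n):
            self.beta[i] += Ph[i] / denom * error


def mse(model, samples):
    """Mean squared error of the model on (x, y) pairs with scalar x."""
    return sum((model.predict([x]) - y) ** 2 for x, y in samples) / len(samples)


# Toy stream (illustrative only): y = 2x on [0, 1], presented one example at a time.
data = [(i / 20.0, 2.0 * (i / 20.0)) for i in range(21)]
model = OSELM(n_inputs=1, n_hidden=8)
mse_before = mse(model, data)
for x, y in data:
    model.update([x], y)
mse_after = mse(model, data)
```

Each update touches only the current example, which is what makes the method suitable for streams. An adaptive variant would add drift handling on top of this update.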
Evolving Ensemble Fuzzy Classifier
The concept of ensemble learning offers a promising avenue in learning from
data streams under complex environments because it addresses the bias and
variance dilemma better than its single model counterpart and features a
reconfigurable structure, which is well suited to the given context. While
various extensions of ensemble learning for mining non-stationary data streams
can be found in the literature, most of them are crafted under a static base
classifier and revisit preceding samples in the sliding window for a
retraining step. This feature causes computationally prohibitive complexity and
is not flexible enough to cope with rapidly changing environments. Their
complexities are often demanding because they involve a large collection of
offline classifiers due to the absence of structural complexity reduction
mechanisms and the lack of an online feature selection mechanism. A novel evolving
ensemble classifier, namely Parsimonious Ensemble (pENsemble), is proposed in
this paper. pENsemble differs from existing architectures in the fact that it
is built upon an evolving classifier from data streams, termed Parsimonious
Classifier (pClass). pENsemble is equipped with an ensemble pruning mechanism,
which estimates a localized generalization error of a base classifier. A
dynamic online feature selection scenario is integrated into the pENsemble.
This method allows for dynamic selection and deselection of input features on
the fly. pENsemble adopts a dynamic ensemble structure to output a final
classification decision where it features a novel drift detection scenario to
grow the ensemble structure. The efficacy of the pENsemble has been
demonstrated through rigorous numerical studies with dynamic and evolving data
streams where it delivers the most encouraging performance in attaining a
tradeoff between accuracy and complexity.
Comment: this paper has been published by IEEE Transactions on Fuzzy System
A Novel Meta-Cognitive Extreme Learning Machine to Learning from Data Streams
© 2015 IEEE. Extreme Learning Machine (ELM) is an answer to an increasing demand for a low-cost learning algorithm to handle big data applications. Nevertheless, existing ELMs leave four problems unaddressed: complexity, uncertainty, concept drift, and the curse of dimensionality. To correct these issues, a novel incremental meta-cognitive ELM, namely Evolving Type-2 Extreme Learning Machine (eT2ELM), is proposed. eT2ELM is built upon the three pillars of meta-cognitive learning, namely what-to-learn, how-to-learn, and when-to-learn, where the notion of ELM is implemented in the how-to-learn component. In addition, eT2ELM is driven by a generalized interval type-2 Fuzzy Neural Network (FNN) as the cognitive constituent, where the interval type-2 multivariate Gaussian function is used in the hidden layer, whereas the nonlinear Chebyshev function is embedded in the output layer. The efficacy of eT2ELM is demonstrated with four data streams possessing various concept drifts, comparisons with prominent classifiers, and statistical tests, where eT2ELM demonstrates the most encouraging learning performance in terms of accuracy and complexity.
Graph ensemble boosting for imbalanced noisy graph stream classification
© 2014 IEEE. Many applications involve stream data with structural dependency, graph representations, and continuously increasing volumes. For these applications, it is very common that their class distributions are imbalanced, with minority (or positive) samples being only a small portion of the population, which imposes significant challenges for learning models to accurately identify minority samples. This problem is further complicated by the presence of noise, because noisy samples are similar to minority samples and any treatment for the class imbalance may falsely focus on the noise and result in deterioration of accuracy. In this paper, we propose a classification model to tackle imbalanced graph streams with noise. Our method, graph ensemble boosting, employs an ensemble-based framework to partition the graph stream into chunks, each containing a number of noisy graphs with imbalanced class distributions. For each individual chunk, we propose a boosting algorithm to combine discriminative subgraph pattern selection and model learning as a unified framework for graph classification. To tackle concept drift in graph streams, an instance-level weighting mechanism is used to dynamically adjust the instance weight, through which the boosting framework can emphasize difficult graph samples. The classifiers built from different graph chunks form an ensemble for graph stream classification. Experiments on real-life imbalanced graph streams demonstrate clear benefits of our boosting design for handling imbalanced noisy graph streams.
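The abstract above does not spell out its instance-level weighting rule. As a rough illustration of how boosting can emphasize difficult samples, the sketch below uses the standard AdaBoost reweighting step, swapped in here as an assumption and not taken from the paper. The function name `reweight` and the clipping constant are illustrative.

```python
import math


def reweight(weights, correct):
    """One AdaBoost-style reweighting step.

    weights: current instance weights (positive numbers).
    correct: booleans, True where the current weak learner was right.
    Returns the normalized new weights and the learner's vote weight alpha.
    """
    total = sum(weights)
    err = sum(w for w, ok in zip(weights, correct) if not ok) / total
    # Clip the error so the logarithm below stays finite (illustrative constant).
    err = min(max(err, 1e-10), 1.0 - 1e-10)
    alpha = 0.5 * math.log((1.0 - err) / err)
    # Misclassified instances are scaled up, correctly classified ones down.
    scaled = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    norm = sum(scaled)
    return [w / norm for w in scaled], alpha


# Four equally weighted instances; only the last one is misclassified.
new_weights, alpha = reweight([0.25, 0.25, 0.25, 0.25], [True, True, True, False])
```

After one step the misclassified instance carries half of the total weight, so the next weak learner is pushed toward getting it right. A noise-aware scheme such as the one the abstract describes would need an additional safeguard so that noisy graphs are not emphasized in the same way.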
A survey on learning from imbalanced data streams: taxonomy, challenges, empirical study, and reproducible experimental framework
Class imbalance poses new challenges when it comes to classifying data
streams. Many algorithms recently proposed in the literature tackle this
problem using a variety of data-level, algorithm-level, and ensemble
approaches. However, there is a lack of standardized and agreed-upon procedures
on how to evaluate these algorithms. This work presents a taxonomy of
algorithms for imbalanced data streams and proposes a standardized, exhaustive,
and informative experimental testbed to evaluate algorithms in a collection of
diverse and challenging imbalanced data stream scenarios. The experimental
study evaluates 24 state-of-the-art data stream algorithms on 515 imbalanced
data streams that combine static and dynamic class imbalance ratios,
instance-level difficulties, concept drift, real-world and semi-synthetic
datasets in binary and multi-class scenarios. This leads to the largest
experimental study conducted so far in the data stream mining domain. We
discuss the advantages and disadvantages of state-of-the-art classifiers in
each of these scenarios and we provide general recommendations to end-users for
selecting the best algorithms for imbalanced data streams. Additionally, we
formulate open challenges and future directions for this domain. Our
experimental testbed is fully reproducible and easy to extend with new methods.
This way we propose the first standardized approach to conducting experiments
in imbalanced data streams that can be used by other researchers to create
trustworthy and fair evaluation of newly proposed methods. Our experimental
framework can be downloaded from
https://github.com/canoalberto/imbalanced-streams
COMPOSE: Compacted object sample extraction a framework for semi-supervised learning in nonstationary environments
An increasing number of real-world applications are associated with streaming data drawn from drifting and nonstationary distributions. These applications demand new algorithms that can learn and adapt to such changes, also known as concept drift. Proper characterization of such data with existing approaches typically requires a substantial amount of labeled instances, which may be difficult, expensive, or even impractical to obtain. In this thesis, compacted object sample extraction (COMPOSE) is introduced - a computational geometry-based framework to learn from nonstationary streaming data - where labels are unavailable (or presented very sporadically) after initialization. The feasibility and performance of the algorithm are evaluated on several synthetic and real-world data sets, which present various scenarios of initially labeled streaming environments. On carefully designed synthetic data sets, we also compare the performance of COMPOSE against the optimal Bayes classifier, as well as the arbitrary subpopulation tracker algorithm, which addresses a similar environment referred to as extreme verification latency. Furthermore, using the real-world National Oceanic and Atmospheric Administration weather data set, we demonstrate that COMPOSE is competitive even with a well-established and fully supervised nonstationary learning algorithm that receives labeled data in every batch.
Learning structure and schemas from heterogeneous domains in networked systems: a survey
The rapidly growing amount of available digital documents of various formats, and the possibility of accessing these through internet-based technologies in distributed environments, have led to the necessity of developing solid methods to properly organize and structure documents in large digital libraries and repositories. Specifically, the extremely large size of document collections makes it impossible to manually organize such documents. Additionally, most of the documents exist in an unstructured form and do not follow any schemas. Therefore, research efforts in this direction are being dedicated to automatically inferring structure and schemas. This is essential in order to better organize huge collections as well as to effectively and efficiently retrieve documents in heterogeneous domains in networked systems. This paper presents a survey of the state-of-the-art methods for inferring structure from documents and schemas in networked environments. The survey is organized around the most important application domains, namely bio-informatics, sensor networks, social networks, P2P systems, automation and control, transportation, and privacy preserving, for which we analyze the recent developments in dealing with unstructured data.
Peer Reviewed. Postprint (published version)