56 research outputs found
Data Stream Classification Using Classifier Ensemble
For contemporary business, the crucial factor is making smart decisions on the basis of the knowledge hidden in stored data. Unfortunately, traditional simple methods of data analysis are not sufficient for efficient management of modern enterprises, because they are not appropriate for the huge and growing amount of data these enterprises store. Additionally, data usually arrives continuously in the form of a so-called data stream. The great disadvantage of traditional classification methods is that they assume that the statistical properties of the discovered concept remain unchanged, while in real situations we can observe so-called concept drift, which can be caused by changes in the class probabilities and/or the class-conditional probability distributions. The ability to take new training data into account is an important feature of machine learning methods used in security applications (spam filtering or intrusion detection) or decision support systems for marketing departments, which need to follow changing client behavior. Unfortunately, the occurrence of concept drift dramatically decreases classification accuracy. This work presents a comprehensive study of the ensemble classifier approach applied to the problem of drifting data streams. In particular, it reports research on modifications of the previously developed Weighted Aging Classifier Ensemble (WAE) algorithm, which is able to construct a valuable classifier ensemble for classification of incrementally drifting stream data. We generalize the WAE method and propose a general framework for this approach. Such a framework can prune a classifier ensemble before or after assigning weights to individual classifiers. Additionally, we propose new classifier pruning criteria, weight calculation methods, and aging operators. 
We also propose a rejuvenating operator, which is able to soften the aging effect. This can be useful especially when quite "old" classifiers are high-quality models, i.e., their presence increases ensemble accuracy, which can occur, e.g., in the case of recurring concept drift. The chosen characteristics of the proposed frameworks were evaluated on the basis of a wide range of computer experiments carried out on two benchmark data streams. The obtained results confirmed the usability of the proposed method for data stream classification in the presence of incremental concept drift.
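To make the weighting, aging, and pruning steps described in the abstract above more concrete, here is a minimal sketch. It is not the WAE algorithm itself: the function names `aged_weights` and `prune`, the square-root aging rule, and the "keep the highest weights" pruning criterion are all assumptions chosen for illustration.

```python
import math


def aged_weights(accuracies, ages):
    """Return normalized voting weights.

    Assumption (illustrative only): each classifier's accuracy is damped
    by the square root of its age, so older members count for less.
    Ages are positive integers (number of chunks spent in the ensemble).
    """
    raw = [acc / math.sqrt(age) for acc, age in zip(accuracies, ages)]
    total = sum(raw)
    return [r / total for r in raw]


def prune(members, weights, max_size):
    """Keep the max_size members with the largest weights (hypothetical criterion)."""
    ranked = sorted(zip(weights, members), key=lambda pair: pair[0], reverse=True)
    return [member for _, member in ranked[:max_size]]


# Two equally accurate classifiers; the second one is four chunks old.
weights = aged_weights([0.9, 0.9], [1, 4])
kept = prune(["clf_a", "clf_b", "clf_c"], [0.2, 0.5, 0.3], max_size=2)
```

In this sketch pruning happens after weighting; the abstract notes that the framework also allows pruning before weights are assigned.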
DynED: Dynamic Ensemble Diversification in Data Stream Classification
Ensemble methods are commonly used in classification due to their remarkable
performance. Achieving high accuracy in a data stream environment is a
challenging task considering disruptive changes in the data distribution, also
known as concept drift. A greater diversity of ensemble components is known to
enhance prediction accuracy in such settings. Despite the diversity of
components within an ensemble, not all contribute as expected to its overall
performance. This necessitates a method for selecting components that exhibit
high performance and diversity. We present a novel ensemble construction and
maintenance approach based on MMR (Maximal Marginal Relevance) that dynamically
combines the diversity and prediction accuracy of components during the process
of structuring an ensemble. The experimental results on four real and 11
synthetic datasets demonstrate that the proposed approach (DynED) provides a
higher average mean accuracy compared to the five state-of-the-art baselines. Comment: Proceedings of the 32nd ACM International Conference on Information
and Knowledge Management (CIKM '23), October 21--25, 2023, Birmingham, United
Kingdo
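The abstract above names Maximal Marginal Relevance (MMR) as the basis for trading off accuracy against diversity. The following is a generic greedy MMR selection sketch, not DynED's actual procedure: the helper `agreement`, the use of prediction agreement as the similarity measure, and the parameter `lam` are assumptions made for illustration.

```python
def agreement(p, q):
    """Fraction of positions where two label sequences agree."""
    return sum(a == b for a, b in zip(p, q)) / len(p)


def mmr_select(preds, y_true, k, lam=0.5):
    """Greedily pick k component indices by an MMR-style score.

    score(i) = lam * accuracy(i) - (1 - lam) * max similarity to already
    selected components. Similarity is prediction agreement (an assumption).
    """
    acc = [agreement(p, y_true) for p in preds]
    selected = []
    candidates = list(range(len(preds)))
    while candidates and len(selected) < k:

        def score(i):
            redundancy = max(
                (agreement(preds[i], preds[j]) for j in selected), default=0.0
            )
            return lam * acc[i] - (1 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Components 0 and 1 predict identically; component 2 errs on a different sample.
y = [1, 1, 1, 1]
component_preds = [[1, 1, 1, 0], [1, 1, 1, 0], [0, 1, 1, 1]]
chosen = mmr_select(component_preds, y, k=2)
```

With equal accuracies, the second pick favors the component whose errors differ from the first pick, which is the diversity effect the abstract describes.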
Always Strengthen Your Strengths: A Drift-Aware Incremental Learning Framework for CTR Prediction
Click-through rate (CTR) prediction is of great importance in recommendation
systems and online advertising platforms. When served in industrial scenarios,
the user-generated data observed by the CTR model typically arrives as a
stream. Streaming data has the characteristic that the underlying distribution
drifts over time and may recur. This can lead to catastrophic forgetting if the
model simply adapts to the new data distribution all the time. Also, it is
inefficient to relearn a distribution that has already occurred. Due to memory
constraints and diversity of data distributions in large-scale industrial
applications, conventional strategies for catastrophic forgetting such as
replay, parameter isolation, and knowledge distillation are difficult to
deploy. In this work, we design a novel drift-aware incremental learning
framework based on ensemble learning to address catastrophic forgetting in CTR
prediction. With explicit error-based drift detection on streaming data, the
framework further strengthens well-adapted ensembles and freezes ensembles that
do not match the input distribution, avoiding catastrophic interference. Both
offline experiments and an A/B test show that our method
outperforms all baselines considered. Comment: This work has been accepted by SIGIR2
Dynamic adversarial mining - effectively applying machine learning in adversarial non-stationary environments.
While understanding of machine learning and data mining is still in its budding stages, their engineering applications have found immense acceptance and success. Cybersecurity applications such as intrusion detection systems, spam filtering, and CAPTCHA authentication have all begun adopting machine learning as a viable technique to deal with large-scale adversarial activity. However, the naive usage of machine learning in an adversarial setting is prone to reverse engineering and evasion attacks, as most of these techniques were designed primarily for a static setting. The security domain is a dynamic landscape, with an ongoing, never-ending arms race between the system designer and the attackers. Any solution designed for such a domain needs to take into account an active adversary and needs to evolve over time in the face of emerging threats. We term this the "Dynamic Adversarial Mining" problem, and the presented work provides the foundation for this new interdisciplinary area of research, at the crossroads of Machine Learning, Cybersecurity, and Streaming Data Mining. We start with a white hat analysis of the vulnerabilities of classification systems to exploratory attack. The proposed "Seed-Explore-Exploit" framework provides characterization and modeling of attacks, ranging from simple random evasion attacks to sophisticated reverse engineering. It is observed that even systems with prediction accuracy close to 100% can be easily evaded with more than 90% precision. This evasion can be performed without any information about the underlying classifier, training dataset, or the domain of application. Attacks on machine learning systems cause the data to exhibit non-stationarity (i.e., the training and the testing data have different distributions). It is necessary to detect these changes in distribution, called concept drift, as they could cause the prediction performance of the model to degrade over time. 
However, the detection cannot overly rely on labeled data to compute performance explicitly and monitor a drop, as labeling is expensive and time consuming, and at times may not be possible at all. As such, we propose the "Margin Density Drift Detection (MD3)" algorithm, which can reliably detect concept drift from unlabeled data only. MD3 provides high detection accuracy with a low false alarm rate, making it suitable for cybersecurity applications, where excessive false alarms are expensive and can lead to loss of trust in the warning system. Additionally, MD3 is designed as a classifier-independent, streaming algorithm for use in a variety of continuous, never-ending learning systems. We then propose a "Dynamic Adversarial Mining" based learning framework for learning in non-stationary and adversarial environments, which provides "security by design". The proposed "Predict-Detect" classifier framework aims to provide robustness against attacks, ease of attack detection using unlabeled data, and swift recovery from attacks. Ideas of feature hiding and obfuscation of feature importance are proposed as strategies to enhance the learning framework's security. Metrics for evaluating the dynamic security of a system and its recoverability after an attack are introduced to provide a practical way of measuring the efficacy of dynamic security strategies. The framework is developed as a streaming data methodology, capable of continually functioning with limited supervision and effectively responding to adversarial dynamics. The developed ideas, methodology, algorithms, and experimental analysis aim to provide a foundation for future work in the area of "Dynamic Adversarial Mining", wherein a holistic approach to machine learning based security is motivated.
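The idea of detecting drift from unlabeled data via margin density, mentioned in the abstract above, can be sketched roughly as follows. This is a simplified illustration and not the MD3 algorithm as specified by its author: the linear scoring function, the function names, and the fixed `threshold` rule are assumptions.

```python
def margin_density(X, w, b, margin=1.0):
    """Fraction of unlabeled samples whose linear score falls inside the margin.

    A sample x is counted when |w . x + b| <= margin, i.e. it lies in the
    classifier's region of uncertainty. No labels are needed.
    """
    inside = 0
    for x in X:
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        if abs(score) <= margin:
            inside += 1
    return inside / len(X)


def drift_suspected(reference_md, current_md, threshold=0.2):
    """Flag drift when margin density moves by more than a fixed threshold.

    The fixed threshold is an assumption; a real detector would calibrate it.
    """
    return abs(current_md - reference_md) > threshold


# Hypothetical linear classifier that only looks at the first feature.
w, b = [1.0, 0.0], 0.0
reference_window = [[3.0, 0.0], [-3.0, 0.0], [2.0, 1.0], [-2.0, 1.0]]
current_window = [[0.5, 0.0], [-0.2, 1.0], [3.0, 0.0], [0.1, 0.0]]

ref_md = margin_density(reference_window, w, b)
cur_md = margin_density(current_window, w, b)
alarm = drift_suspected(ref_md, cur_md)
```

Here the current window has many more samples near the decision boundary than the reference window, so the sketch raises an alarm without consulting any labels.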
Continual Learning in Medical Image Analysis: A Comprehensive Review of Recent Advancements and Future Prospects
Medical imaging analysis has witnessed remarkable advancements, even
surpassing human-level performance in recent years, driven by the rapid
development of advanced deep-learning algorithms. However, when the inference
dataset slightly differs from what the model has seen during one-time training,
the model performance is greatly compromised. The situation requires restarting
the training process using both the old and the new data, which is
computationally costly, does not align with the human learning process, and
imposes storage constraints and privacy concerns. Alternatively, continual
learning has emerged as a crucial approach for developing unified and
sustainable deep models to deal with new classes, tasks, and the drifting
nature of data in non-stationary environments for various application areas.
Continual learning techniques enable models to adapt and accumulate knowledge
over time, which is essential for maintaining performance on evolving datasets
and novel tasks. This systematic review paper provides a comprehensive overview
of the state-of-the-art in continual learning techniques applied to medical
imaging analysis. We present an extensive survey of existing research, covering
topics including catastrophic forgetting, data drifts, stability, and
plasticity requirements. Further, an in-depth discussion of key components of a
continual learning framework such as continual learning scenarios, techniques,
evaluation schemes, and metrics is provided. Continual learning techniques
encompass various categories, including rehearsal, regularization,
architectural, and hybrid strategies. We assess the popularity and
applicability of continual learning categories in various medical sub-fields
like radiology and histopathology.
Online transfer learning for concept drifting data streams
Online Transfer Learning (TL) allows knowledge to be learnt from a data-rich source domain to aid predictions in an online target domain. However, when all domains are online, and a data-rich source domain does not exist, we must determine what to transfer, how to combine transferred knowledge, and whether to transfer knowledge. To ensure the feasibility of online TL methods in real-world applications, they should not only aid predictions in receiving domains, but should also consider the communication and computational overheads of knowledge transfer. To address these challenges, this thesis presents methods for online TL when all domains are online, which are evaluated using synthetic and real-world regression-based datasets. First, the BOTL framework is introduced, which enables knowledge transfer to be conducted bi-directionally between online data streams, where knowledge is transferred in the form of predictive models and combined using an OLS meta-learner. Second, two methods of selecting a relevant yet diverse subset of transferred and locally learnt models are presented, namely parameterised thresholding and conceptual clustering. These approaches help to prevent overfitting when the number of models transferred is large in comparison to the window of available data. To reduce the computational overhead of selecting subsets of models, a static diversity metric is introduced, which estimates the conceptual similarity between models using the Principal Angles (PAs) between their underlying subspaces. Third, two methods for determining whether to transfer knowledge are presented, namely IdDT and IdCS, which maintain predictive performance comparable to that obtained when all models are transferred, while reducing the number of models received in each domain by 47.1% and 30% respectively across the experiments conducted for this thesis.
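As a rough illustration of combining transferred and local models with an ordinary least squares (OLS) meta-learner, as the abstract above describes, the sketch below fits combination weights on a window of base-model predictions. It is not the BOTL implementation: the function names, the intercept term, and solving the normal equations directly are assumptions, and singular systems are not handled.

```python
def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting.

    Assumes A is square and non-singular (no check is performed).
    """
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def fit_ols_weights(P, y):
    """Fit [intercept, w_1, ..., w_M] where P[t][m] is model m's prediction at time t."""
    X = [[1.0] + list(row) for row in P]
    d = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(d)] for i in range(d)]
    Xty = [sum(r[i] * t for r, t in zip(X, y)) for i in range(d)]
    return solve_linear(XtX, Xty)


def combine(weights, preds):
    """Meta-learner output for one instance, given each base model's prediction."""
    return weights[0] + sum(w * p for w, p in zip(weights[1:], preds))


# Window of predictions from two hypothetical base models and the observed targets.
window_preds = [[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0]]
targets = [0.3 * a + 0.7 * b for a, b in window_preds]
weights = fit_ols_weights(window_preds, targets)
```

When the number of base models approaches the window length, the normal equations become ill-conditioned, which is one way to see why the abstract emphasizes selecting a smaller, diverse subset of models.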
End-to-end anomaly detection in stream data
Nowadays, huge volumes of data are generated with increasing velocity through various systems, applications, and activities. This increases the demand for stream and time series analysis to react to changing conditions in real time, for enhanced efficiency and quality of service delivery as well as improved safety and security in the private and public sectors. Despite its very rich history, time series anomaly detection is still one of the vital topics in machine learning research and is receiving increasing attention. Identifying hidden patterns and selecting an appropriate model that fits the observed data well and also carries over to unobserved data is not a trivial task. Due to the increasing diversity of data sources and associated stochastic processes, this pivotal data analysis topic is loaded with various challenges, like complex latent patterns, concept drift, and overfitting, that may mislead the model and cause a high false alarm rate. Handling these challenges leads advanced anomaly detection methods to develop sophisticated decision logic, which turns them into mysterious and inexplicable black boxes. Contrary to this trend, end-users expect transparency and verifiability in order to trust a model and the outcomes it produces. Also, pointing the users to the most anomalous/malicious areas of time series and causal features could save them time, energy, and money. For these reasons, this thesis addresses the crucial challenges in an end-to-end pipeline of stream-based anomaly detection through the three essential phases of behavior prediction, inference, and interpretation. The first step is focused on devising a time series model that leads to high average accuracy as well as small error deviation. On this basis, we propose higher-quality anomaly detection and scoring techniques that utilize the related contexts to reclassify the observations and post-prune the unjustified events. 
Last but not least, we make the predictive process transparent and verifiable by providing meaningful reasoning behind its generated results, based on concepts understandable by a human. The provided insight can pinpoint the anomalous regions of time series and explain why the current status of a system has been flagged as anomalous. Stream-based anomaly detection research is a principal area of innovation to support our economy, security, and even the safety and health of societies worldwide. We believe our proposed analysis techniques can contribute to building a situational awareness platform and open new perspectives in a variety of domains like cybersecurity and health.