An ensemble learning framework for anomaly detection in building energy consumption
During building operation, a significant amount of energy is wasted due to equipment and human-related faults. To reduce waste, today's smart buildings monitor energy usage with the aim of identifying abnormal consumption behaviour and notifying the building manager to implement appropriate energy-saving procedures. To this end, this research proposes a new pattern-based anomaly classifier, the collective contextual anomaly detection using sliding window (CCAD-SW) framework. The CCAD-SW framework identifies anomalous consumption patterns using overlapping sliding windows. To enhance the anomaly detection capacity of the CCAD-SW, this research also proposes the ensemble anomaly detection (EAD) framework. The EAD is a generic framework that combines several anomaly detection classifiers using majority voting. To ensure diversity of anomaly classifiers, the EAD is implemented by combining pattern-based (e.g., CCAD-SW) and prediction-based anomaly classifiers. The research was evaluated using real-world data provided by Powersmiths, located in Brampton, Ontario, Canada. Results show that the EAD framework improved the sensitivity of the CCAD-SW by 3.6% and reduced the false alarm rate by 2.7%.
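The overlapping sliding windows at the core of CCAD-SW can be sketched in a few lines. The window size, step, and readings below are illustrative stand-ins, not values from the thesis:

```python
# Minimal sketch: turn a consumption series into overlapping windows, which a
# pattern-based classifier can then score. Window size and step are invented.
def sliding_windows(series, size, step):
    """Return overlapping windows of `size` readings, advancing by `step`."""
    return [series[i:i + size] for i in range(0, len(series) - size + 1, step)]

readings = [3.1, 3.0, 3.2, 7.9, 8.1, 3.1, 3.0]  # e.g. hourly kWh readings
for window in sliding_windows(readings, size=3, step=2):
    print(window)
```

Because consecutive windows overlap, each reading is scored in several contexts, which is what lets the framework catch anomalies that are only abnormal relative to their surrounding pattern.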
Collective Contextual Anomaly Detection for Building Energy Consumption
Commercial and residential buildings are responsible for a substantial portion of total global energy consumption and, as a result, make a significant contribution to global carbon emissions. Hence, energy-saving goals that target buildings can have a major impact in reducing environmental damage. During building operation, a significant amount of energy is wasted due to equipment and human-related faults. To reduce waste, today's smart buildings monitor energy usage with the aim of identifying abnormal consumption behaviour and notifying the building manager to implement appropriate energy-saving procedures. To this end, this research proposes the ensemble anomaly detection (EAD) framework. The EAD is a generic framework that combines several anomaly detection classifiers using majority voting. These anomaly detection classifiers are built from existing machine learning algorithms, and each is assumed to carry equal weight. More importantly, to ensure diversity of anomaly classifiers, the EAD is implemented by combining pattern-based and prediction-based anomaly classifiers. For this reason, this research also proposes a new pattern-based anomaly classifier, the collective contextual anomaly detection using sliding window (CCAD-SW) framework. The CCAD-SW is a machine learning-based framework that identifies anomalous consumption patterns using overlapping sliding windows. The EAD framework combines the CCAD-SW, implemented using an autoencoder, with two prediction-based anomaly classifiers implemented using the support vector regression and random forest machine-learning algorithms. In addition, it determines an ensemble threshold that yields an anomaly classifier with optimal anomaly detection capability and minimal false positives. Results show that the EAD performs better than the individual anomaly detection classifiers.
In the EAD framework, the optimal ensemble anomaly classifier is not attained by combining the individual learners at their respective optimal performance levels. Instead, an ensemble threshold combination that yields the optimal anomaly classifier was identified by searching through the ensemble threshold space. The research was evaluated using real-world data provided by Powersmiths, located in Brampton, Ontario, Canada.
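The threshold-space search can be illustrated as a small grid search: majority-vote the thresholded scores, then keep the threshold combination with the best validation trade-off. The scores, truth labels, threshold grid, and metric below are invented stand-ins for the thesis's classifiers and evaluation:

```python
# Hedged sketch of the ensemble threshold search. Each classifier's scores are
# thresholded, the binary decisions are majority-voted, and the combination
# maximising (sensitivity - false-alarm rate) on validation data is kept.
from itertools import product

scores = {                         # toy anomaly scores from three classifiers
    "ccad_sw": [0.9, 0.2, 0.8, 0.1],
    "svr":     [0.7, 0.3, 0.6, 0.2],
    "rf":      [0.8, 0.1, 0.9, 0.3],
}
truth = [1, 0, 1, 0]               # 1 = window labelled anomalous

def evaluate(thresholds, scores, truth):
    """Majority-vote the thresholded scores, then score the ensemble."""
    names = list(scores)
    preds = [int(sum(scores[n][i] >= thresholds[n] for n in names) > len(names) / 2)
             for i in range(len(truth))]
    tp = sum(p and t for p, t in zip(preds, truth))
    fp = sum(p and not t for p, t in zip(preds, truth))
    pos, neg = sum(truth), len(truth) - sum(truth)
    return tp / pos - fp / neg     # sensitivity minus false-alarm rate

# Search the (tiny, illustrative) threshold grid rather than fixing each
# classifier at its own individually optimal threshold.
best = max((dict(zip(scores, combo)) for combo in product([0.25, 0.5, 0.75], repeat=3)),
           key=lambda th: evaluate(th, scores, truth))
print(best, evaluate(best, scores, truth))
```

The point of the search is exactly the observation in the abstract: the combination that is best for the ensemble need not use each learner's own best operating point.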
Ensemble Methods for Anomaly Detection
Anomaly detection has many applications in numerous areas such as intrusion detection, fraud detection, and medical diagnosis. Most current techniques are specialized for detecting one type of anomaly, and work well on specific domains and when the data satisfies specific assumptions.
We address this problem, proposing ensemble anomaly detection techniques that perform well in many applications, with four major contributions: using bootstrapping to better detect anomalies on multiple subsamples, sequential application of diverse detection algorithms, a novel adaptive sampling and learning algorithm in which the anomalies are iteratively examined, and improving the random forest algorithms for detecting anomalies in streaming data.
We design and evaluate multiple ensemble strategies using score normalization, rank aggregation and majority voting, to combine the results from six well-known base algorithms. We propose a bootstrapping algorithm in which anomalies are evaluated from multiple subsets of the data. Results show that our independent ensemble performs better than the base algorithms, and using bootstrapping achieves competitive quality and faster runtime compared with existing works.
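The bootstrapping idea can be sketched as scoring every point against several random subsamples and aggregating the results. The distance-to-subsample-mean scorer below is a toy stand-in for the six base algorithms, and all values are invented:

```python
# Hedged sketch of bootstrapped anomaly scoring: each point is scored against
# several random subsamples of the data, and the scores are averaged so that
# anomalies stand out more stably than with a single pass over the full set.
import random

def bootstrap_scores(data, n_subsets=10, frac=0.6, seed=0):
    rng = random.Random(seed)
    totals = [0.0] * len(data)
    for _ in range(n_subsets):
        sample = rng.sample(data, int(len(data) * frac))
        mu = sum(sample) / len(sample)           # toy "base detector": distance
        for i, x in enumerate(data):             # from the subsample mean
            totals[i] += abs(x - mu)
    return [t / n_subsets for t in totals]

data = [1.0, 1.1, 0.9, 1.0, 9.5]   # the last point is an obvious outlier
scores = bootstrap_scores(data)
print(max(range(len(data)), key=scores.__getitem__))  # -> 4 (the outlier)
```

In the thesis the per-subsample scorers are real detection algorithms and the aggregation uses score normalization or rank aggregation; the structure of the loop is the same.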
We develop new sequential ensemble algorithms in which the second algorithm performs anomaly detection based on the first algorithm's outputs; best results are obtained by combining algorithms that are substantially different. We propose a novel adaptive sampling algorithm which uses the score output of the base algorithm to determine the hard-to-detect examples, and iteratively resamples more points from such examples in a completely unsupervised context.
On streaming datasets, we analyze the impact of parameters used in random trees, and propose new algorithms that work well with high-dimensional data, improving performance without increasing the number of trees or their heights. We show that further improvements can be obtained with an Evolutionary Algorithm.
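The adaptive sampling step can be sketched as weighting points by how close their anomaly scores sit to the decision boundary, so later rounds resample the hard cases more heavily. The points, scores, and boundary below are invented:

```python
# Hedged sketch of adaptive sampling: points whose scores are closest to the
# decision boundary are the hard-to-detect examples, so they get the largest
# sampling weights in the next iteration. All inputs here are toy values.
import random

def resample_hard(points, scores, boundary, k, seed=0):
    """Draw k points, weighting those whose scores are nearest the boundary."""
    weights = [1.0 / (abs(s - boundary) + 1e-6) for s in scores]
    return random.Random(seed).choices(points, weights=weights, k=k)

# Point 1 sits exactly on the boundary, so it dominates the resample.
sample = resample_hard(points=[0, 1, 2], scores=[0.1, 0.5, 0.9],
                       boundary=0.5, k=200)
print(sample.count(1))
```

In the thesis the scores come from a real base detector and the loop repeats, re-scoring after each resampling round; this fragment shows only the weighting rule.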
EDMON - Electronic Disease Surveillance and Monitoring Network: A Personalized Health Model-based Digital Infectious Disease Detection Mechanism using Self-Recorded Data from People with Type 1 Diabetes
Through time, we as a society have been tested by infectious disease outbreaks of different magnitudes, which often pose major public health challenges. To mitigate these challenges, research endeavors have focused on early detection mechanisms: identifying potential data sources, modes of data collection and transmission, and case and outbreak detection methods. Driven by the ubiquitous nature of smartphones and wearables, the current endeavor is targeted towards individualizing the surveillance effort through a personalized health model, where case detection is realized by exploiting self-collected physiological data from wearables and smartphones.
This dissertation aims to demonstrate the concept of a personalized health model as a case detector for outbreak detection by utilizing self-recorded data from people with type 1 diabetes. The results have shown that infection onset triggers substantial deviations, i.e. prolonged hyperglycemia despite increased insulin injections and reduced carbohydrate consumption. Per the findings, key parameters such as blood glucose level, insulin, carbohydrate, and insulin-to-carbohydrate ratio are found to carry high discriminative power. A personalized health model devised based on a one-class classifier and an unsupervised method using selected parameters achieved promising detection performance. Experimental results show the superior performance of the one-class classifier approach; models such as the one-class support vector machine, k-nearest neighbor, and k-means achieved the best performance. Further, the results also revealed the effect of input parameters, data granularity, and sample size on model performance.
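The personalized one-class setup can be sketched as training only on one person's normal days and flagging days that deviate too far. The z-score detector below is a simple stand-in for the one-class SVM, k-NN, and k-means models in the dissertation, and every feature value is invented:

```python
# Hedged sketch of a personalized one-class detector: fit per-feature mean and
# standard deviation on an individual's normal days, flag days with any large
# deviation. A z-score rule stands in for the one-class SVM; data is invented.
def fit(normal_days):
    n = len(normal_days)
    mu = [sum(d[i] for d in normal_days) / n for i in range(len(normal_days[0]))]
    sd = [max((sum((d[i] - mu[i]) ** 2 for d in normal_days) / n) ** 0.5, 1e-9)
          for i in range(len(mu))]
    return mu, sd

def is_anomalous(day, model, cutoff=3.0):
    mu, sd = model
    return any(abs(x - m) / s > cutoff for x, m, s in zip(day, mu, sd))

# Features per day: [mean glucose (mmol/L), insulin (units), carbs (g)]
normal = [[6.1, 40, 210], [5.9, 42, 200], [6.3, 39, 220], [6.0, 41, 205]]
infection_day = [9.8, 55, 140]   # hyperglycemia despite more insulin, fewer carbs
model = fit(normal)
print(is_anomalous(infection_day, model))  # -> True
```

The key property mirrored here is that the model is fit per person, so "anomalous" means anomalous for that individual rather than for a population average.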
The presented results have practical significance for understanding the effect of infection episodes amongst people with type 1 diabetes, and the potential of a personalized health model in outbreak detection settings. The added benefit of the personalized health model concept introduced in this dissertation lies in its usefulness beyond surveillance, i.e. to devise decision support tools and learning platforms for the patient to manage infection-induced crises.
An Ensemble Self-Structuring Neural Network Approach to Solving Classification Problems with Virtual Concept Drift and its Application to Phishing Websites
Classification in data mining is one of the well-known tasks that aim to construct a classification model from a labelled input data set. Most classification models are devoted to a static environment where the complete training data set is presented to the classification algorithm. This data set is assumed to cover all information needed to learn the pertinent concepts (rules and patterns) related to how to classify unseen examples to predefined classes. However, in dynamic (non-stationary) domains, the set of features (input data attributes) may change over time. For instance, some features that are considered significant at time Ti might become useless or irrelevant at time Ti+j. This situation results in a phenomenon called Virtual Concept Drift. Yet, the set of features that are dropped at time Ti+j might become significant again in the future. Such a situation results in the so-called Cyclical Concept Drift, which is a direct result of the well-known catastrophic forgetting dilemma. Catastrophic forgetting happens when the learning of new knowledge completely removes the previously learned knowledge.
Phishing is a dynamic classification problem in which a virtual concept drift might occur. Yet, the virtual concept drift that occurs in phishing might be guided by some malevolent intelligent agent rather than occurring naturally. One reason why phishers keep changing the feature combinations when creating phishing websites might be that they have the ability to interpret the anti-phishing tool and thus pick a new set of features that can circumvent it. However, besides its generalisation capability, fault tolerance, and strong ability to learn, a Neural Network (NN) classification model is considered a black box. Hence, even someone with the skills to hack into the NN-based classification model might face difficulties interpreting and understanding how the NN processes the input data to produce the final decision (assign a class value).
In this thesis, we investigate the problem of virtual concept drift by proposing a framework that can keep pace with the continuous changes in the input features. The proposed framework has been applied to the phishing website classification problem, where it shows competitive results with respect to various evaluation measures (harmonic mean (F1-score), precision, accuracy, etc.) when compared to several other data mining techniques. The framework creates an ensemble of classifiers (a group of classifiers) and offers a balance between stability (maintaining previously learned knowledge) and plasticity (learning knowledge from the newly offered training data set). Hence, the framework can also handle cyclical concept drift. The classifiers that constitute the ensemble are created using an improved Self-Structuring Neural Network algorithm (SSNN). Traditionally, NN modelling techniques rely on trial and error, which is a tedious and time-consuming process. The SSNN simplifies structuring NN classifiers with minimum intervention from the user. The framework evaluates the ensemble whenever a new data set chunk is collected. If the overall accuracy of the combined results from the ensemble drops significantly, a new classifier is created using the SSNN and added to the ensemble. Overall, the experimental results show that the proposed framework affords a balance between stability and plasticity and can effectively handle virtual concept drift when applied to the phishing website classification problem. Most of the chapters of this thesis have been subject to publication.
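The framework's update loop can be sketched as: score each new chunk with the ensemble, and if accuracy drops significantly, train a new member on that chunk. A trivial majority-class learner stands in for the SSNN below, and the drop threshold and data are invented:

```python
# Hedged sketch of the drift-handling loop: an accuracy drop on a new chunk
# triggers training a new classifier and adding it to the ensemble, preserving
# old members (stability) while learning the new chunk (plasticity).
class MajorityClassLearner:
    """Toy stand-in for an SSNN-built classifier: predicts its training
    chunk's majority class."""
    def fit(self, labels):
        self.label = max(set(labels), key=labels.count)
        return self
    def predict(self):
        return self.label

def ensemble_predict(ensemble):
    votes = [clf.predict() for clf in ensemble]
    return max(set(votes), key=votes.count)

def process_chunk(ensemble, labels, drop_threshold=0.6):
    pred = ensemble_predict(ensemble)
    accuracy = labels.count(pred) / len(labels)
    if accuracy < drop_threshold:            # significant drop: drift suspected
        ensemble.append(MajorityClassLearner().fit(labels))
    return accuracy

# First member learned a mostly-legitimate chunk; the next chunk is mostly
# phishing, so accuracy drops to 0.30 and a second member is added.
ensemble = [MajorityClassLearner().fit(["legit"] * 8 + ["phish"] * 2)]
accuracy = process_chunk(ensemble, ["phish"] * 7 + ["legit"] * 3)
print(accuracy, len(ensemble))
```

Old members are never discarded, which is how the same mechanism also covers cyclical drift: a previously learned concept that returns is still represented in the ensemble.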
Ensemble methods in intrusion detection
As services are being deployed on the internet, there is a need to secure the infrastructure from malicious attacks. Intrusion detection serves as a second line of defense apart from firewalls and cryptography. There are many techniques employed in intrusion detection, including signature-based, anomaly-based, and specification-based detection systems. These techniques often trade off accuracy against false positive rate. In this study, anomaly detection using ensembles is used to automatically classify and detect attack patterns. It has been shown that ensembles of classifiers outperform their base classifiers. Multiple classifiers have been combined to improve the performance of intrusion detection systems; commonly used classifiers include Support Vector Machines, Decision Trees, Genetic Algorithms, fuzzy logic, and Principal Component Analysis. The study employed the KStar and Instance-Based classification algorithms to detect intrusions in the NSL-KDD dataset. The results show that the ensemble we designed has a 1-error rate (accuracy) of 99.67% and a false positive rate of 0.33%. The response time of the anomaly detector is 0.18 seconds. The chosen ensemble outperformed the rest of the ensembles (rPART & SMO and J48) and the base classifiers. The performance of the combiners shows that the study has built a model with high detection and reduced error.
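An instance-based member of such an ensemble can be sketched as a nearest-neighbour lookup over past connection records. The 1-NN rule below stands in for the IBk-style learners used in the study, and the toy records are invented (NSL-KDD records have 41 features):

```python
# Hedged sketch of an instance-based (IBk-style, k=1) intrusion classifier:
# label a new connection record with the class of its nearest training record.
# Features and labels are invented toy values, not NSL-KDD data.
def nearest_label(train, query):
    """1-nearest-neighbour by Euclidean distance over numeric features."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    _, label = min(train, key=lambda rec: dist(rec[0], query))
    return label

# Toy records: [failed-login rate, duration (s), suspicious-flag]
train = [([0.1, 120, 0], "normal"),
         ([0.2, 100, 0], "normal"),
         ([9.0, 1, 1],   "attack"),   # many failed logins, tiny duration
         ([8.5, 2, 1],   "attack")]
print(nearest_label(train, [8.8, 3, 1]))  # -> attack
```

In the study this kind of lazy learner is one voter among several; combining its decision with the other classifiers' votes is what produces the reported ensemble performance.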