Tie-breaking in Hoeffding trees
A thorough examination of the performance of Hoeffding trees, the state of the art in classification for data streams, on a range of datasets reveals that tie breaking, an essential but supposedly rare procedure, is invoked far more often than expected. Testing with a lightweight method for handling continuous attributes, we find that the excessive invocation of tie breaking causes performance to degrade significantly on complex and noisy data. Investigating ways to reduce the number of tie breaks, we propose an adaptive method that overcomes the problem while not significantly affecting performance on simpler datasets.
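For readers unfamiliar with the procedure, the split test in a Hoeffding tree can be sketched roughly as follows. This is a minimal illustration of the standard Hoeffding bound with a fixed tie threshold, not the adaptive method the abstract proposes. The function names and the default values for `delta` and `tie_threshold` are assumptions chosen for the example.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Hoeffding bound epsilon after n observations.

    With probability 1 - delta, the true mean of a variable with the given
    range lies within epsilon of the mean observed over n samples.
    """
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def split_decision(best_gain: float, second_gain: float, n: int,
                   value_range: float = 1.0,
                   delta: float = 1e-7,          # assumed confidence parameter
                   tie_threshold: float = 0.05   # assumed tie threshold
                   ) -> str:
    """Return 'split', 'tie-break', or 'wait' for a leaf with n examples."""
    eps = hoeffding_bound(value_range, delta, n)
    if best_gain - second_gain > eps:
        # The best attribute is confidently better than the runner-up.
        return "split"
    if eps < tie_threshold:
        # The candidates are too close to separate, and more data is
        # unlikely to help: break the tie and split on the current best.
        return "tie-break"
    return "wait"


print(split_decision(0.30, 0.10, n=1000))
print(split_decision(0.30, 0.29, n=1000))
print(split_decision(0.30, 0.29, n=5000))
```

Because epsilon shrinks as `n` grows, any leaf whose top two attributes stay close will eventually fall below the threshold. This is one way a supposedly rare tie break can become the usual route to a split.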
Application of Advanced Early Warning Systems with Adaptive Protection
This project developed and field-tested two Adaptive Protection methods that use synchrophasor data. The first method detects conditions of system stress that can lead to unintended relay operation and initiates a supervisory signal to modify relay response in real time to avoid false trips. The second method detects the possibility of false trips of impedance relays as stable system swings "encroach" on the relays' impedance zones, and produces an early warning so that relay engineers can re-evaluate relay settings. In addition, real-time synchrophasor data produced by this project was used to develop advanced visualization techniques for displaying synchrophasor data to utility operators and engineers.
Activity recognition from smartphone sensing data
Integrated master's thesis. Informatics and Computing Engineering. Faculdade de Engenharia. Universidade do Porto. 201
Reservoir of Diverse Adaptive Learners and Stacking Fast Hoeffding Drift Detection Methods for Evolving Data Streams
The last decade has seen a surge of interest in adaptive learning algorithms
for data stream classification, with applications ranging from predicting ozone
level peaks, learning stock market indicators, to detecting computer security
violations. In addition, a number of methods have been developed to detect
concept drifts in these streams. Consider a scenario where we have a number of
classifiers with diverse learning styles and different drift detectors.
Intuitively, the current 'best' (classifier, detector) pair is application
dependent and may change as a result of the stream evolution. Our research
builds on this observation. We introduce the Tornado framework that
implements a reservoir of diverse classifiers, together with a variety of drift
detection algorithms. In our framework, all (classifier, detector) pairs
proceed, in parallel, to construct models against the evolving data streams. At
any point in time, we select the pair which currently yields the best
performance. We further incorporate two novel stacking-based drift detection
methods, namely the FHDDMS and FHDDMS_add approaches. The
experimental evaluation confirms that the current 'best' (classifier, detector)
pair is not only heavily dependent on the characteristics of the stream, but
also that this selection evolves as the stream flows. Further, our
FHDDMS variants detect concept drifts accurately in a timely fashion
while outperforming the state-of-the-art. Comment: 42 pages and 14 figures
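The idea of running every (classifier, detector) pair in parallel and selecting the current best can be sketched as below. This is a hypothetical illustration only: the class `PairScoreboard` and its methods are invented for this example and are not part of the Tornado framework. Plain running accuracy is an assumed stand-in for whatever performance score the framework actually uses.

```python
class PairScoreboard:
    """Tracks running accuracy for each (classifier, detector) pair."""

    def __init__(self):
        self.seen = {}     # pair -> number of predictions scored
        self.correct = {}  # pair -> number of correct predictions

    def record(self, pair, was_correct: bool) -> None:
        """Score one prediction made by the given pair."""
        self.seen[pair] = self.seen.get(pair, 0) + 1
        self.correct[pair] = self.correct.get(pair, 0) + int(was_correct)

    def accuracy(self, pair) -> float:
        n = self.seen.get(pair, 0)
        return self.correct.get(pair, 0) / n if n else 0.0

    def best(self):
        """Return the pair with the highest accuracy so far.

        Raises ValueError if no predictions have been recorded yet.
        """
        return max(self.seen, key=self.accuracy)


# Hypothetical pair names, used only to exercise the scoreboard.
board = PairScoreboard()
for outcome in (True, True, False, True):
    board.record(("NaiveBayes", "DDM"), outcome)
for outcome in (True, False, False, True):
    board.record(("HoeffdingTree", "FHDDMS"), outcome)
print(board.best())
```

Because the scores are updated after every instance, the selected pair can change as the stream evolves, which is the behavior the abstract reports.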
Scalable real-time classification of data streams with concept drift
Inducing adaptive predictive models in real time from high-throughput data streams is one of the most challenging areas of Big Data Analytics. Data streams are unbounded and may contain concept drifts (changes over time in the pattern encoded in the stream), which imposes unique challenges compared with predictive data mining from batch data. Several real-time predictive data stream algorithms exist; however, most approaches are not naturally parallel and are thus limited in their scalability. This paper highlights the Micro-Cluster Nearest Neighbour (MC-NN) data stream classifier. MC-NN is based on statistical summaries of the data stream and a nearest neighbour approach, which makes MC-NN naturally parallel. In its serial version, MC-NN is able to handle data streams: the data does not need to reside in memory and is processed incrementally. MC-NN is also able to adapt to concept drifts. This paper provides an empirical study of the serial algorithm's speed, adaptivity and accuracy. Furthermore, this paper discusses the new parallel implementation of MC-NN and its parallel properties, and provides an empirical scalability study.
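As a rough illustration of how statistical summaries can be combined with a nearest neighbour rule, consider the sketch below. It is a simplified assumption-laden example, not the MC-NN algorithm itself. Each micro-cluster keeps only a count and a per-feature linear sum, and the names `MicroCluster`, `learn_one`, and `predict_one` are hypothetical. Any further mechanisms of the published method, such as its handling of concept drift, are not modeled here.

```python
import math


class MicroCluster:
    """Summary of the instances of one class: count and per-feature sum."""

    def __init__(self, label, point):
        self.label = label
        self.n = 1
        self.linear_sum = list(point)

    def absorb(self, point) -> None:
        """Update the summary incrementally; the instance is then discarded."""
        self.n += 1
        for i, x in enumerate(point):
            self.linear_sum[i] += x

    def centroid(self):
        return [s / self.n for s in self.linear_sum]


def nearest(clusters, point):
    """Return the micro-cluster whose centroid is closest to the point."""
    return min(clusters, key=lambda c: math.dist(c.centroid(), point))


def learn_one(clusters, point, label) -> None:
    """Absorb the instance into the nearest same-class cluster, or start one."""
    same_class = [c for c in clusters if c.label == label]
    if same_class:
        nearest(same_class, point).absorb(point)
    else:
        clusters.append(MicroCluster(label, point))


def predict_one(clusters, point):
    """Predict the label of the nearest micro-cluster."""
    return nearest(clusters, point).label


clusters = []
learn_one(clusters, (0.0, 0.0), "a")
learn_one(clusters, (1.0, 1.0), "a")
learn_one(clusters, (10.0, 10.0), "b")
print(predict_one(clusters, (0.4, 0.6)))
```

Since each summary is updated independently of the others, the clusters could in principle be partitioned across workers, which hints at why such a design lends itself to parallel execution.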
Degree-based goodness-of-fit tests for heterogeneous random graph models : independent and exchangeable cases
The degrees are a classical and relevant way to study the topology of a
network. They can be used to assess the goodness-of-fit for a given random
graph model. In this paper we introduce goodness-of-fit tests for two classes
of models. First, we consider the case of independent graph models such as the
heterogeneous Erdős-Rényi model in which the edges have different
connection probabilities. Second, we consider a generic model for exchangeable
random graphs called the W-graph. The stochastic block model and the expected
degree distribution model fall within this framework. We prove the asymptotic
normality of the degree mean square under these independent and exchangeable
models and derive formal tests. We study the power of the proposed tests and we
prove the asymptotic normality under specific sparsity regimes. The tests are
illustrated on real networks from social sciences and ecology, and their
performance is assessed via a simulation study.
Symmetric Rank Covariances: a Generalised Framework for Nonparametric Measures of Dependence
The need to test whether two random vectors are independent has spawned a
large number of competing measures of dependence. We are interested in
nonparametric measures that are invariant under strictly increasing
transformations, such as Kendall's tau, Hoeffding's D, and the more recently
discovered Bergsma–Dassios sign covariance. Each of these measures exhibits
symmetries that are not readily apparent from their definitions. Making these
symmetries explicit, we define a new class of multivariate nonparametric
measures of dependence that we refer to as Symmetric Rank Covariances. This new
class generalises all of the above measures and leads naturally to multivariate
extensions of the Bergsma–Dassios sign covariance. Symmetric Rank Covariances
may be estimated unbiasedly using U-statistics for which we prove results on
computational efficiency and large-sample behavior. The algorithms we develop
for their computation include, to the best of our knowledge, the first
efficient algorithms for the well-known Hoeffding's D statistic in the
multivariate setting
Hoeffding Tree Algorithms for Anomaly Detection in Streaming Datasets: A Survey
This survey aims to deliver an extensive and well-structured overview of using machine learning to detect anomalies in streaming datasets. The objective is to assess the effectiveness of Hoeffding Trees as a machine learning solution for detecting anomalies in streaming cyber datasets. In this survey we group the existing research on Hoeffding Trees that is relevant to this type of study into three categories: distributed Hoeffding Trees, ensembles of Hoeffding Trees, and existing techniques that use Hoeffding Trees for anomaly detection. These categories are referred to as compositions within this paper and were selected based on their relation to streaming data and the flexibility of their techniques for use within different domains of streaming data. We discuss how combining the techniques from the research works within these compositions can address the anomaly detection problem in streaming cyber datasets. The goal is to show how a combination of techniques from different compositions can solve a prominent problem, anomaly detection.