3,618 research outputs found
Learning from Data Streams with Randomized Forests
Non-stationary streaming data poses a familiar challenge in machine learning: the need to
obtain fast and accurate predictions. A data stream is a continuously generated sequence of
data, with data typically arriving rapidly. They are often characterised by a non-stationary
generative process, with concept drift occurring as the process changes. Such processes are
commonly seen in the real world, such as in advertising, shopping trends, environmental
conditions, electricity monitoring and traffic monitoring.
Typical stationary algorithms are ill-suited for use with concept drifting data, thus necessitating
more targeted methods. Tree-based methods are a popular approach to this problem,
traditionally focussing on the use of the Hoeffding bound in order to guarantee performance
relative to a stationary scenario. However, there are limited single learners available for
regression scenarios, and those that do exist often struggle to choose between similarly
discriminative splits, leading to longer training times and worse performance. This limited
pool of single learners in turn hampers the performance of ensemble approaches in which
they act as base learners.
In this thesis we seek to remedy this gap in the literature, developing methods which
focus on increasing randomization to both improve predictive performance and reduce the
training times of tree-based ensemble methods. In particular, we have chosen to investigate
the use of randomization as it is known to be able to improve generalization error in
ensembles, and is also expected to lead to fast training times, thus being a natural method
of handling the problems typically experienced by single learners.
We begin in a regression scenario, introducing the Adaptive Trees for Streaming with
Extreme Randomization (ATSER) algorithm; a partially randomized approach based on
the concept of Extremely Randomized (extra) trees. The ATSER algorithm incrementally
trains trees, using the Hoeffding bound to select the best of a random selection of splits.
Simultaneously, the trees also detect and adapt to changes in the data stream. Unlike many
traditional streaming algorithms ATSER trees can easily be extended to include nominal
features. We find that compared to other contemporary methods ensembles of ATSER
trees lead to improved predictive performance whilst also reducing run times.
We then demonstrate the Adaptive Categorisation Trees for Streaming with Extreme
Randomization (ACTSER) algorithm, an adaption of the ATSER algorithm to the more
traditional categorization scenario, again showing improved predictive performance and
reduced runtimes. The inclusion of nominal features is particularly novel in this setting
since typical categorization approaches struggle to handle them.
Finally we examine a completely randomized scenario, where an ensemble of trees is generated
prior to having access to the data stream, while also considering multivariate splits
in addition to the traditional axis-aligned approach. We find that through the combination
of a forgetting mechanism in linear models and dynamic weighting for ensemble members,
we are able to avoid explicitly testing for concept drift. This leads to fast ensembles
with strong predictive performance, whilst also requiring fewer parameters than other
contemporary methods.
For each of the proposed methods in this thesis, we demonstrate empirically that they are
effective over a variety of different non-stationary data streams, including on multiple
types of concept drift. Furthermore, in comparison to other contemporary data streaming
algorithms, we find the biggest improvements in performance are on noisy data streams.Engineers Gat
Bayesian coherent analysis of in-spiral gravitational wave signals with a detector network
The present operation of the ground-based network of gravitational-wave laser
interferometers in "enhanced" configuration brings the search for gravitational
waves into a regime where detection is highly plausible. The development of
techniques that allow us to discriminate a signal of astrophysical origin from
instrumental artefacts in the interferometer data and to extract the full range
of information are some of the primary goals of the current work. Here we
report the details of a Bayesian approach to the problem of inference for
gravitational wave observations using a network of instruments, for the
computation of the Bayes factor between two hypotheses and the evaluation of
the marginalised posterior density functions of the unknown model parameters.
The numerical algorithm to tackle the notoriously difficult problem of the
evaluation of large multi-dimensional integrals is based on a technique known
as Nested Sampling, which provides an attractive alternative to more
traditional Markov-chain Monte Carlo (MCMC) methods. We discuss the details of
the implementation of this algorithm and its performance against a Gaussian
model of the background noise, considering the specific case of the signal
produced by the in-spiral of binary systems of black holes and/or neutron
stars, although the method is completely general and can be applied to other
classes of sources. We also demonstrate the utility of this approach by
introducing a new coherence test to distinguish between the presence of a
coherent signal of astrophysical origin in the data of multiple instruments and
the presence of incoherent accidental artefacts, and the effects on the
estimation of the source parameters as a function of the number of instruments
in the network.Comment: 22 page
Intelligent network intrusion detection using an evolutionary computation approach
With the enormous growth of users\u27 reliance on the Internet, the need for secure and reliable computer networks also increases. Availability of effective automatic tools for carrying out different types of network attacks raises the need for effective intrusion detection systems.
Generally, a comprehensive defence mechanism consists of three phases, namely, preparation, detection and reaction. In the preparation phase, network administrators aim to find and fix security vulnerabilities (e.g., insecure protocol and vulnerable computer systems or firewalls), that can be exploited to launch attacks. Although the preparation phase increases the level of security in a network, this will never completely remove the threat of network attacks. A good security mechanism requires an Intrusion Detection System (IDS) in order to monitor security breaches when the prevention schemes in the preparation phase are bypassed. To be able to react to network attacks as fast as possible, an automatic detection system is of paramount importance. The later an attack is detected, the less time network administrators have to update their signatures and reconfigure their detection and remediation systems. An IDS is a tool for monitoring the system with the aim of detecting and alerting intrusive activities in networks. These tools are classified into two major categories of signature-based and anomaly-based. A signature-based IDS stores the signature of known attacks in a database and discovers occurrences of attacks by monitoring and comparing each communication in the network against the database of signatures. On the other hand, mechanisms that deploy anomaly detection have a model of normal behaviour of system and any significant deviation from this model is reported as anomaly.
This thesis aims at addressing the major issues in the process of developing signature based IDSs. These are: i) their dependency on experts to create signatures, ii) the complexity of their models, iii) the inflexibility of their models, and iv) their inability to adapt to the changes in the real environment and detect new attacks. To meet the requirements of a good IDS, computational intelligence methods have attracted considerable interest from the research community.
This thesis explores a solution to automatically generate compact rulesets for network intrusion detection utilising evolutionary computation techniques. The proposed framework is called ESR-NID (Evolving Statistical Rulesets for Network Intrusion Detection). Using an interval-based structure, this method can be deployed for any continuous-valued input data. Therefore, by choosing appropriate statistical measures (i.e. continuous-valued features) of network trafc as the input to ESRNID, it can effectively detect varied types of attacks since it is not dependent on the signatures of network packets.
In ESR-NID, several innovations in the genetic algorithm were developed to keep the ruleset small. A two-stage evaluation component in the evolutionary process takes the cooperation of rules into consideration and results into very compact, easily understood rulesets. The effectiveness of this approach is evaluated against several sources of data for both detection of normal and abnormal behaviour. The results are found to be comparable to those achieved using other machine learning methods from both categories of GA-based and non-GA-based methods. One of the significant advantages of ESR-NIS is that it can be tailored to specific problem domains and the characteristics of the dataset by the use of different fitness and performance functions. This makes the system a more flexible model compared to other learning techniques. Additionally, an IDS must adapt itself to the changing environment with the least amount of configurations. ESR-NID uses an incremental learning approach as new flow of traffic become available. The incremental learning approach benefits from less required storage because it only keeps the generated rules in its database. This is in contrast to the infinitely growing size of repository of raw training data required for traditional learning
Genetic programming applied to morphological image processing
This thesis presents three approaches to the automatic design of algorithms for the processing of binary images based on the Genetic Programming (GP) paradigm. In the first approach the algorithms are designed using the basic Mathematical Morphology (MM) operators, i.e. erosion and dilation, with a variety of Structuring Elements (SEs). GP is used to design algorithms to convert a binary image into another containing just a particular characteristic of interest. In the study we have tested two similarity fitness functions, training sets with different numbers of elements and different sizes of the training images over three different objectives. The results of the first approach showed some success in the evolution of MM algorithms but also identifed problems with the amount of computational resources the method required. The second approach uses Sub-Machine-Code GP (SMCGP) and bitwise operators as an attempt to speed-up the evolution of the algorithms and to make them both feasible and effective. The SMCGP approach was successful in the speeding up of the computation but it was not successful in improving the quality of the obtained algorithms. The third approach presents the combination of logical and morphological operators in an attempt to improve the quality of the automatically designed algorithms. The results obtained provide empirical evidence showing that the evolution of high quality MM algorithms using GP is possible and that this technique has a broad potential that should be explored further. This thesis includes an analysis of the potential of GP and other Machine Learning techniques for solving the general problem of Signal Understanding by means of exploring Mathematical Morphology
JUNIPR: a Framework for Unsupervised Machine Learning in Particle Physics
In applications of machine learning to particle physics, a persistent
challenge is how to go beyond discrimination to learn about the underlying
physics. To this end, a powerful tool would be a framework for unsupervised
learning, where the machine learns the intricate high-dimensional contours of
the data upon which it is trained, without reference to pre-established labels.
In order to approach such a complex task, an unsupervised network must be
structured intelligently, based on a qualitative understanding of the data. In
this paper, we scaffold the neural network's architecture around a
leading-order model of the physics underlying the data. In addition to making
unsupervised learning tractable, this design actually alleviates existing
tensions between performance and interpretability. We call the framework
JUNIPR: "Jets from UNsupervised Interpretable PRobabilistic models". In this
approach, the set of particle momenta composing a jet are clustered into a
binary tree that the neural network examines sequentially. Training is
unsupervised and unrestricted: the network could decide that the data bears
little correspondence to the chosen tree structure. However, when there is a
correspondence, the network's output along the tree has a direct physical
interpretation. JUNIPR models can perform discrimination tasks, through the
statistically optimal likelihood-ratio test, and they permit visualizations of
discrimination power at each branching in a jet's tree. Additionally, JUNIPR
models provide a probability distribution from which events can be drawn,
providing a data-driven Monte Carlo generator. As a third application, JUNIPR
models can reweight events from one (e.g. simulated) data set to agree with
distributions from another (e.g. experimental) data set.Comment: 37 pages, 24 figure
The Next Generation Virgo Cluster Survey - Infrared (NGVS-IR): I. A new Near-UV/Optical/Near-IR Globular Cluster selection tool
The NGVS-IR project (Next Generation Virgo Survey - Infrared) is a contiguous
near-infrared imaging survey of the Virgo cluster of galaxies. It complements
the optical wide-field survey of Virgo (NGVS). The current state of NGVS-IR
consists of Ks-band imaging of 4 deg^2 centered on M87, and J and Ks-band
imaging of 16 deg^2 covering the region between M49 and M87. In this paper, we
present the observations of the central 4 deg^2 centered on Virgo's core
region. The data were acquired with WIRCam on the Canada-France-Hawaii
Telescope and the total integration time was 41 hours distributed in 34
contiguous tiles. A survey-specific strategy was designed to account for
extended galaxies while still measuring accurate sky brightness within the
survey area. The average 5\sigma limiting magnitude is Ks=24.4 AB mag and the
50% completeness limit is Ks=23.75 AB mag for point source detections, when
using only images with better than 0.7" seeing (median seeing 0.54"). Star
clusters are marginally resolved in these image stacks, and Virgo galaxies with
\mu_Ks=24.4 AB mag arcsec^-2 are detected. Combining the Ks data with optical
and ultraviolet data, we build the uiK color-color diagram which allows a very
clean color-based selection of globular clusters in Virgo. This diagnostic plot
will provide reliable globular cluster candidates for spectroscopic follow-up
campaigns needed to continue the exploration of Virgo's photometric and
kinematic sub-structures, and will help the design of future searches for
globular clusters in extragalactic systems. Equipped with this powerful new
tool, future NGVS-IR investigations based on the uiK diagram will address the
mapping and analysis of extended structures and compact stellar systems in and
around Virgo galaxies.Comment: 23 pages, 18 figures. Accepted for publication in ApJ
Metabolomics : a tool for studying plant biology
In recent years new technologies have allowed gene expression, protein and metabolite profiles in different tissues and developmental stages to be monitored. This is an emerging field in plant science and is applied to diverse plant systems in order to elucidate the regulation of growth and development. The goal in plant metabolomics is to analyze, identify and quantify all low molecular weight molecules of plant organisms. The plant metabolites are extracted and analyzed using various sensitive analytical techniques, usually mass spectrometry (MS) in combination with chromatography. In order to compare the metabolome of different plants in a high through-put manner, a number of biological, analytical and data processing steps have to be performed. In the work underlying this thesis we developed a fast and robust method for routine analysis of plant metabolite patterns using Gas Chromatography-Mass Spectrometry (GC/MS). The method was performed according to Design of Experiment (DOE) to investigate factors affecting the extraction and derivatization of the metabolites from leaves of the plant Arabidopsis thaliana. The outcome of metabolic analysis by GC/MS is a complex mixture of approximately 400 overlapping peaks. Resolving (deconvoluting) overlapping peaks is time-consuming, difficult to automate and additional processing is needed in order to compare samples. To avoid deconvolution being a major bottleneck in high through-put analyses we developed a new semi-automated strategy using hierarchical methods for processing GC/MS data that can be applied to all samples simultaneously. The two methods include base-line correction of the non-processed MS-data files, alignment, time-window determinations, Alternating Regression and multivariate analysis in order to detect metabolites that differ in relative concentrations between samples. The developed methodology was applied to study the effects of the plant hormone GA on the metabolome, with specific emphasis on auxin levels in Arabidopsis thaliana mutants defective in GA biosynthesis and signalling. A large series of plant samples was analysed and the resulting data were processed in less than one week with minimal labour; similar to the time required for the GC/MS analyses of the samples
Probabilistic Models for Joint Segmentation, Detection and Tracking
Migrace buněk a buněčných částic hraje důležitou roli ve fungování živých organismů. Systematický výzkum buněčné migrace byl umožněn v posledních dvaceti letech rychlým rozvojem neinvazivních zobrazovacích technik a digitálních snímačů. Moderní zobrazovací systémy dovolují studovat chování buněčných populací složených z mnoha ticíců buněk. Manuální analýza takového množství dat by byla velice zdlouhavá, protože některé experimenty vyžadují analyzovat tvar, rychlost a další charakteristiky jednotlivých buněk. Z tohoto důvodu je ve vědecké komunitě velká poptávka po automatických metodách.Migration of cells and subcellular particles plays a crucial role in many processes in living organisms. Despite its importance a systematic research of cell motility has only been possible in last two decades due to rapid development of non-invasive imaging techniques and digital cameras. Modern imaging systems allow to study large populations with thousands of cells. Manual analysis of the acquired data is infeasible, because in order to gain insight into underlying biochemical processes it is sometimes necessary to determine shape, velocity and other characteristics of individual cells. Thus there is a high demand for automatic methods
Low-frequency gravitational-wave science with eLISA/NGO
We review the expected science performance of the New Gravitational-Wave
Observatory (NGO, a.k.a. eLISA), a mission under study by the European Space
Agency for launch in the early 2020s. eLISA will survey the low-frequency
gravitational-wave sky (from 0.1 mHz to 1 Hz), detecting and characterizing a
broad variety of systems and events throughout the Universe, including the
coalescences of massive black holes brought together by galaxy mergers; the
inspirals of stellar-mass black holes and compact stars into central galactic
black holes; several millions of ultracompact binaries, both detached and mass
transferring, in the Galaxy; and possibly unforeseen sources such as the relic
gravitational-wave radiation from the early Universe. eLISA's high
signal-to-noise measurements will provide new insight into the structure and
history of the Universe, and they will test general relativity in its
strong-field dynamical regime.Comment: 20 pages, 8 figures, proceedings of the 9th Amaldi Conference on
Gravitational Waves. Final journal version. For a longer exposition of the
eLISA science case, see http://arxiv.org/abs/1201.362
- …