3,054 research outputs found
Online semi-supervised learning in non-stationary environments
Existing Data Stream Mining (DSM) algorithms assume the availability of labelled and
balanced data, immediately or after some delay, to extract worthwhile knowledge from the
continuous and rapid data streams. However, in many real-world applications such as
Robotics, Weather Monitoring, Fraud Detection Systems, Cyber Security, and Computer
Network Traffic Flow, an enormous amount of high-speed data is generated by Internet of
Things sensors and real-time data on the Internet. Manual labelling of these data streams
is not practical due to time consumption and the need for domain expertise. Another
challenge is learning under Non-Stationary Environments (NSEs), which occurs due to
changes in the data distributions in a set of input variables and/or class labels. The problem
of Extreme Verification Latency (EVL) under NSEs is referred to as Initially Labelled Non-Stationary Environment (ILNSE). This is a challenging task because the learning algorithms
have no access to the true class labels directly when the concept evolves. Several approaches
exist that deal with NSE and EVL in isolation. However, few algorithms address both issues
simultaneously. This research responds directly to the ILNSE challenge by proposing two
novel algorithms: "Predictor for Streaming Data with Scarce Labels" (PSDSL) and the
Heterogeneous Dynamic Weighted Majority (HDWM) classifier. PSDSL is an Online Semi-Supervised Learning (OSSL) method for real-time DSM and is closely related to label
scarcity issues in online machine learning.
The key capabilities of PSDSL include learning from a small amount of labelled data in an
incremental or online manner and being available to predict at any time. To achieve this,
PSDSL utilises both labelled and unlabelled data to train the prediction models, meaning it
continuously learns from incoming data and updates the model as new labelled or
unlabelled data becomes available over time. Furthermore, it can predict under NSE
conditions when class labels are scarce. PSDSL is built on top of the HDWM classifier,
which preserves the diversity of the classifiers. PSDSL and HDWM can intelligently switch
and adapt to the conditions. PSDSL switches its learning state between self-learning,
micro-clustering and CGC, whichever approach is beneficial, based on the characteristics of
the data stream. HDWM makes use of "seed" learners of different types in an ensemble to
maintain its diversity. An ensemble is simply a combination of predictive models
grouped to improve on the predictive performance of a single classifier.
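The weighted-majority idea behind such an ensemble can be illustrated with a minimal sketch. This is not the HDWM implementation from the thesis: the three rule-based "seed" learners, the `beta` penalty and the toy stream are hypothetical, and the sketch omits the creation and removal of learners that dynamic variants perform.

```python
from collections import defaultdict

def weighted_vote(experts, weights, x):
    """Return the label with the largest total weight among the expert predictions."""
    totals = defaultdict(float)
    for expert, w in zip(experts, weights):
        totals[expert(x)] += w
    return max(totals, key=totals.get)

def update_weights(experts, weights, x, y_true, beta=0.5):
    """Multiply the weight of each expert that mispredicts x by beta, then rescale so the largest weight is 1."""
    new = [w * beta if expert(x) != y_true else w
           for expert, w in zip(experts, weights)]
    top = max(new)
    return [w / top for w in new]

# Hypothetical "seed" learners of different types, reduced to simple rules.
experts = [
    lambda x: int(x[0] > 0.5),
    lambda x: int(x[1] > 0.5),
    lambda x: int(x[0] + x[1] > 1.0),
]
weights = [1.0, 1.0, 1.0]

# Toy labelled stream: predict first, then learn from the revealed label.
stream = [((0.9, 0.2), 1), ((0.8, 0.3), 1), ((0.2, 0.9), 0)]
for x, y in stream:
    prediction = weighted_vote(experts, weights, x)
    weights = update_weights(experts, weights, x, y)
```

After the three instances, the learner that was wrong every time carries the smallest weight, so the vote is dominated by the learners that tracked the stream best.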
PSDSL is empirically evaluated against COMPOSE, LEVELIW, SCARGC and MClassification
on benchmarks, NSE datasets as well as Massive Online Analysis (MOA) data streams and real-world datasets. The results showed that PSDSL performed significantly better than
existing approaches on most real-time data streams including randomised data instances.
PSDSL performed significantly better than "Static", i.e. a classifier that is not updated after it is
trained with the first examples in the data stream. When applied to MOA-generated data
streams, PSDSL ranked highest (1.5) and thus performed significantly better than SCARGC,
while SCARGC performed the same as Static. PSDSL achieved better average prediction
accuracies in a short time than SCARGC.
The HDWM algorithm is evaluated on artificial and real-world data streams against existing
well-known approaches such as the heterogeneous WMA and the homogeneous Dynamic
Weighted Majority (DWM) algorithm. The results showed that HDWM performed significantly better than WMA
and DWM. Also, when recurring concept drifts were present, the predictive performance of
HDWM showed an improvement over DWM. In both drift and real-world streams,
significance tests and post hoc comparisons found significant differences between
algorithms: HDWM performed significantly better than DWM and WMA when applied to
MOA data streams and four real-world datasets (Electric, Spam, Sensor and Forest cover). The
seeding mechanism and dynamic inclusion of new base learners in the HDWM algorithm
benefit from the use of both forgetting and retaining the models. The algorithm also
provides the freedom to select the optimal base classifier in its ensemble depending
on the problem.
A new approach, Envelope-Clustering, is introduced to resolve cluster overlap conflicts
during the cluster labelling process. In this process, PSDSL transforms the centroids'
information of micro-clusters into micro-instances and generates new clusters called
Envelopes. The nearest envelope clusters assist the conflicted micro-clusters and
successfully guide the cluster labelling process after concept drifts in the absence of true
class labels. PSDSL has been evaluated on the real-world problem of "keystroke dynamics", and
the results show that PSDSL achieved higher prediction accuracy (85.3%) than SCARGC
(81.6%), while the performance of Static (49.0%) degrades significantly due to changes in
the users' typing patterns. Furthermore, the predictive accuracies of SCARGC fluctuate
widely (41.1% to 81.6%) depending on the value of the parameter "k"
(number of clusters), while PSDSL automatically determines the best value for this
parameter.
Medical Image Analysis using Deep Relational Learning
In the past ten years, with the help of deep learning, especially the rapid
development of deep neural networks, medical image analysis has made remarkable
progress. However, how to effectively use the relational information between
various tissues or organs in medical images is still a very challenging
problem, and it has not been fully studied. In this thesis, we propose two
novel solutions to this problem based on deep relational learning. First, we
propose a context-aware fully convolutional network that effectively models
implicit relation information between features to perform medical image
segmentation. The network achieves the state-of-the-art segmentation results on
the Multi Modal Brain Tumor Segmentation 2017 (BraTS2017) and Multi Modal Brain
Tumor Segmentation 2018 (BraTS2018) data sets. Subsequently, we propose a new
hierarchical homography estimation network to achieve accurate medical image
mosaicing by learning the explicit spatial relationship between adjacent
frames. We use the UCL Fetoscopy Placenta dataset to conduct experiments and
our hierarchical homography estimation network outperforms the other
state-of-the-art mosaicing methods while generating robust and meaningful
mosaicing results on unseen frames.
Comment: arXiv admin note: substantial text overlap with arXiv:2007.0778
From Human Behavior to Machine Behavior
A core pursuit of artificial intelligence is the comprehension of human behavior. Imbuing intelligent agents with a good human behavior model can help them understand how to behave intelligently and interactively in complex situations. Due to the increase in data availability and computational resources, the development of machine learning algorithms for duplicating human cognitive abilities has made rapid progress. To solve difficult scenarios, learning-based methods must search for solutions in a predefined but large space. Along with implementing a smart exploration strategy, the right representation for a task can help narrow the search process during learning. This dissertation tackles three important aspects of machine intelligence: 1) prediction, 2) exploration, and 3) representation. More specifically, we develop new algorithms for 1) predicting the future maneuvers or outcomes in pilot training and computer architecture applications; 2) exploration strategies for reinforcement learning in game environments; and 3) scene representations for autonomous driving agents capable of handling large numbers of dynamic entities. This dissertation makes the following research contributions in the area of representation learning. First, we introduce a new time series representation for flight trajectories in intelligent pilot training simulations. Second, we demonstrate a method, Temporally Aware Embedding (TAE), for learning an embedding that leverages temporal information extracted from data retrieval series. Third, the dissertation introduces GRAD (Graph Representation for Autonomous Driving), which incorporates the future location of neighboring vehicles into the decision-making process. We demonstrate the usage of our models for pilot training, cache usage prediction, and autonomous driving; however, we believe that our new time series representations can be applied to many other types of modeling problems.
DynED: Dynamic Ensemble Diversification in Data Stream Classification
Ensemble methods are commonly used in classification due to their remarkable
performance. Achieving high accuracy in a data stream environment is a
challenging task considering disruptive changes in the data distribution, also
known as concept drift. A greater diversity of ensemble components is known to
enhance prediction accuracy in such settings. Despite the diversity of
components within an ensemble, not all contribute as expected to its overall
performance. This necessitates a method for selecting components that exhibit
high performance and diversity. We present a novel ensemble construction and
maintenance approach based on MMR (Maximal Marginal Relevance) that dynamically
combines the diversity and prediction accuracy of components during the process
of structuring an ensemble. The experimental results on four real and 11
synthetic datasets demonstrate that the proposed approach (DynED) provides a
higher average mean accuracy compared to the five state-of-the-art baselines.
Comment: Proceedings of the 32nd ACM International Conference on Information
and Knowledge Management (CIKM '23), October 21--25, 2023, Birmingham, United
Kingdom
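How Maximal Marginal Relevance can trade accuracy against diversity may be easier to see in a small sketch. This is a generic MMR selection, not the DynED procedure itself: the agreement-based similarity measure, the `lam` trade-off value and the toy component predictions are all assumptions made for illustration.

```python
def agreement(a, b):
    """Fraction of positions on which two label sequences agree."""
    return sum(p == q for p, q in zip(a, b)) / len(a)

def mmr_select(predictions, accuracies, k, lam=0.5):
    """Greedily pick k components, rewarding accuracy and penalising
    similarity to the components that have already been picked."""
    selected = []
    candidates = set(predictions)
    while candidates and len(selected) < k:
        def score(name):
            redundancy = max(
                (agreement(predictions[name], predictions[s]) for s in selected),
                default=0.0,
            )
            return lam * accuracies[name] - (1 - lam) * redundancy
        # sorted() makes tie-breaking deterministic
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Hypothetical predictions of three components on eight validation instances.
truth = [1, 1, 0, 0, 1, 0, 1, 0]
predictions = {
    "A": [1, 1, 0, 0, 1, 0, 1, 1],
    "B": [1, 1, 0, 0, 1, 0, 0, 1],  # as accurate as C, but nearly a copy of A
    "C": [1, 0, 0, 1, 1, 0, 1, 0],  # as accurate as B, but errs on other instances
}
accuracies = {name: agreement(p, truth) for name, p in predictions.items()}
chosen = mmr_select(predictions, accuracies, k=2)
```

Here B and C are equally accurate, yet C is preferred as the second member because it disagrees with A more often and therefore adds more diversity.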
Investigating the learning potential of the Second Quantum Revolution: development of an approach for secondary school students
In recent years we have witnessed important changes: the Second Quantum Revolution is in the spotlight of many countries, and it is creating a new generation of technologies.
To unlock the potential of the Second Quantum Revolution, several countries have launched strategic plans and research programs that finance and set the pace of research and development of these new technologies (like the Quantum Flagship, the National Quantum Initiative Act and so on).
The increasing pace of technological changes is also challenging science education and institutional systems, requiring them to help to prepare new generations of experts.
This work is placed within physics education research and contributes to the challenge by developing an approach and a course about the Second Quantum Revolution. The aims are to promote quantum literacy and, in particular, to value the Second Quantum Revolution from a cultural and educational perspective.
The dissertation is articulated in two parts. In the first, we unpack the Second Quantum Revolution from a cultural perspective and shed light on the main revolutionary aspects that are elevated to the rank of principles implemented in the design of a course for secondary school students, prospective and in-service teachers. The design process and the educational reconstruction of the activities are presented as well as the results of a pilot study conducted to investigate the impact of the approach on students' understanding and to gather feedback to refine and improve the instructional materials.
The second part consists of the exploration of the Second Quantum Revolution as a context to introduce some basic concepts of quantum physics. We present the results of an implementation with secondary school students to investigate if and to what extent external representations could play any role to promote students' understanding and acceptance of quantum physics as a personal reliable description of the world.
SMOClust: Synthetic Minority Oversampling based on Stream Clustering for Evolving Data Streams
Many real-world data stream applications not only suffer from concept drift
but also class imbalance. Yet, very few existing studies investigated this
joint challenge. Data difficulty factors, which have been shown to be key
challenges in class imbalanced data streams, are not taken into account by
existing approaches when learning class imbalanced data streams. In this work,
we propose a drift adaptable oversampling strategy to synthesise minority class
examples based on stream clustering. The motivation is that stream clustering
methods continuously update themselves to reflect the characteristics of the
current underlying concept, including data difficulty factors. This nature can
potentially be used to compress past information without caching data in the
memory explicitly. Based on the compressed information, synthetic examples can
be created within the region that recently generated new minority class
examples. Experiments with artificial and real-world data streams show that the
proposed approach can handle concept drift involving different minority class
decomposition better than existing approaches, especially when the data stream
is severely class imbalanced and presents high proportions of safe and
borderline minority class examples.
Comment: 59 pages, 85 figures
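The idea of generating synthetic minority examples from a compressed cluster summary, rather than from cached instances, can be sketched as follows. This is not the SMOClust algorithm: the `MicroCluster` summary (count, linear sum, squared sum) and the uniform sampling within one standard deviation of the centroid are assumptions chosen to keep the example short.

```python
import math
import random

class MicroCluster:
    """Running summary of absorbed points: count, linear sum, squared sum."""

    def __init__(self, dim):
        self.n = 0
        self.ls = [0.0] * dim
        self.ss = [0.0] * dim

    def absorb(self, x):
        """Update the summary with one point; the point itself is not stored."""
        self.n += 1
        for i, v in enumerate(x):
            self.ls[i] += v
            self.ss[i] += v * v

    def centroid(self):
        return [s / self.n for s in self.ls]

    def std(self):
        """Per-dimension standard deviation recovered from the two sums."""
        return [
            math.sqrt(max(q / self.n - (s / self.n) ** 2, 0.0))
            for s, q in zip(self.ls, self.ss)
        ]

def synthesise(cluster, count, rng):
    """Draw synthetic points uniformly within one standard deviation of the centroid."""
    centre, spread = cluster.centroid(), cluster.std()
    return [
        [rng.uniform(m - d, m + d) for m, d in zip(centre, spread)]
        for _ in range(count)
    ]

rng = random.Random(0)
minority = MicroCluster(dim=2)
for point in [(1.0, 2.0), (1.2, 2.2), (0.8, 1.8)]:  # recent minority-class examples
    minority.absorb(point)
synthetic = synthesise(minority, count=5, rng=rng)
```

Because only the three running sums are kept, the summary follows the stream without storing past examples, which is the property the abstract highlights.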
What Makes a Habitat a Home: Understanding Settlement and Recruitment Variation in European Sea Bass, Dicentrarchus labrax
Sea bass stocks in the UK are in decline as a result of increased fishing pressure and variable inter-annual recruitment. Recruitment variation is driven by survival in the early life stages; therefore, nursery habitats are thought to be able to stabilize recruitment through providing optimal growth conditions for juvenile fish. A thorough understanding of the factors that drive juvenile sea bass survival is needed; however, our understanding of what constitutes quality nursery habitat for juvenile sea bass is weak, with current knowledge based almost solely on saltmarshes. Juvenile sea bass were sampled using conventional seine and fyke nets across estuarine habitats, alongside dietary DNA metabarcoding, to assess their distribution, diet and condition using measures of abundance, condition, stomach fullness, and diet. To determine whether the mechanism of larvae entering estuarine nurseries is an active or passive process, the vertical distribution patterns of larval sea bass were compared across tidal cycles. Finally, over-winter survival was predicted based on energy budget modelling and temperature-dependent growth experiments, based on in-situ measurements of winter temperatures. Juvenile sea bass did not differentially select high tide habitats, but saltmarshes and sand provided increased foraging success. At low tide, however, sea bass were more abundant in complex habitat with lower foraging success. Diets mainly consisted of decapods and polychaete worms across habitats, but there was evidence of increased planktivory over mud. Larval sea bass did not show evidence of flood tide transport and likely rely on passive tidal forcing to migrate into estuaries, or they may be trying to remain in deeper water. According to our models, winter thermal minima resulted in complete cohort loss in all scenarios on the East coast.
The results of this study suggest that multiple habitats along the estuarine mosaic are important for juvenile sea bass at some point, and that a seascape approach to management is necessary; however, winter temperatures likely present a more extreme bottleneck to recruitment.
Mining Butterflies in Streaming Graphs
This thesis introduces two main-memory systems sGrapp and sGradd for performing the fundamental analytic tasks of biclique counting and concept drift detection over a streaming graph. A data-driven heuristic is used to architect the systems. To this end, initially, the growth patterns of bipartite streaming graphs are mined and the emergence principles of streaming motifs are discovered. Next, the discovered principles are (a) explained by a graph generator called sGrow; and (b) utilized to establish the requirements for efficient, effective, explainable, and interpretable management and processing of streams. sGrow is used to benchmark stream analytics, particularly in the case of concept drift detection.
sGrow displays robust realization of streaming growth patterns independent of initial conditions, scale and temporal characteristics, and model configurations. Extensive evaluations confirm the simultaneous effectiveness and efficiency of sGrapp and sGradd. sGrapp achieves a mean absolute percentage error of up to 0.05/0.14 for the cumulative butterfly count in streaming graphs with uniform/non-uniform temporal distribution and a processing throughput of 1.5 million data records per second. The throughput and estimation error of sGrapp are 160x higher and 0.02x lower than baselines. sGradd demonstrates an improving performance over time, achieves zero false detection rates when there is no drift and when drift is already detected, and detects sequential drifts in zero to a few seconds after their occurrence regardless of drift intervals.
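For readers unfamiliar with the quantity being estimated, a butterfly is a complete 2x2 bipartite subgraph (two vertices on each side, all four edges present). The sketch below is an exact, in-memory counter over a static edge list. It is not sGrapp, which works approximately over a stream, and the vertex names and edge list are hypothetical.

```python
from itertools import combinations

def count_butterflies(edges):
    """Count 2x2 bicliques in a bipartite graph given as (left, right) edges.

    Every pair of left vertices with c common right neighbours
    contributes c * (c - 1) / 2 butterflies.
    """
    neighbours = {}
    for left, right in edges:
        neighbours.setdefault(left, set()).add(right)
    total = 0
    for a, b in combinations(neighbours, 2):
        common = len(neighbours[a] & neighbours[b])
        total += common * (common - 1) // 2
    return total

# Hypothetical bipartite edge list, for example user-item interactions.
edges = [
    ("u1", "v1"), ("u1", "v2"), ("u1", "v3"),
    ("u2", "v1"), ("u2", "v2"), ("u2", "v3"),
    ("u3", "v1"),
]
butterflies = count_butterflies(edges)
```

The pairwise comparison over left vertices grows quadratically, which hints at why approximate streaming estimators are attractive for large graphs.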
Conformance Checking-based Concept Drift Detection in Process Mining
One of the main challenges of process mining is to obtain
models that represent a process as simply and accurately as
possible. Both characteristics can be greatly influenced by
changes in the control flow of the process throughout its life
cycle.
In this thesis we propose the use of conformance metrics to
monitor such changes in a way that allows the division of the
log into sub-logs representing different versions of the process
over time. The validity of the hypothesis has been formally
demonstrated, showing that all kinds of changes in the process
flow can be captured using these approaches, including
sudden and gradual drifts in both clean and noisy environments,
where differentiating between anomalous executions and real
changes can be tricky.
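One way to picture conformance-based drift detection is to score each trace against a reference model and split the log where the score stays low. The sketch below is only an illustration under strong assumptions: the "model" is a set of allowed directly-follows pairs, and the `fitness` measure, `window` size and `threshold` are hypothetical choices, not the metrics used in the thesis.

```python
def fitness(trace, allowed):
    """Share of directly-follows pairs in the trace that the reference model allows."""
    pairs = list(zip(trace, trace[1:]))
    if not pairs:
        return 1.0
    return sum(pair in allowed for pair in pairs) / len(pairs)

def find_drift(log, allowed, window=2, threshold=0.5):
    """Return the start index of the first window of consecutive traces whose
    mean fitness falls below the threshold, or None if no window does."""
    scores = []
    for i, trace in enumerate(log):
        scores.append(fitness(trace, allowed))
        recent = scores[-window:]
        if len(recent) == window and sum(recent) / window < threshold:
            return i - window + 1
    return None

# Reference model: activities a, b, c, d executed in sequence.
allowed = {("a", "b"), ("b", "c"), ("c", "d")}

# The process changes after four traces: b and c swap places.
drifting_log = [list("abcd")] * 4 + [list("acbd")] * 4
cut = find_drift(drifting_log, allowed)
old_version, new_version = drifting_log[:cut], drifting_log[cut:]

# A single anomalous execution does not trigger a split.
noisy_log = [list("abcd"), list("acbd"), list("abcd"), list("abcd")]
no_cut = find_drift(noisy_log, allowed)
```

Averaging over a window is what lets the sketch ignore one deviating trace while still reacting to a sustained change in the control flow.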