3,054 research outputs found

    Online semi-supervised learning in non-stationary environments

    Get PDF
    Existing Data Stream Mining (DSM) algorithms assume the availability of labelled and balanced data, immediately or after some delay, to extract worthwhile knowledge from the continuous and rapid data streams. However, in many real-world applications such as Robotics, Weather Monitoring, Fraud Detection Systems, Cyber Security, and Computer Network Traffic Flow, an enormous amount of high-speed data is generated by Internet of Things sensors and real-time data on the Internet. Manual labelling of these data streams is not practical due to time consumption and the need for domain expertise. Another challenge is learning under Non-Stationary Environments (NSEs), which occurs due to changes in the data distributions in a set of input variables and/or class labels. The problem of Extreme Verification Latency (EVL) under NSEs is referred to as Initially Labelled Non-Stationary Environment (ILNSE). This is a challenging task because the learning algorithms have no access to the true class labels directly when the concept evolves. Several approaches exist that deal with NSE and EVL in isolation. However, few algorithms address both issues simultaneously. This research directly responds to ILNSE’s challenge in proposing two novel algorithms “Predictor for Streaming Data with Scarce Labels” (PSDSL) and Heterogeneous Dynamic Weighted Majority (HDWM) classifier. PSDSL is an Online Semi-Supervised Learning (OSSL) method for real-time DSM and is closely related to label scarcity issues in online machine learning. The key capabilities of PSDSL include learning from a small amount of labelled data in an incremental or online manner and being available to predict at any time. To achieve this, PSDSL utilises both labelled and unlabelled data to train the prediction models, meaning it continuously learns from incoming data and updates the model as new labelled or unlabelled data becomes available over time. Furthermore, it can predict under NSE conditions under the scarcity of class labels. PSDSL is built on top of the HDWM classifier, which preserves the diversity of the classifiers. PSDSL and HDWM can intelligently switch and adapt to the conditions. The PSDSL adapts to learning states between self-learning, micro-clustering and CGC, whichever approach is beneficial, based on the characteristics of the data stream. HDWM makes use of “seed” learners of different types in an ensemble to maintain its diversity. The ensembles are simply the combination of predictive models grouped to improve the predictive performance of a single classifier. PSDSL is empirically evaluated against COMPOSE, LEVELIW, SCARGC and MClassification on benchmarks, NSE datasets as well as Massive Online Analysis (MOA) data streams and real-world datasets. The results showed that PSDSL performed significantly better than existing approaches on most real-time data streams including randomised data instances. PSDSL performed significantly better than ‘Static’ i.e. the classifier is not updated after it is trained with the first examples in the data streams. When applied to MOA-generated data streams, PSDSL ranked highest (1.5) and thus performed significantly better than SCARGC, while SCARGC performed the same as the Static. PSDSL achieved better average prediction accuracies in a short time than SCARGC. The HDWM algorithm is evaluated on artificial and real-world data streams against existing well-known approaches such as the heterogeneous WMA and the homogeneous Dynamic DWM algorithm. The results showed that HDWM performed significantly better than WMA and DWM. Also, when recurring concept drifts were present, the predictive performance of HDWM showed an improvement over DWM. In both drift and real-world streams, significance tests and post hoc comparisons found significant differences between algorithms, HDWM performed significantly better than DWM and WMA when applied to MOA data streams and 4 real-world datasets Electric, Spam, Sensor and Forest cover. The seeding mechanism and dynamic inclusion of new base learners in the HDWM algorithms benefit from the use of both forgetting and retaining the models. The algorithm also provides the independence of selecting the optimal base classifier in its ensemble depending on the problem. A new approach, Envelope-Clustering is introduced to resolve the cluster overlap conflicts during the cluster labelling process. In this process, PSDSL transforms the centroids’ information of micro-clusters into micro-instances and generates new clusters called Envelopes. The nearest envelope clusters assist the conflicted micro-clusters and successfully guide the cluster labelling process after the concept drifts in the absence of true class labels. PSDSL has been evaluated on real-world problem ‘keystroke dynamics’, and the results show that PSDSL achieved higher prediction accuracy (85.3%) and SCARGC (81.6%), while the Static (49.0%) significantly degrades the performance due to changes in the users typing pattern. Furthermore, the predictive accuracies of SCARGC are found highly fluctuated between (41.1% to 81.6%) based on different values of parameter ‘k’ (number of clusters), while PSDSL automatically determine the best values for this parameter

    Medical Image Analysis using Deep Relational Learning

    Full text link
    In the past ten years, with the help of deep learning, especially the rapid development of deep neural networks, medical image analysis has made remarkable progress. However, how to effectively use the relational information between various tissues or organs in medical images is still a very challenging problem, and it has not been fully studied. In this thesis, we propose two novel solutions to this problem based on deep relational learning. First, we propose a context-aware fully convolutional network that effectively models implicit relation information between features to perform medical image segmentation. The network achieves the state-of-the-art segmentation results on the Multi Modal Brain Tumor Segmentation 2017 (BraTS2017) and Multi Modal Brain Tumor Segmentation 2018 (BraTS2018) data sets. Subsequently, we propose a new hierarchical homography estimation network to achieve accurate medical image mosaicing by learning the explicit spatial relationship between adjacent frames. We use the UCL Fetoscopy Placenta dataset to conduct experiments and our hierarchical homography estimation network outperforms the other state-of-the-art mosaicing methods while generating robust and meaningful mosaicing result on unseen frames.Comment: arXiv admin note: substantial text overlap with arXiv:2007.0778

    From Human Behavior to Machine Behavior

    Get PDF
    A core pursuit of artificial intelligence is the comprehension of human behavior. Imbuing intelligent agents with a good human behavior model can help them understand how to behave intelligently and interactively in complex situations. Due to the increase in data availability and computational resources, the development of machine learning algorithms for duplicating human cognitive abilities has made rapid progress. To solve difficult scenarios, learning-based methods must search for solutions in a predefined but large space. Along with implementing a smart exploration strategy, the right representation for a task can help narrow the search process during learning. This dissertation tackles three important aspects of machine intelligence: 1) prediction, 2) exploration, and 3) representation. More specifically we develop new algorithms for 1) predicting the future maneuvers or outcomes in pilot training and computer architecture applications; 2) exploration strategies for reinforcement learning in game environments and 3) scene representations for autonomous driving agents capable of handling large numbers of dynamic entities. This dissertation makes the following research contributions in the area of representation learning. First, we introduce a new time series representation for flight trajectories in intelligent pilot training simulations. Second, we demonstrate a method, Temporally Aware Embedding (TAE) for learning an embedding that leverages temporal information extracted from data retrieval series. Third, the dissertation introduces GRAD (Graph Representation for Autonomous Driving) that incorporates the future location of neighboring vehicles into the decision-making process. We demonstrate the usage of our models for pilot training, cache usage prediction, and autonomous driving; however, believe that our new time series representations can be applied to many other types of modeling problems

    DynED: Dynamic Ensemble Diversification in Data Stream Classification

    Full text link
    Ensemble methods are commonly used in classification due to their remarkable performance. Achieving high accuracy in a data stream environment is a challenging task considering disruptive changes in the data distribution, also known as concept drift. A greater diversity of ensemble components is known to enhance prediction accuracy in such settings. Despite the diversity of components within an ensemble, not all contribute as expected to its overall performance. This necessitates a method for selecting components that exhibit high performance and diversity. We present a novel ensemble construction and maintenance approach based on MMR (Maximal Marginal Relevance) that dynamically combines the diversity and prediction accuracy of components during the process of structuring an ensemble. The experimental results on both four real and 11 synthetic datasets demonstrate that the proposed approach (DynED) provides a higher average mean accuracy compared to the five state-of-the-art baselines.Comment: Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (CIKM '23), October 21--25, 2023, Birmingham, United Kingdo

    Investigating the learning potential of the Second Quantum Revolution: development of an approach for secondary school students

    Get PDF
    In recent years we have witnessed important changes: the Second Quantum Revolution is in the spotlight of many countries, and it is creating a new generation of technologies. To unlock the potential of the Second Quantum Revolution, several countries have launched strategic plans and research programs that finance and set the pace of research and development of these new technologies (like the Quantum Flagship, the National Quantum Initiative Act and so on). The increasing pace of technological changes is also challenging science education and institutional systems, requiring them to help to prepare new generations of experts. This work is placed within physics education research and contributes to the challenge by developing an approach and a course about the Second Quantum Revolution. The aims are to promote quantum literacy and, in particular, to value from a cultural and educational perspective the Second Revolution. The dissertation is articulated in two parts. In the first, we unpack the Second Quantum Revolution from a cultural perspective and shed light on the main revolutionary aspects that are elevated to the rank of principles implemented in the design of a course for secondary school students, prospective and in-service teachers. The design process and the educational reconstruction of the activities are presented as well as the results of a pilot study conducted to investigate the impact of the approach on students' understanding and to gather feedback to refine and improve the instructional materials. The second part consists of the exploration of the Second Quantum Revolution as a context to introduce some basic concepts of quantum physics. We present the results of an implementation with secondary school students to investigate if and to what extent external representations could play any role to promote students’ understanding and acceptance of quantum physics as a personal reliable description of the world

    SMOClust: Synthetic Minority Oversampling based on Stream Clustering for Evolving Data Streams

    Full text link
    Many real-world data stream applications not only suffer from concept drift but also class imbalance. Yet, very few existing studies investigated this joint challenge. Data difficulty factors, which have been shown to be key challenges in class imbalanced data streams, are not taken into account by existing approaches when learning class imbalanced data streams. In this work, we propose a drift adaptable oversampling strategy to synthesise minority class examples based on stream clustering. The motivation is that stream clustering methods continuously update themselves to reflect the characteristics of the current underlying concept, including data difficulty factors. This nature can potentially be used to compress past information without caching data in the memory explicitly. Based on the compressed information, synthetic examples can be created within the region that recently generated new minority class examples. Experiments with artificial and real-world data streams show that the proposed approach can handle concept drift involving different minority class decomposition better than existing approaches, especially when the data stream is severely class imbalanced and presenting high proportions of safe and borderline minority class examples.Comment: 59 pages, 85 figure

    What Makes a Habitat a Home: Understanding Settlement and Recruitment Variation in European Sea Bass, Dicentrarchus labrax

    Get PDF
    Sea bass stocks in the UK are in decline as a result of increased fishing pressure and variable inter-annual recruitment. Recruitment variation is driven by survival in the early life stages; therefore, nursery habitats are thought to be able to stabilize recruitment through providing optimal growth conditions for juvenile fish. A thorough understanding of the factors that drive juvenile sea bass survival is needed, however, our understanding of what constitutes quality nursery habitat for juvenile sea bass is weak, with current knowledge based almost solely on saltmarshes. Juvenile sea bass were sampled using conventional seine and fyke nets across estuarine habitats, alongside dietary DNA metabarcoding to assess their distribution diet and condition, using measures of abundance, condition, stomach fullness, and diet. To determine whether the mechanism of larvae entering estuarine nurseries is an active or passive process the vertical distribution patterns of larval sea bass were compared across tidal cycles. Finally, over-winter survival was predicted based on energy budget modelling and temperature-dependent growth experiments, based on in-situ measurements of winter temperatures. Juvenile sea bass did not differentially select high tide habitats, but saltmarshes and sand provided increased foraging success. At low tide, however, sea bass were more abundant in complex habitat with lower foraging success. Diets mainly consisted of decapods and polychaete worms across habitats, but there was evidence of increased planktivory over mud. Larval sea bass did not show evidence of flood tide transport and likely rely on passive tidal forcing to migrate into estuaries, or they are trying to retain to deeper water. According to our models, winter thermal minima resulted in complete cohort loss in all scenarios on the East coast. The results of this study suggest that multiple habitats along the estuarine mosaic are important for juvenile sea bass at some point, and that a seascape approach to management is necessary, however, winter temperatures likely present a more extreme bottleneck to recruitment

    Mining Butterflies in Streaming Graphs

    Get PDF
    This thesis introduces two main-memory systems sGrapp and sGradd for performing the fundamental analytic tasks of biclique counting and concept drift detection over a streaming graph. A data-driven heuristic is used to architect the systems. To this end, initially, the growth patterns of bipartite streaming graphs are mined and the emergence principles of streaming motifs are discovered. Next, the discovered principles are (a) explained by a graph generator called sGrow; and (b) utilized to establish the requirements for efficient, effective, explainable, and interpretable management and processing of streams. sGrow is used to benchmark stream analytics, particularly in the case of concept drift detection. sGrow displays robust realization of streaming growth patterns independent of initial conditions, scale and temporal characteristics, and model configurations. Extensive evaluations confirm the simultaneous effectiveness and efficiency of sGrapp and sGradd. sGrapp achieves mean absolute percentage error up to 0.05/0.14 for the cumulative butterfly count in streaming graphs with uniform/non-uniform temporal distribution and a processing throughput of 1.5 million data records per second. The throughput and estimation error of sGrapp are 160x higher and 0.02x lower than baselines. sGradd demonstrates an improving performance over time, achieves zero false detection rates when there is not any drift and when drift is already detected, and detects sequential drifts in zero to a few seconds after their occurrence regardless of drift intervals

    Conformance Checking-based Concept Drift Detection in Process Mining

    Get PDF
    One of the main challenges of process mining is to obtain models that represent a process as simply and accurately as possible. Both characteristics can be greatly influenced by changes in the control flow of the process throughout its life cycle. In this thesis we propose the use of conformance metrics to monitor such changes in a way that allows the division of the log into sub-logs representing different versions of the process over time. The validity of the hypothesis has been formally demonstrated, showing that all kinds of changes in the process flow can be captured using these approaches, including sudden, gradual drifts on both clean and noisy environments, where differentiating between anomalous executions and real changes can be tricky
    • 

    corecore