
    Exploring probabilistic models for semi-supervised learning

    Deep neural networks are increasingly harnessed for computer vision tasks, thanks to their robust performance. However, their training demands large-scale labeled datasets, which are labor-intensive to prepare. Semi-supervised learning (SSL) offers a solution by learning from a mix of labeled and unlabeled data. While most state-of-the-art SSL methods follow a deterministic approach, the exploration of their probabilistic counterparts remains limited. This research area is important because probabilistic models can provide uncertainty estimates critical for real-world applications. For instance, SSL-trained models may fall short of those trained with supervised learning due to potential pseudo-label errors in unlabeled data, and these models are more likely to make wrong predictions in practice. Especially in critical sectors like medical image analysis and autonomous driving, decision-makers must understand the model’s limitations and when incorrect predictions may occur, insights often provided by uncertainty estimates. Furthermore, uncertainty can also serve as a criterion for filtering out unreliable pseudo-labels when unlabeled samples are used for training, potentially improving deep model performance. This thesis furthers the exploration of probabilistic models for SSL. Drawing on the widely-used Bayesian approximation tool, Monte Carlo (MC) dropout, I propose a new probabilistic framework, the Generative Bayesian Deep Learning (GBDL) architecture, for semi-supervised medical image segmentation. This approach not only mitigates potential overfitting found in previous methods but also achieves superior results across four evaluation metrics. Unlike its empirically designed predecessors, GBDL is underpinned by a full Bayesian formulation, providing a theoretical probabilistic foundation. Acknowledging MC dropout’s limitations, I introduce NP-Match, a novel probabilistic approach for large-scale semi-supervised image classification.
I evaluated NP-Match’s generalization capabilities through extensive experiments in different challenging settings such as standard, imbalanced, and multi-label semi-supervised image classification. According to the experimental results, NP-Match not only competes favorably with previous state-of-the-art methods but also estimates uncertainty more rapidly than MC-dropout-based models, thus enhancing both training and testing efficiency. Lastly, I propose NP-SemiSeg, a new probabilistic model for semi-supervised semantic segmentation. This flexible model can be integrated with various existing segmentation frameworks to make predictions and estimate uncertainty. Experiments indicate that NP-SemiSeg surpasses MC dropout in accuracy, uncertainty quantification, and speed.
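The MC dropout mechanism referenced above can be sketched in a few lines: dropout is kept active at test time, and the spread across stochastic forward passes serves as an uncertainty estimate. This is a generic NumPy illustration on a toy two-layer network, not the thesis's GBDL or NP-Match implementation; all weights and shapes here are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "network": one hidden layer; dropout stays ON at test time.
W1 = rng.normal(size=(4, 16))
W2 = rng.normal(size=(16, 3))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mc_dropout_predict(x, T=100, p_drop=0.5):
    """Run T stochastic forward passes with dropout enabled, then
    average the softmax outputs (predictive mean) and take the
    variance across passes as a per-class uncertainty estimate."""
    probs = []
    for _ in range(T):
        h = np.maximum(x @ W1, 0.0)            # ReLU hidden layer
        mask = rng.random(h.shape) >= p_drop   # Bernoulli dropout mask
        h = h * mask / (1.0 - p_drop)          # inverted-dropout scaling
        probs.append(softmax(h @ W2))
    probs = np.array(probs)
    return probs.mean(axis=0), probs.var(axis=0)

x = rng.normal(size=4)
mean, var = mc_dropout_predict(x)
```

The need for T forward passes per prediction is exactly the efficiency cost that NP-Match is reported to avoid.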

    Stream-based active learning with linear models

    The proliferation of automated data collection schemes and the advances in sensorics are increasing the amount of data we are able to monitor in real-time. However, given the high annotation costs and the time required by quality inspections, data is often available in an unlabeled form. This is fostering the use of active learning for the development of soft sensors and predictive models. In production, instead of performing random inspections to obtain product information, labels are collected by evaluating the information content of the unlabeled data. Several query strategy frameworks for regression have been proposed in the literature but most of the focus has been dedicated to the static pool-based scenario. In this work, we propose a new strategy for the stream-based scenario, where instances are sequentially offered to the learner, which must instantaneously decide whether to perform the quality check to obtain the label or discard the instance. The approach is inspired by the optimal experimental design theory and the iterative aspect of the decision-making process is tackled by setting a threshold on the informativeness of the unlabeled data points. The proposed approach is evaluated using numerical simulations and the Tennessee Eastman Process simulator. The results confirm that selecting the examples suggested by the proposed algorithm allows for a faster reduction in the prediction error.

    Comment: Published in Knowledge-Based Systems (2022).
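The thresholded decision rule described above can be sketched as follows. The informativeness score used here is the statistical leverage x^T (X^T X)^{-1} x, a standard quantity from optimal experimental design for linear models; both this choice and the threshold value are illustrative assumptions, not the paper's exact criterion.

```python
import numpy as np

def stream_query_decision(x, XtX_inv, threshold):
    """Decide on the spot whether to label an incoming instance x.
    Leverage x^T (X^T X)^{-1} x is high for points in regions of
    input space that the labeled design matrix X covers poorly."""
    score = float(x @ XtX_inv @ x)
    return score > threshold

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))          # already-labeled design matrix
XtX_inv = np.linalg.inv(X.T @ X)

# Stream of candidate instances: keep only those worth inspecting.
labeled = [x for x in rng.normal(size=(20, 3))
           if stream_query_decision(x, XtX_inv, threshold=0.1)]
```

In a real deployment `XtX_inv` would be updated (e.g. via a rank-one Sherman-Morrison step) each time a queried label is obtained.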

    A survey on online active learning

    Online active learning is a paradigm in machine learning that aims to select the most informative data points to label from a data stream. The problem of minimizing the cost associated with collecting labeled observations has gained a lot of attention in recent years, particularly in real-world applications where data is only available in an unlabeled form. Annotating each observation can be time-consuming and costly, making it difficult to obtain large amounts of labeled data. To overcome this issue, many active learning strategies have been proposed in recent decades, aiming to select the most informative observations for labeling in order to improve the performance of machine learning models. These approaches can be broadly divided into two categories: static pool-based and stream-based active learning. Pool-based active learning involves selecting a subset of observations from a closed pool of unlabeled data, and it has been the focus of many surveys and literature reviews. However, the growing availability of data streams has led to an increase in the number of approaches that focus on online active learning, which involves continuously selecting and labeling observations as they arrive in a stream. This work provides an overview of the most recently proposed approaches for selecting the most informative observations from data streams in the context of online active learning. We review the various techniques that have been proposed and discuss their strengths and limitations, as well as the challenges and opportunities that exist in this area of research. Our review offers a comprehensive and up-to-date picture of the field and highlights directions for future work.
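The pool-based versus stream-based distinction drawn above reduces to two different selection rules, which can be contrasted schematically (these two functions are a didactic simplification, not any specific surveyed method):

```python
import numpy as np

def pool_based_pick(scores):
    """Pool-based: the whole closed pool of unlabeled observations
    is available at once, so rank all informativeness scores and
    label the single best candidate."""
    return int(np.argmax(scores))

def stream_based_pick(score, threshold=0.8):
    """Stream-based: each observation is seen once, in arrival
    order, and must be labeled or discarded immediately, so the
    decision can only compare the score against a threshold."""
    return score > threshold
```

The key practical difference is visible in the signatures: the pool rule sees all scores, the stream rule sees one at a time.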

    Active Learning: New Approaches, and Industrial Applications

    Active learning is one form of supervised machine learning. In supervised learning, a set of labeled samples is passed to a learning algorithm for training a classifier. However, labeling large amounts of training samples can be costly and error-prone. Active learning deals with the development of algorithms that interactively select a subset of the available unlabeled samples for labeling, and aims at minimizing the labeling effort while maintaining classification performance. The key challenge for the development of so-called active learning strategies is the balance between exploitation and exploration: On the one hand, the estimated decision boundary needs to be refined in feature space regions where it has already been established, while, on the other hand, the feature space needs to be scanned carefully for unexpected class distributions. In this thesis, two approaches to active learning are presented that consider these two aspects in a novel way. In order to lay the foundations for the first one, it is proposed to express the uncertainty in class prediction of a classifier at a test point in terms of a second-order distribution. The mean of this distribution corresponds to the common estimate of the posterior class probabilities and thus is related to the distance of the test point to the decision boundary, whereas the spread of the distribution indicates the degree of exploration in the corresponding region of feature space. This allows for the evaluation of the utility of labeling a yet unlabeled point with respect to classifier improvement in a principled way and leads to a completely novel approach to active learning. The proposed strategy is then implemented and evaluated based on kernel density classification. The generic active learning strategy can be combined with any other classifier, but it performs best if the derived second-order distributions are sufficiently good approximations to the sampling distribution. 
Although second-order distributions for random forests are derived in this thesis, they do not approximate the sampling distribution sufficiently well and mainly allow only for the relative comparison of prediction uncertainty between test points. In order to combine the state-of-the-art classification performance of random forests with the principal ideas of the first active learning approach, a related second approach for random forests is derived. It is, in addition, tailored to the demands of industrial optical inspection: bag-wise labeling with weak labels and strongly imbalanced classes. Moreover, an outlier detection scheme based on random forests is derived that is used by the proposed active learning algorithm. Finally, a new computational scheme for Gaussian process classification is presented. It is compared to two standard methods in geostatistics, with respect to both theoretical consistency and practical performance. The method evolved as a by-product of considering Gaussian process models for active learning.
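The core idea of a second-order distribution, whose mean recovers the usual posterior class probability while its spread measures how well a region has been explored, can be illustrated with a Beta distribution over the positive-class probability. Treating local label counts as Beta evidence is an illustrative stand-in for the thesis's kernel-density construction, not its actual derivation.

```python
def second_order_summary(k_pos, k_neg, prior=1.0):
    """Summarize a Beta(k_pos + prior, k_neg + prior) second-order
    distribution over the positive-class probability at a test
    point.  The mean is the familiar posterior estimate; the
    variance shrinks as more nearby labels accumulate, so it acts
    as an exploration signal for active learning."""
    a, b = k_pos + prior, k_neg + prior
    mean = a / (a + b)
    var = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean, var

# Same mean, very different exploration levels:
m1, v1 = second_order_summary(2, 2)      # 4 nearby labels
m2, v2 = second_order_summary(200, 200)  # 400 nearby labels
# Both means are 0.5, but v1 is far larger: the first region is
# barely explored, so labeling there is more valuable.
```

This is exactly the exploitation/exploration split described above: the mean locates the decision boundary, the variance flags unexplored regions.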

    Solving the challenges of concept drift in data stream classification.

    The rise of network-connected devices and applications leads to a significant increase in the volume of data that are continuously generated over time, called data streams. In real-world applications, storing the entirety of a data stream for later analysis is often not practical, due to the data stream’s potentially infinite volume. Data stream mining techniques and frameworks are therefore created to analyze streaming data as they arrive. However, compared to traditional data mining techniques, challenges unique to data stream mining also emerge, due to the high arrival rate of data streams and their dynamic nature. In this dissertation, an array of techniques and frameworks is presented to improve the solutions to some of these challenges. First, this dissertation acknowledges that a “no free lunch” theorem exists for data stream mining, where no silver-bullet solution can solve all problems of data stream mining. The dissertation focuses on the detection of changes in data distribution in data stream mining. These changes are called concept drift. Concept drift can be categorized into many types, and a detection algorithm often works only on some types of drift, not all of them. Because of this, the dissertation finds specific techniques to solve specific challenges instead of looking for a general solution. Then, this dissertation considers improving solutions for the challenges posed by the high arrival rate of data streams. Data stream mining frameworks often need to process vast amounts of data samples in limited time. Some data mining activities, notably data sample labeling for classification, are too costly or too slow at such a large scale. This dissertation presents two techniques that reduce the amount of labeling needed for data stream classification. The first technique presents a grid-based label selection process that applies to highly imbalanced data streams. In such data streams, one class of samples vastly outnumbers another.
Many majority-class samples need to be labeled before a minority-class sample can be found, due to the imbalance. The presented technique divides the data samples into groups, called grids, and actively searches within a grid for minority-class samples that are close by. Experiment results show the technique can reduce the total number of data samples that need to be labeled. The second technique presents a smart preprocessing technique that reduces the number of times a new learning model needs to be trained due to concept drift. Less model training means fewer data labels are required, and thus lower cost. Experiment results show that in some cases the reduced performance of learning models is the result of improper preprocessing of the data, not of concept drift. By adapting preprocessing to the changes in data streams, models can retain high performance without retraining. Acknowledging the high cost of labeling, the dissertation then considers the scenario where labels are unavailable when needed. The framework Sliding Reservoir Approach for Delayed Labeling (SRADL) is presented to explore solutions to this problem. SRADL addresses the delayed labeling problem where concept drift occurs and no labels are immediately available. SRADL uses semi-supervised learning, employing a sliding-window approach to store historical data, which is combined with new unlabeled data to train new models. Experiments show that SRADL performs well in some cases of delayed labeling. Next, the dissertation considers improving solutions for the challenge of dynamism within data streams, most notably concept drift. The complex nature of concept drift means that most existing detection algorithms can only detect limited types of concept drift. To detect more types of concept drift, an ensemble approach that employs various algorithms, called the Heuristic Ensemble Framework for Concept Drift Detection (HEFDD), is presented.
The occurrence of each type of concept drift is voted on by the detection results of each algorithm in the ensemble, and types of concept drift receiving a majority of votes are declared detected. Experiment results show that HEFDD is able to improve detection accuracy significantly while reducing false positives. With the ability to detect various types of concept drift provided by HEFDD, the dissertation then improves the delayed labeling framework SRADL. A new combined framework, SRADL-HEFDD, is presented, which produces synthetic labels to handle the unavailability of labels from a human expert. SRADL-HEFDD employs different synthetic labeling techniques based on the different types of drift detected by HEFDD. Experimental results show that, compared to the default SRADL, the combined framework improves prediction performance when only a small number of labeled samples is available. Finally, as machine learning applications are increasingly used in critical domains such as medical diagnostics, the accountability, explainability, and interpretability of machine learning algorithms need to be considered. Explainable machine learning aims to use a white-box approach for data analytics, which enables learning models to be explained and interpreted by human users. However, few studies have been done on explaining what has changed in a dynamic data stream environment. This dissertation thus presents the Data Stream Explainability (DSE) framework. DSE visualizes changes in data distribution and model classification boundaries between chunks of streaming data. The visualizations can then be used by a data mining researcher to generate explanations of what has changed within the data stream. To show that DSE can help average users better understand data stream mining, a survey was conducted with an expert group and a non-expert group of users. Results show DSE can reduce the gap in understanding of what changed in data stream mining between the two groups.
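The majority-vote step of the ensemble drift detector described above can be sketched in a few lines. The function name and the strict-majority quorum are illustrative assumptions, not HEFDD's exact scheme.

```python
def ensemble_drift_vote(detector_flags, quorum=None):
    """Combine the per-detector drift verdicts for one drift type:
    drift is declared only when at least a majority of the
    detectors in the ensemble agree, which suppresses false
    positives from any single detector."""
    if quorum is None:
        quorum = len(detector_flags) // 2 + 1  # strict majority
    return sum(detector_flags) >= quorum

# Three of five detectors fire -> drift declared for this type.
votes_a = [True, True, True, False, False]
# Two of five -> not enough votes, no drift declared.
votes_b = [True, True, False, False, False]
```

Running one such vote per drift type yields exactly the per-type detection outcome the abstract describes.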

    Semi-supervised active learning anomaly detection

    Master’s (Bologna) in Data Analytics for Business. The analysis of Time Series data is a growing field of study due to the increase in the rate of data collection from the most varied sensors, which leads to an overload of information to be analysed in order to obtain the most accurate conclusions possible. Hence, due to the high volume of data without labels, automated detection and labelling of anomalies in Time Series data is an active area of research, as it becomes impossible to manually identify abnormal behavior in Time Series because of the high time and monetary costs. This research focuses on investigating the power of a Semi-Supervised Active Learning algorithm to identify outlier-type anomalies in univariate Time Series. To maximize the performance of the algorithm, we start by proposing an initial pool of features, from which the ones with the best classification power are selected to develop the algorithm. Regarding the Semi-Supervised Learning segment of the process, a comparison between several classifiers is made. In addition, various Query Strategies are proposed in the Active Learning segment to increase the informativeness of the observations chosen to be manually labelled, so that the time spent labelling anomalies can be decreased without a great impact on the performance of the model. First, we demonstrate that the pool of designed features identifies the anomalies better than features selected in a fully automatized process. Furthermore, we demonstrate that a Query Strategy that selects the most informative observations to be expertly classified, based on the utility and uncertainty of the observations, exhibits better results than randomly selecting the observations to be tagged, improving the performance of the model without infeasible time and cost spent in the identification of the anomalous behavior.
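The contrast between random tagging and an informativeness-driven query strategy can be sketched with plain uncertainty sampling: pick the observations whose anomaly scores sit closest to the decision threshold. This is a generic uncertainty-sampling rule under an assumed threshold of 0.5; the thesis's strategy additionally weighs a utility term.

```python
import numpy as np

def select_queries(scores, k):
    """Return the indices of the k observations whose anomaly
    scores are closest to the 0.5 decision threshold, i.e. the
    ones the current model is least sure about, instead of
    sampling uniformly at random."""
    uncertainty = -np.abs(np.asarray(scores) - 0.5)
    return np.argsort(uncertainty)[-k:]

# Scores near 0 or 1 are confident; 0.48 and 0.55 are ambiguous.
scores = [0.05, 0.48, 0.92, 0.55, 0.10]
picked = select_queries(scores, k=2)
```

Labeling only the ambiguous observations is what lets the annotation budget shrink without a comparable drop in model performance.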