10 research outputs found

    Sampling with Confidence: Using k-NN Confidence Measures in Active Learning

    Active learning is a process through which classifiers can be built from collections of unlabelled examples with the cooperation of a human oracle who labels a small number of examples selected as most informative. Typically, the most informative examples are selected through uncertainty sampling based on classification scores. However, previous work has shown that, contrary to expectations, there is no direct relationship between classification scores and classification confidence. Fortunately, there exists a collection of particularly effective techniques for building measures of classification confidence from the similarity information generated by k-NN classifiers. This paper investigates using these confidence measures in a new active learning sample selection strategy, and shows that this strategy outperforms one based on uncertainty sampling with classification scores.
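    The abstract does not spell out the confidence measure itself. The following is a minimal sketch of the general idea, with a similarity-weighted k-NN vote standing in for the paper's confidence measures; the function names and the 1-D distance are illustrative assumptions, not the paper's method.

    ```python
    def knn_confidence(query, labelled, k=3):
        """Return (predicted_label, confidence) for a 1-D query point.

        Confidence is the similarity-weighted vote share of the winning
        class among the k nearest labelled examples (an assumed stand-in
        for the paper's k-NN confidence measures).
        """
        # Rank labelled examples (x, label) by distance to the query.
        neighbours = sorted(labelled, key=lambda ex: abs(ex[0] - query))[:k]
        votes = {}
        for x, label in neighbours:
            sim = 1.0 / (1.0 + abs(x - query))   # inverse-distance similarity
            votes[label] = votes.get(label, 0.0) + sim
        winner = max(votes, key=votes.get)
        return winner, votes[winner] / sum(votes.values())

    def select_most_informative(pool, labelled, k=3):
        """Confidence-based sampling: query the example the classifier
        is least confident about."""
        return min(pool, key=lambda q: knn_confidence(q, labelled, k)[1])
    ```

    With two well-separated classes, a point midway between them gets low confidence and is selected for labelling first.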

    Uncertainty-Aware Organ Classification for Surgical Data Science Applications in Laparoscopy

    Objective: Surgical data science is evolving into a research field that aims to observe everything occurring within and around the treatment process to provide situation-aware data-driven assistance. In the context of endoscopic video analysis, the accurate classification of organs in the field of view of the camera poses a technical challenge. Herein, we propose a new approach to anatomical structure classification and image tagging that features an intrinsic measure of confidence to estimate its own performance with high reliability and which can be applied to both RGB and multispectral imaging (MI) data. Methods: Organ recognition is performed using a superpixel classification strategy based on textural and reflectance information. Classification confidence is estimated by analyzing the dispersion of class probabilities. Assessment of the proposed technology is performed through a comprehensive in vivo study with seven pigs. Results: When applied to image tagging, mean accuracy in our experiments increased from 65% (RGB) and 80% (MI) to 90% (RGB) and 96% (MI) with the confidence measure. Conclusion: Results showed that the confidence measure had a significant influence on the classification accuracy, and MI data are better suited for anatomical structure labeling than RGB data. Significance: This work significantly enhances the state of the art in automatic labeling of endoscopic videos by introducing the use of the confidence metric, and by being the first study to use MI data for in vivo laparoscopic tissue classification. The data of our experiments will be released as the first in vivo MI dataset upon publication of this paper.
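    The exact dispersion measure is not given in the abstract. One common way to turn the dispersion of class probabilities into a confidence score is normalized entropy, sketched here as an assumed stand-in; the function names and the threshold value are hypothetical.

    ```python
    import math

    def dispersion_confidence(probs):
        """Confidence from the dispersion of class probabilities: a peaked
        distribution (low entropy) yields confidence near 1, a flat one
        near 0. Normalized entropy is an assumed stand-in for the paper's
        dispersion measure."""
        entropy = -sum(p * math.log(p) for p in probs if p > 0)
        return 1.0 - entropy / math.log(len(probs))

    def tag_image(superpixel_probs, threshold=0.6):
        """Keep only superpixel predictions whose confidence clears the
        threshold; return (class_index, confidence) pairs."""
        kept = []
        for probs in superpixel_probs:
            conf = dispersion_confidence(probs)
            if conf >= threshold:
                kept.append((max(range(len(probs)), key=probs.__getitem__), conf))
        return kept
    ```

    Filtering out low-confidence superpixels before tagging is one plausible mechanism behind the reported accuracy gains.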

    Prediction of player position for talent identification in association netball: a regression-based approach

    Among the challenges of Industry 4.0 is managing organizations' talents, especially ensuring that the right person is selected for each position. This study introduces a predictive approach for talent identification in the sport of netball using individual player qualities in terms of physical fitness, mental capacity, and technical skills. A data mining approach is proposed using three algorithms: Decision Tree (DT), Neural Network (NN), and Linear Regression (LR). The models are then compared based on Relative Absolute Error (RAE), Mean Absolute Error (MAE), Relative Square Error (RSE), Root Mean Square Error (RMSE), and Coefficient of Determination (R2). The findings are presented and discussed in light of early talent spotting and selection. Overall, LR performs best in terms of MAE and RMSE, having the lowest values among the three models.
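    The error metrics used for the comparison have standard definitions; as a quick reference, a plain-Python sketch of three of them (MAE, RMSE, and R2):

    ```python
    import math

    def mae(y_true, y_pred):
        """Mean Absolute Error: average magnitude of the residuals."""
        return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

    def rmse(y_true, y_pred):
        """Root Mean Square Error: penalizes large residuals more heavily."""
        return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

    def r2(y_true, y_pred):
        """Coefficient of Determination: 1 minus residual sum of squares
        over total sum of squares; 1 is a perfect fit."""
        mean = sum(y_true) / len(y_true)
        ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
        ss_tot = sum((t - mean) ** 2 for t in y_true)
        return 1.0 - ss_res / ss_tot
    ```

    Lower MAE and RMSE (and higher R2) indicate a better regression model, which is the basis of the comparison reported above.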

    KLEOR: A Knowledge Lite Approach to Explanation Oriented Retrieval

    In this paper, we describe precedent-based explanations for case-based classification systems. Previous work has shown that explanation cases that are more marginal than the query case, in the sense of lying between the query case and the decision boundary, are more convincing explanations. We show how to retrieve such explanation cases in a way that requires lower knowledge engineering overheads than previous approaches. We evaluate our approaches empirically, finding that the explanations our systems retrieve are often more convincing than those found by the previous approach. The paper ends with a thorough discussion of a range of factors that affect precedent-based explanations, many of which warrant further research.
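    The abstract does not give the retrieval criteria. A rough sketch of one knowledge-lite way to find such a marginal explanation case, assuming 1-D numeric cases and using the nearest other-class case (the "nearest miss") as a proxy for the decision boundary; this is an illustrative reconstruction, not necessarily the paper's exact method:

    ```python
    def explanation_case(cases, query, predicted):
        """Retrieve a marginal explanation case for a prediction.

        Step 1: find the nearest case of a different class (nearest miss),
        which approximates where the decision boundary lies.
        Step 2: return the case of the predicted class most similar to
        that miss, i.e. a same-class case close to the boundary.
        Cases are (value, label) pairs; distance is plain 1-D distance.
        """
        miss = min((c for c in cases if c[1] != predicted),
                   key=lambda c: abs(c[0] - query[0]))
        return min((c for c in cases if c[1] == predicted),
                   key=lambda c: abs(c[0] - miss[0]))
    ```

    The returned case lies between the query and the opposing class, which is what makes it a more convincing precedent than the query's own nearest neighbour.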

    Semi-Supervised Clustering with Partial Background Information


    Confidence of a k-Nearest Neighbors Python Algorithm for the 3D Visualization of Sedimentary Porous Media

    In a previous paper, the authors implemented a machine learning k-nearest neighbors (KNN) algorithm and Python libraries to create two 3D interactive models of the stratigraphic architecture of the Quaternary onshore Llobregat River Delta (NE Spain) for groundwater exploration purposes. The main limitation of this previous paper was its lack of routines for evaluating the confidence of the 3D models. Building from the previous paper, this paper refines the programming code and introduces an additional algorithm to evaluate the confidence of the KNN predictions. A variant of the Similarity Ratio method was used to quantify the KNN prediction confidence. This variant used weights that were inversely proportional to the distance between each grain-size class and the inferred point to work out a value that played the role of similarity. While the KNN algorithm and Python libraries demonstrated their efficacy for obtaining 3D models of the stratigraphic arrangement of sedimentary porous media, the KNN prediction confidence verified the certainty of the 3D models. In the Llobregat River Delta, the KNN prediction confidence at each prospecting depth was a function of the available data density at that depth. As expected, the KNN prediction confidence decreased according to the decreasing data density at lower depths. The obtained average-weighted confidence was in the 0.44−0.53 range for gravel bodies at prospecting depths in the 12.7−72.4 m b.s.l. range and was in the 0.42−0.55 range for coarse sand bodies at prospecting depths in the 4.6−83.9 m b.s.l. range. In a couple of cases, spurious average-weighted confidences of 0.29 in one gravel body and 0.30 in one coarse sand body were obtained. These figures were interpreted as the result of the quite different weights of neighbors from different grain-size classes at short distances. 
The KNN confidence measure has proven suitable for identifying these anomalous results in the supposedly well-cleaned grain-size database used in this study. The introduced KNN confidence quantifies the reliability of the 3D interactive models, which is a necessary stage for decision-making in economic and environmental geology. In the Llobregat River Delta, this quantification clearly improves groundwater exploration predictability. Funding: Research Project PID2020-114381GB-100 of the Spanish Ministry of Science and Innovation, Research Project 101086497 of the Horizon Europe Framework Programme HORIZON-CL6-2022-GOVERNANCE-01-07, Research Groups and Projects of the Generalitat Valenciana from the University of Alicante (CTMA-IGA), and Research Groups FQM-343 and RNM-188 of the Junta de Andalucía.
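    The abstract describes the confidence as a Similarity Ratio variant with weights inversely proportional to distance. A minimal sketch of that idea follows; the exact variant used in the paper may differ, and the function name and epsilon guard are assumptions.

    ```python
    def knn_confidence(neighbours, predicted_class):
        """Inverse-distance-weighted similarity ratio.

        `neighbours` is a list of (distance, grain_size_class) pairs for
        the k nearest boreholes to the inferred point. The confidence is
        the summed weight of neighbours in the predicted class divided by
        the summed weight of all neighbours, so it lies in (0, 1].
        """
        eps = 1e-9  # guard against a zero distance (exact data point)
        weights = [(1.0 / (d + eps), label) for d, label in neighbours]
        total = sum(w for w, _ in weights)
        in_class = sum(w for w, label in weights if label == predicted_class)
        return in_class / total
    ```

    When all neighbours agree with the prediction the ratio is 1; neighbours of other grain-size classes at short distances pull it down sharply, which matches the spurious low values (0.29, 0.30) reported above.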

    Application of knowledge discovery in databases : automating manual tasks

    Businesses have large data stored in databases and data warehouses that is beyond the scope of traditional analysis methods. Knowledge discovery in databases (KDD) has been applied to gain insight from this large business data. In this study, I investigated the application of KDD to automate two manual tasks in a Finnish company that provides financial automation solutions. The objective of the study was to develop models from historical data and use the models to handle future transactions so as to minimize or eliminate the manual tasks. Historical data about the manual tasks was extracted from the database. The data was prepared, and three machine learning methods were used to develop classification models from it: decision tree, Naïve Bayes, and k-nearest neighbor. The developed models were evaluated on test data, based on accuracy and prediction rate. Overall, the decision tree had the highest accuracy while k-nearest neighbor had the highest prediction rate. However, there were significant differences in performance across datasets. Overall, the results show that there are patterns in the data that can be used to automate the manual tasks. Due to time constraints, data preparation was not done thoroughly; in future iterations, more thorough data preparation could yield better results. Moreover, further study is required to determine the effect of transaction type on modeling. It can be concluded that knowledge discovery methods and tools can be used to automate the manual tasks.

    Generating Estimates of Classification Confidence for a Case-Based Spam Filter

    Abstract. Producing estimates of classification confidence is surprisingly difficult. One might expect that classifiers that can produce numeric classification scores (e.g. k-Nearest Neighbour or Naive Bayes) could readily produce confidence estimates based on thresholds. In fact, this proves not to be the case, probably because these are not probabilistic classifiers in the strict sense. The numeric scores coming from k-Nearest Neighbour or Naive Bayes classifiers are not well correlated with classification confidence. In this paper we describe a case-based spam filtering application that would benefit significantly from an ability to attach confidence predictions to positive classifications (i.e. messages classified as spam). We show that 'obvious' confidence metrics for a case-based classifier are not effective. We propose an ensemble-like solution that aggregates a collection of confidence metrics and show that this offers an effective solution in this spam filtering domain.

    Generating estimates of classification confidence for a case-based spam filter

    Producing estimates of classification confidence is surprisingly difficult. One might expect that classifiers that can produce numeric classification scores (e.g. k-Nearest Neighbour, Naïve Bayes or Support Vector Machines) could readily produce confidence estimates based on thresholds. In fact, this proves not to be the case, probably because these are not probabilistic classifiers in the strict sense. The numeric scores coming from k-Nearest Neighbour, Naïve Bayes and Support Vector Machine classifiers are not well correlated with classification confidence. In this paper we describe a case-based spam filtering application that would benefit significantly from an ability to attach confidence predictions to positive classifications (i.e. messages classified as spam). We show that 'obvious' confidence metrics for a case-based classifier are not effective. We propose an ensemble-like solution that aggregates a collection of confidence metrics and show that this offers an effective solution in this spam filtering domain.
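    The aggregation scheme itself is not detailed in the abstract. A minimal sketch of an ensemble-style aggregate of per-message confidence metrics follows; the function names are hypothetical and a simple weighted mean is the assumed combiner.

    ```python
    def aggregate_confidence(metric_scores, weights=None):
        """Ensemble-style aggregate: (weighted) mean of several confidence
        metrics for one message, each already scaled to [0, 1]. The mean
        combiner is an assumption; the paper's aggregation may differ."""
        if weights is None:
            weights = [1.0] * len(metric_scores)
        return sum(m * w for m, w in zip(metric_scores, weights)) / sum(weights)

    def filter_spam(messages, threshold=0.8):
        """Act automatically only on positive (spam) classifications whose
        aggregated confidence clears the threshold; route the rest to
        manual review. Messages are (id, is_spam, metric_scores) tuples."""
        auto, review = [], []
        for msg_id, is_spam, metric_scores in messages:
            if is_spam and aggregate_confidence(metric_scores) >= threshold:
                auto.append(msg_id)
            elif is_spam:
                review.append(msg_id)
        return auto, review
    ```

    Splitting positive classifications into a high-confidence automatic bin and a low-confidence review bin is the practical benefit the abstract argues for.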

    Active Learning for Text Classification

    Text classification approaches are used extensively to solve real-world challenges. The success or failure of text classification systems hangs on the datasets used to train them, without a good dataset it is impossible to build a quality system. This thesis examines the applicability of active learning in text classification for the rapid and economical creation of labelled training data. Four main contributions are made in this thesis. First, we present two novel selection strategies to choose the most informative examples for manually labelling. One is an approach using an advanced aggregated confidence measurement instead of the direct output of classifiers to measure the confidence of the prediction and choose the examples with least confidence for querying. The other is a simple but effective exploration guided active learning selection strategy which uses only the notions of density and diversity, based on similarity, in its selection strategy. Second, we propose new methods of using deterministic clustering algorithms to help bootstrap the active learning process. We first illustrate the problems of using non-deterministic clustering for selecting initial training sets, showing how non-deterministic clustering methods can result in inconsistent behaviour in the active learning process. We then compare various deterministic clustering techniques and commonly used non-deterministic ones, and show that deterministic clustering algorithms are as good as non-deterministic clustering algorithms at selecting initial training examples for the active learning process. More importantly, we show that the use of deterministic approaches stabilises the active learning process. Our third direction is in the area of visualising the active learning process. 
We demonstrate the use of an existing visualisation technique in understanding active learning selection strategies, showing that a better understanding of selection strategies can be achieved with the help of visualisation techniques. Finally, to evaluate the practicality and usefulness of active learning as a general dataset labelling methodology, it is desirable that actively labelled datasets can be reused more widely, rather than being limited to one particular classifier. We compare the reusability of popular active learning methods for text classification and identify the best classifiers to use in active learning for text classification. This thesis is concerned with using active learning methods to label large unlabelled textual datasets. Our domain of interest is text classification, but most of the proposed methods are quite general and so are applicable to other domains having large collections of high-dimensional data.
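    The density-and-diversity selection strategy described above can be sketched in a few lines. This toy 1-D version assumes average similarity to the pool for density and distance to the nearest labelled example for diversity; both choices, the similarity function, and the product combination are illustrative assumptions.

    ```python
    def similarity(a, b):
        """Toy inverse-distance similarity on 1-D examples."""
        return 1.0 / (1.0 + abs(a - b))

    def density(x, pool):
        """Representativeness: average similarity of x to the rest of the
        unlabelled pool (dense regions score high)."""
        others = [p for p in pool if p != x]
        return sum(similarity(x, p) for p in others) / len(others)

    def diversity(x, labelled):
        """Novelty: one minus the similarity to the closest already
        labelled example (far from labelled data scores high)."""
        if not labelled:
            return 1.0
        return 1.0 - max(similarity(x, l) for l in labelled)

    def select_next(pool, labelled):
        """Exploration-guided selection: pick the example that is both
        representative of the pool and different from what is labelled."""
        return max(pool, key=lambda x: density(x, pool) * diversity(x, labelled))
    ```

    With labelled data at one end of the pool, the strategy avoids both the already-covered region and lone outliers, preferring a dense but unexplored area.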