11 research outputs found

    Cost-Sensitive Learning for Recurrence Prediction of Breast Cancer

    Breast cancer is one of the leading causes of cancer death and specifically accounts for 10.4% of all cancer incidences among women. The prediction of breast cancer recurrence has been a challenging research problem for many researchers. Data mining techniques have recently received considerable attention, especially when used for the construction of prognosis models from survival data. However, existing data mining techniques may not handle censored data effectively: censored instances are often discarded when applying classification techniques to prognosis. In this paper, we propose a cost-sensitive learning approach that involves the censored data in prognostic assessment to achieve better recurrence prediction capability. The proposed approach employs an outcome inference mechanism to infer the possible probabilistic outcome of each censored instance and adopts cost-proportionate rejection sampling and a committee machine strategy to take these instances with probabilistic outcomes into account during the classification model learning process. We empirically evaluate the effectiveness of our proposed approach for breast cancer recurrence prediction and include a censored-data-discarding method (i.e., building the recurrence prediction model by only using uncensored data) and the Kaplan-Meier method (a common prognosis method) as performance benchmarks. Overall, our evaluation results suggest that the proposed approach outperforms its benchmark techniques, as measured by precision, recall, and F1 score.
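
The abstract names cost-proportionate rejection sampling and a committee machine strategy without giving details. As a rough sketch only, and not the authors' implementation, the sampling step might look like the Python below: each example is kept with probability cost / z, and one sample is drawn per committee member. The function names, the `costs` weights (for instance, an inferred outcome probability per censored instance), and the seed are all hypothetical.

```python
import random


def rejection_sample(examples, costs, rng, z=None):
    """Keep each example with probability cost / z, where z >= max(costs)."""
    if z is None:
        z = max(costs)  # assumption: normalise by the largest cost
    if z <= 0:
        raise ValueError("z must be positive")
    # rng.random() lies in [0, 1), so cost == z is always kept
    # and cost == 0 is never kept.
    return [ex for ex, c in zip(examples, costs) if rng.random() < c / z]


def committee_samples(examples, costs, n_members, seed=0):
    """Draw one independent rejection sample per committee member."""
    rng = random.Random(seed)
    return [rejection_sample(examples, costs, rng) for _ in range(n_members)]
```

Each returned sample would then train one classifier of the committee; how the members are trained and combined is outside this sketch.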

    Learning Using Privileged Information: SVM+ and Weighted SVM

    Prior knowledge can be used to improve the predictive performance of learning algorithms or to reduce the amount of data required for training. The same goal is pursued within the learning using privileged information paradigm, which was recently introduced by Vapnik et al. and is aimed at utilizing additional information available only at training time, a framework implemented by SVM+. We relate the privileged information to importance weighting and show that the prior knowledge expressible with privileged features can also be encoded by weights associated with every training example. We show that a weighted SVM can always replicate an SVM+ solution, while the converse is not true, and we construct a counterexample highlighting the limitations of SVM+. Finally, we touch on the problem of choosing weights for weighted SVMs when privileged features are not available. Comment: 18 pages, 8 figures; integrated reviewer comments, improved typesettin
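
For orientation, the commonly used soft-margin formulation of an instance-weighted SVM is written out below. It is the standard textbook form, not necessarily the exact formulation of the paper, and the symbols ($c_i$ for the per-example weight, $C$ for the regularisation constant, $\xi_i$ for the slack variables) are assumptions made here.

```latex
\begin{aligned}
\min_{w,\,b,\,\xi}\quad & \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} c_i\,\xi_i \\
\text{subject to}\quad & y_i\,(w^{\top} x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1,\dots,n .
\end{aligned}
```

Setting every $c_i = 1$ recovers the ordinary soft-margin SVM; a larger $c_i$ makes a margin violation on example $i$ more expensive.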

    Positive Example Learning for Content-Based Recommendations: A Cost-Sensitive Learning-Based Approach

    Existing supervised learning techniques can support product recommendations but are ineffective in scenarios characterized by single-class learning; i.e., training samples consist of some positive examples and a much greater number of unlabeled examples. To address the limitations inherent in existing single-class learning techniques, we develop COst-sensitive Learning-based Positive Example Learning (COLPEL), which constructs an automated classifier from a single-class training sample. Our method employs cost-proportionate rejection sampling to derive, from unlabeled examples, a subset likely to feature negative examples, according to the respective misclassification costs. COLPEL follows a committee machine strategy, thereby constructing a set of automated classifiers used together to reduce probable biases common to a single classifier. We use customers’ book ratings from the Amazon.com Web site to evaluate COLPEL, with PNB and PEBL as benchmarks. Our results show that COLPEL outperforms both PNB and PEBL, as measured by its accuracy, positive F1 score, and negative F1 score.
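
The committee machine idea can be illustrated with plain majority voting over member classifiers. The abstract does not state COLPEL's actual combination rule, so the voting rule and the name `committee_predict` are assumptions made for this sketch.

```python
from collections import Counter


def committee_predict(members, x):
    """Return the label predicted by most members for input x.

    `members` is a list of callables mapping an input to a label.
    On a tie, the label that was voted for first wins.
    """
    if not members:
        raise ValueError("committee needs at least one member")
    votes = Counter(member(x) for member in members)
    label, _count = votes.most_common(1)[0]
    return label
```

With three toy members that threshold a score at 0, 1, and 2, an input of 1.5 gets two positive votes out of three and is labeled positive.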

    New models and methods for classification and feature selection. A mathematical optimization perspective

    The objective of this PhD dissertation is the development of new models for Supervised Classification and Benchmarking, making use of Mathematical Optimization and Statistical tools. In particular, we address the fusion of instruments from both disciplines, with the aim of extracting knowledge from data. In this way, we obtain innovative methodologies that improve on existing ones, bridging theoretical Mathematics and real-life problems. The work developed in this thesis focuses on two fundamental methodologies in Data Science: support vector machines (SVM) and Benchmarking. Regarding the first, the SVM classifier is based on the search for the separating hyperplane of maximum margin and is written as a convex quadratic problem. In the Benchmarking context, the goal is to calculate the different efficiencies through a non-parametric deterministic approach. In this thesis we focus on Data Envelopment Analysis (DEA), which consists of a Linear Programming formulation. This dissertation is structured as follows. In Chapter 1 we briefly present the different challenges this thesis addresses, as well as their state of the art. In the same vein, the formulations used as base models are presented, together with the notation used throughout the chapters of this thesis. In Chapter 2, we tackle the problem of constructing a version of the SVM that considers misclassification errors. To do this, we incorporate new performance constraints in the SVM formulation, imposing upper bounds on the misclassification errors. The resulting formulation is a convex quadratic problem with linear constraints. Chapter 3 continues with the SVM as the basis, and sets out the problem of providing not only a hard labeling for each of the individuals belonging to the dataset, but also a class probability estimate. Furthermore, confidence intervals for both the score values and the posterior class probabilities are provided.
In addition, as in the previous chapter, we carry the obtained results over to the setting in which misclassification errors are considered. For this purpose, we have to solve either a convex quadratic problem or a convex quadratic problem with linear constraints and integer variables, always taking advantage of the parameter tuning of the SVM, which is usually wasted. Based on the results in Chapter 2, in Chapter 4 we handle the problem of feature selection, again taking into account the misclassification errors. To build this technique, the feature selection is embedded in the classifier model. The process is divided into two steps. In the first step, feature selection is performed while, at the same time, the data are separated via a hyperplane or linear classifier, considering the performance constraints. In the second step, we build the maximum margin classifier (SVM) using the features selected in the first step, again taking into account the same performance constraints. In Chapter 5, we move to the problem of Benchmarking, where the practices of different entities are compared through the products or services they provide. This is done with the aim of making changes or improvements in each of them. Specifically, in this chapter we propose a Mixed Integer Linear Programming formulation based on Data Envelopment Analysis (DEA), with the aim of performing feature selection and improving the interpretability and comprehension of the obtained model and efficiencies. Finally, in Chapter 6 we collect the conclusions of this thesis as well as future lines of research.

    Classification and Decision-Theoretic Framework for Detecting and Reporting Unseen Falls

    Detecting falls is critical for an activity recognition system to ensure the well-being of an individual. However, falls occur rarely, so sufficient data for them may not be available when training the classifiers. Building a fall detection system in the absence of fall data is very challenging and can severely undermine the generalization capabilities of an activity recognition system. In this thesis, we present ideas from both classification and decision theory perspectives to handle scenarios in which training data for falls is not available. In traditional decision-theoretic approaches, the utilities (or conversely costs) of reporting or not reporting a fall or a non-fall are treated equally, or the costs are deduced from the datasets; both approaches are flawed. However, these costs are either difficult to compute or only available from domain experts. Therefore, in a typical fall detection system, we have neither a good model for falls nor an accurate estimate of utilities. In this thesis, we make contributions to handle both of these situations. In recent years, Hidden Markov Models (HMMs) have been used to model the temporal dynamics of human activities. HMMs are generally built for normal activities, and a threshold based on the log-likelihood of the training data is used to identify unseen falls. We show that such a formulation for identifying unseen fall activities is ill-posed for this problem. We present a new approach for the identification of falls using wearable devices in the absence of their training data but with plentiful data for normal Activities of Daily Living (ADL). We propose three 'X-Factor' Hidden Markov Model (XHMM) approaches, which are similar to traditional HMMs but have "inflated" output covariances (observation models).
To estimate the inflated covariances, we propose a novel cross-validation method to remove 'outliers' or deviant sequences from the ADL that serve as proxies for the unseen falls and allow the XHMMs to be learned using only normal activities. We tested the proposed XHMM approaches on three activity recognition datasets and show high detection rates for unseen falls. We also show that supervised classification methods perform poorly when very limited fall data is available during the training phase. We present a novel decision-theoretic approach to fall detection (dtFall) that aims to tackle the core problem when the model for falls and information about the costs/utilities associated with them are unavailable. We theoretically show that the expected regret will always be positive when using dtFall instead of a maximum likelihood classifier. We present a new method to parameterize unseen falls such that training situations with no fall data can be handled. We also identify problems with theoretical thresholding to identify falls using decision-theoretic modelling when training data for falls is absent, and present an empirical thresholding technique to handle imperfect models for falls and non-falls. We also develop a new cost model based on the severity of falls to provide an operational range of utilities. We present results on three activity recognition datasets, and show how the results may generalize to the difficult problem of fall detection in the real world. Under the condition that falls occur sporadically and rarely in the test set, the results show that (a) knowing the difference in cost between a reported fall and a false alarm is useful, (b) as the cost of a false alarm grows this becomes more significant, and (c) the difference in cost between a reported and a non-reported fall is not that useful.
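
To make the role of the costs concrete, the sketch below applies the textbook minimum-expected-cost rule for a binary report/do-not-report decision. It is not dtFall itself: it assumes a fall probability `p_fall` is already available and that correct decisions cost nothing, and the function and parameter names are hypothetical.

```python
def report_threshold(cost_false_alarm, cost_missed_fall):
    """Probability of a fall above which reporting has lower expected cost.

    Reporting costs (1 - p) * cost_false_alarm in expectation, and staying
    silent costs p * cost_missed_fall, so reporting is cheaper when
    p > cost_false_alarm / (cost_false_alarm + cost_missed_fall).
    """
    if cost_false_alarm < 0 or cost_missed_fall < 0:
        raise ValueError("costs must be non-negative")
    total = cost_false_alarm + cost_missed_fall
    if total == 0:
        raise ValueError("at least one cost must be positive")
    return cost_false_alarm / total


def should_report(p_fall, cost_false_alarm, cost_missed_fall):
    """Report a fall when its probability exceeds the cost-based threshold."""
    return p_fall > report_threshold(cost_false_alarm, cost_missed_fall)
```

Under this rule, raising the cost of a false alarm relative to a missed fall raises the threshold, so fewer events are reported.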