3,057 research outputs found
Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction
For large, real-world inductive learning problems, the number of training
examples often must be limited due to the costs associated with procuring,
preparing, and storing the training examples and/or the computational costs
associated with learning from them. In such circumstances, one question of
practical importance is: if only n training examples can be selected, in what
proportion should the classes be represented? In this article we help to answer
this question by analyzing, for a fixed training-set size, the relationship
between the class distribution of the training data and the performance of
classification trees induced from these data. We study twenty-six data sets
and, for each, determine the best class distribution for learning. The
naturally occurring class distribution is shown to generally perform well when
classifier performance is evaluated using undifferentiated error rate (0/1
loss). However, when the area under the ROC curve is used to evaluate
classifier performance, a balanced distribution is shown to perform well. Since
neither of these choices for class distribution always generates the
best-performing classifier, we introduce a budget-sensitive progressive
sampling algorithm for selecting training examples based on the class
associated with each example. An empirical analysis of this algorithm shows
that the class distribution of the resulting training set yields classifiers
with good (nearly-optimal) classification performance.
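The budget-sensitive selection loop described above can be sketched roughly as follows. All names, the toy evaluator, and the tie-breaking behavior are hypothetical illustrations; the paper's actual algorithm differs in detail:

```python
def progressive_sample(pools, budget, batch, evaluate):
    """Budget-sensitive progressive sampling (illustrative sketch).

    pools:    dict mapping class label -> list of unused examples
    budget:   total number of training examples we may select
    batch:    number of examples requested per step
    evaluate: callable(training_set) -> score, higher is better
    """
    training = []
    while len(training) < budget:
        best_label, best_score = None, float("-inf")
        # Tentatively extend the training set with a batch from each
        # class and keep the class whose batch scores best.
        for label, pool in pools.items():
            if not pool:
                continue
            score = evaluate(training + pool[:batch])
            if score > best_score:
                best_label, best_score = label, score
        if best_label is None:
            break  # every class pool is exhausted
        take = min(batch, budget - len(training), len(pools[best_label]))
        training.extend(pools[best_label][:take])
        del pools[best_label][:take]
    return training

# Toy run: the evaluator rewards balanced class counts, so with a
# budget of 8 the selected set ends up with 4 positives, 4 negatives.
pools = {"pos": [("pos", i) for i in range(10)],
         "neg": [("neg", i) for i in range(10)]}

def balance_score(train):
    pos = sum(1 for label, _ in train if label == "pos")
    return -abs(2 * pos - len(train))

selected = progressive_sample(pools, budget=8, batch=2, evaluate=balance_score)
```

In practice `evaluate` would train and score a classifier (e.g., a tree) on a held-out set, which is far more expensive than this toy balance heuristic.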
Robust Place Categorization With Deep Domain Generalization
Traditional place categorization approaches in robot vision assume that training and test images have similar visual appearance, so any seasonal, illumination, or environmental change typically leads to severe degradation in performance. To cope with this problem, recent works have adopted domain adaptation techniques. While effective, these methods assume that some prior information about the scenario where the robot will operate is available at training time. Unfortunately, in many cases this assumption does not hold, as we often do not know where a robot will be deployed. To overcome this issue, in this paper we present an approach that aims at learning classification models able to generalize to unseen scenarios. Specifically, we propose a novel deep learning framework for domain generalization. Our method develops from the intuition that, given a set of different classification models associated with known domains (e.g., corresponding to multiple environments or robots), the best model for a new sample from a novel domain can be computed directly at test time by optimally combining the known models. To implement our idea, we exploit recent advances in deep domain adaptation and design a convolutional neural network architecture with novel layers performing a weighted version of batch normalization. Our experiments, conducted on three common datasets for robot place categorization, confirm the validity of our contribution.
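The core idea of combining per-domain normalizations can be sketched as follows. This is a minimal, hypothetical illustration of a weighted batch-normalization step, not the paper's actual layer; the function name, the equal weights, and the unit variances are all assumptions:

```python
import math

def weighted_batch_norm(x, domain_stats, weights, eps=1e-5):
    """Normalize features with a convex combination of per-domain
    batch statistics (illustrative sketch, not the paper's layer).

    domain_stats: list of (mean, variance) pairs, one per known domain
    weights:      convex weights, e.g. a softmax over estimated
                  domain-membership scores for the test sample
    """
    assert abs(sum(weights) - 1.0) < 1e-9  # weights must be convex
    return [sum(w * (xi - mu) / math.sqrt(var + eps)
                for (mu, var), w in zip(domain_stats, weights))
            for xi in x]

# A test sample weighted equally between two known domains whose
# feature means are 0 and 2: it is "surprising" (one standard
# deviation out) for the first domain, typical for the second.
normalized = weighted_batch_norm([2.0], [(0.0, 1.0), (2.0, 1.0)], [0.5, 0.5])
```

In a real network the per-domain (mean, variance) pairs would be running batch statistics collected during training, and the weights would be predicted per test sample.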
Asymptotic Analysis of Generative Semi-Supervised Learning
Semi-supervised learning has emerged as a popular framework for improving
modeling accuracy while controlling labeling cost. Based on an extension of
stochastic composite likelihood we quantify the asymptotic accuracy of
generative semi-supervised learning. In doing so, we complement
distribution-free analysis by providing an alternative framework to measure the
value associated with different labeling policies and resolve the fundamental
question of how much data to label and in what manner. We demonstrate our
approach with both simulation studies and real world experiments using naive
Bayes for text classification and MRFs and CRFs for structured prediction in
NLP.
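As a concrete instance of generative semi-supervised learning, the labeled and unlabeled data can be pooled through EM. The sketch below is a deliberately minimal one-dimensional, two-class Gaussian naive Bayes with unit variances and equal priors assumed for brevity; it illustrates the training paradigm being analyzed, not the paper's asymptotic framework:

```python
import math

def gaussian_pdf(x, mu, sigma=1.0):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def semi_supervised_nb(labeled, unlabeled, iters=10):
    """EM for a 1-D, two-class Gaussian naive Bayes model that uses
    both labeled pairs (x, y) and unlabeled points x (sketch only)."""
    # Initialize class means from the labeled examples alone.
    mu = [sum(x for x, y in labeled if y == c) /
          max(1, sum(1 for _, y in labeled if y == c)) for c in (0, 1)]
    for _ in range(iters):
        # E-step: posterior responsibility of class 1 for each unlabeled x.
        resp = []
        for x in unlabeled:
            p0, p1 = gaussian_pdf(x, mu[0]), gaussian_pdf(x, mu[1])
            resp.append(p1 / (p0 + p1))
        # M-step: re-estimate means from labeled + softly labeled data.
        for c in (0, 1):
            num = sum(x for x, y in labeled if y == c)
            den = sum(1 for _, y in labeled if y == c)
            for x, r in zip(unlabeled, resp):
                w = r if c == 1 else 1.0 - r
                num += w * x
                den += w
            mu[c] = num / den
    return mu

# One labeled example per class plus four unlabeled points: the
# unlabeled data sharpen the estimated class means around -2 and +2.
mu = semi_supervised_nb(labeled=[(-2.0, 0), (2.0, 1)],
                        unlabeled=[-1.8, -2.2, 1.9, 2.1])
```

The trade-off the abstract analyzes — how much data to label, and how — corresponds here to choosing how many points go into `labeled` versus `unlabeled`.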
Training and assessing classification rules with unbalanced data
The problem of modeling binary responses by using cross-sectional data has been addressed
with a number of satisfying solutions that draw on both parametric and nonparametric
methods. However, there exist many real situations where one of the two responses (usually
the most interesting for the analysis) is rare. It has been largely reported that this class
imbalance heavily compromises the process of learning, because the model tends to focus on
the prevalent class and to ignore the rare events. However, not only the estimation of the
classification model is affected by a skewed distribution of the classes, but also the evaluation
of its accuracy is jeopardized, because the scarcity of data leads to poor estimates of the
model’s accuracy.
In this work, the effects of class imbalance on model training and model assessing are
discussed. Moreover, a unified and systematic framework for dealing with both problems is proposed, based on a smoothed bootstrap re-sampling technique.
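The smoothed bootstrap idea can be sketched for over-sampling the rare class as follows; the function name, the Gaussian kernel, and the fixed bandwidth are illustrative assumptions, not the paper's exact procedure:

```python
import random

def smoothed_bootstrap(minority, n_new, bandwidth=0.1, seed=0):
    """Smoothed bootstrap over-sampling of the rare class (sketch).

    Each synthetic example is a minority example drawn with
    replacement and perturbed by Gaussian kernel noise of scale
    `bandwidth`, so new points come from a kernel-density smoothing
    of the rare class rather than being exact duplicates.
    """
    rng = random.Random(seed)
    return [[v + rng.gauss(0.0, bandwidth) for v in rng.choice(minority)]
            for _ in range(n_new)]

# Over-sample a 3-example rare class up to 20 synthetic points.
rare = [[0.1, 1.0], [0.2, 0.9], [0.15, 1.1]]
synthetic = smoothed_bootstrap(rare, n_new=20)
```

A plain bootstrap corresponds to `bandwidth=0`; the smoothing is what lets the same idea also support less biased accuracy estimation on scarce data.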
Estimation of soil types by non-linear analysis of remote sensing data
Knowledge of soil type and soil texture is crucial for environmental monitoring and risk assessment. Unfortunately, mapping them using classical techniques is time consuming and costly. We present here a way to estimate soil types based on limited field observations and remote sensing data. Because the relation between soil types and the attributes extracted from remote sensing data is expected to be non-linear, we apply Support Vector Machines (SVMs) for soil type classification. Special attention is drawn to different training site distributions and the choice of input variables. We show that SVMs based on carefully selected input variables prove to be an appropriate method for soil type estimation.
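The paper trains SVMs (typically via an existing library). As a self-contained stand-in, the sketch below uses a kernel perceptron with an RBF kernel — a different, simpler learner that illustrates the same non-linear, kernel-based decision boundary; the toy "soil" features and labels are made up:

```python
import math

def rbf(u, v, gamma=1.0):
    """Gaussian (RBF) kernel, the usual choice for non-linear SVMs."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))

def train_kernel_perceptron(X, y, epochs=20, gamma=1.0):
    """Dual-form perceptron: alpha[i] counts mistakes on example i."""
    alpha = [0.0] * len(X)
    for _ in range(epochs):
        for i, xi in enumerate(X):
            f = sum(a * yj * rbf(xj, xi, gamma)
                    for a, xj, yj in zip(alpha, X, y))
            if y[i] * f <= 0:          # mistake: reinforce this example
                alpha[i] += 1.0
    return alpha

def predict(alpha, X, y, x, gamma=1.0):
    f = sum(a * yj * rbf(xj, x, gamma) for a, xj, yj in zip(alpha, X, y))
    return 1 if f >= 0 else -1

# XOR-style toy data: not linearly separable, so a linear classifier
# fails, but the RBF kernel separates the two "soil types".
X = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
y = [-1, -1, 1, 1]
alpha = train_kernel_perceptron(X, y)
preds = [predict(alpha, X, y, x) for x in X]
```

A real pipeline would instead fit a soft-margin SVM (e.g., scikit-learn's `SVC` with an RBF kernel) on the remote-sensing attributes, with the kernel width and regularization tuned by cross-validation.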