Quantification under class-conditional dataset shift
Classification is the estimation of the class of each instance in a dataset; quantification is the estimation of the number of instances of each class in a dataset. Quantification methods typically assume that the data being quantified has the same class-conditional distribution as the data on which the quantifier was trained. This thesis addresses the situation where this assumption cannot be made, i.e. where there is class-conditional dataset shift between the training data and the test data. The work was motivated by sentiment analysis tasks using tweets on Twitter: because users are selected based on the content of their tweets, they cannot be considered to have been drawn at random from the population. In this thesis, domain adaptation methods from classification are applied to the problem of quantification. Separating the data into explicit sub-domains and quantifying each sub-domain separately can increase quantification accuracy, but under certain conditions it can also decrease it. An expression for expected quantification error was derived in closed form under some simplifying assumptions; in tests on real datasets, a method based on this approach gave a modest improvement in quantification accuracy. Constructing a new feature representation has proved successful for domain adaptation in classification, and an approach using Stacked Denoising Autoencoders to generate a new feature representation gave a 3.3% relative improvement in quantification accuracy. Finally, a method based on using Kernel Mean Matching to weight instances in the training set gave a relative improvement in quantification accuracy of 10.7%. Experiments were conducted on publicly available datasets and on a custom dataset of Twitter users.
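To make the quantification task concrete, the following is a minimal Python sketch of a standard baseline, Adjusted Classify & Count (Forman, 2008), which corrects raw predicted class counts for classifier error. This is not the thesis's method; the logistic-regression model, cross-validation setup, and function name are illustrative assumptions.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict

    def adjusted_classify_and_count(X_train, y_train, X_test):
        """Estimate the proportion of class 1 in X_test (binary labels 0/1).

        Illustrative baseline only, not the thesis's method.
        """
        y_train = np.asarray(y_train)
        clf = LogisticRegression(max_iter=1000)
        # Cross-validated predictions on the training set give approximately
        # unbiased estimates of the true/false positive rates.
        y_cv = cross_val_predict(clf, X_train, y_train, cv=5)
        tpr = np.mean(y_cv[y_train == 1] == 1)
        fpr = np.mean(y_cv[y_train == 0] == 1)
        clf.fit(X_train, y_train)
        cc = np.mean(clf.predict(X_test) == 1)  # raw classify-and-count
        # Invert P(pred=1) = tpr*p + fpr*(1-p) to correct for classifier error.
        p = (cc - fpr) / max(tpr - fpr, 1e-12)
        return float(np.clip(p, 0.0, 1.0))

Note that this correction assumes only the class proportions shift between training and test data; the thesis's focus is the harder case where the class-conditional distributions themselves shift.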
The Law of Total Odds
The law of total probability may be deployed in binary classification exercises to estimate the unconditional class probabilities if the class proportions in the training set are not representative of the population class proportions. We argue that this is not a conceptually sound approach and suggest an alternative based on the new law of total odds. We quantify the bias of the total probability estimator of the unconditional class probabilities and show that the total odds estimator is unbiased. The sample version of the total odds estimator is shown to coincide with a maximum-likelihood estimator known from the literature. The law of total odds can also be used for transforming the conditional class probabilities if independent estimates of the unconditional class probabilities of the population are available.
Keywords: total probability, likelihood ratio, Bayes' formula, binary classification, relative odds, unbiased estimator, supervised learning, dataset shift.
Comment: 12 pages, 1 figure, new reference
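For concreteness, the contrast drawn in the abstract can be sketched as follows; the notation here is assumed, not taken from the paper. Let \hat{\eta}(x) be the estimated P(Y=1 | X=x) learned on training data with class-1 proportion p_0, and let p be the unknown population proportion. The total-probability estimator averages conditional probabilities over a test sample of size n, while Bayes' formula on the odds scale shows how the conditional class probabilities would be transformed given an independent estimate of p:

    % Total-probability estimator of the unconditional class probability
    % (the estimator the abstract argues is biased under dataset shift):
    \hat{p}_{\mathrm{TP}} = \frac{1}{n} \sum_{i=1}^{n} \hat{\eta}(x_i)

    % Prior correction of the conditional class probabilities via Bayes'
    % formula on the odds scale, given an independent estimate of p:
    \frac{\eta_p(x)}{1 - \eta_p(x)}
      = \frac{p/(1-p)}{p_0/(1-p_0)} \cdot \frac{\eta(x)}{1 - \eta(x)}

Per the abstract, averaging on the probability scale as in \hat{p}_{\mathrm{TP}} is biased when p differs from p_0, whereas the odds-based estimator is unbiased and its sample version coincides with a maximum-likelihood estimator known from the literature.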