
    A review on quantification learning

    The task of quantification consists in providing an aggregate estimate (e.g., the class distribution in a classification problem) for unseen test sets, applying a model trained on a training set with a different data distribution. Several real-world applications demand methods of this kind, which do not require predictions for individual examples and focus instead on obtaining accurate estimates at an aggregate level. Over the past few years, several quantification methods have been proposed from different perspectives and with different goals. This paper presents a unified review of the main approaches, with the aim of serving as an introductory tutorial for newcomers to the field.
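    As a sketch of the aggregate estimation this review covers, here are two standard baselines from the quantification literature, Classify & Count (CC) and Adjusted Classify & Count (ACC). All function names and numbers below are illustrative, not taken from the paper:

    ```python
    def classify_and_count(predictions):
        """CC: estimated positive prevalence = fraction predicted positive."""
        return sum(predictions) / len(predictions)

    def adjusted_count(predictions, tpr, fpr):
        """ACC: correct CC with the classifier's true/false positive rates
        (estimated on validation data): p = (cc - fpr) / (tpr - fpr)."""
        cc = classify_and_count(predictions)
        p = (cc - fpr) / (tpr - fpr)
        return min(1.0, max(0.0, p))  # clip to a valid prevalence

    # Hypothetical test-set predictions (1 = predicted positive):
    preds = [1, 0, 1, 1, 0, 0, 0, 1, 0, 0]
    print(classify_and_count(preds))                # 0.4
    print(adjusted_count(preds, tpr=0.8, fpr=0.1))  # (0.4 - 0.1) / 0.7 ≈ 0.4286
    ```

    Note how ACC only needs the aggregate count plus validation-set error rates, never per-example ground truth on the test set, which is the point of quantification.
    
    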

    Continuous Sweep: an improved, binary quantifier

    Quantification is a supervised machine learning task focused on estimating the class prevalence of a dataset rather than labeling its individual observations. We introduce Continuous Sweep, a new parametric binary quantifier inspired by the well-performing Median Sweep. Median Sweep is currently one of the best binary quantifiers, but we modify it in three ways, namely 1) using parametric class distributions instead of empirical distributions, 2) optimizing decision boundaries instead of applying discrete decision rules, and 3) calculating the mean instead of the median. We derive analytic expressions for the bias and variance of Continuous Sweep under general model assumptions, one of the first theoretical contributions in the field of quantification learning. Moreover, these derivations enable us to find the optimal decision boundaries. Finally, our simulation study shows that Continuous Sweep outperforms Median Sweep in a wide range of situations.
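    The Median Sweep idea that Continuous Sweep refines can be sketched as follows: run the ACC correction at many decision thresholds and take the median of the resulting estimates. The per-threshold (tpr, fpr) pairs would come from validation data; the values here are hypothetical, and Continuous Sweep's parametric distributions and optimized boundaries are not reproduced:

    ```python
    import statistics

    def acc_estimate(scores, threshold, tpr, fpr):
        """ACC at one threshold: corrected prevalence (cc - fpr) / (tpr - fpr)."""
        cc = sum(s >= threshold for s in scores) / len(scores)
        denom = tpr - fpr
        if abs(denom) < 1e-6:  # skip thresholds where the correction blows up
            return None
        return min(1.0, max(0.0, (cc - fpr) / denom))

    def median_sweep(scores, rates_by_threshold):
        """Median Sweep: ACC at many thresholds, median of the estimates."""
        estimates = [acc_estimate(scores, t, tpr, fpr)
                     for t, (tpr, fpr) in rates_by_threshold.items()]
        return statistics.median(e for e in estimates if e is not None)

    # Hypothetical classifier scores and validation-estimated rates:
    scores = [0.9, 0.8, 0.75, 0.6, 0.55, 0.4, 0.35, 0.3, 0.2, 0.1]
    rates = {0.3: (0.95, 0.30), 0.5: (0.85, 0.15), 0.7: (0.60, 0.05)}
    print(median_sweep(scores, rates))  # 0.5
    ```

    Replacing the median over discrete thresholds with a mean over optimized, continuous decision boundaries is what makes Continuous Sweep amenable to the bias and variance analysis the abstract describes.
    
    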

    Dataset Shift and the Adjustment of Probabilistic Classifiers

    Classification is the machine learning problem of assigning a class to a given instance of data defined by a set of features. Probabilistic classification is the stricter problem of assigning probabilities to each possible class given an instance, indicating the classifier's confidence in each class being correct for the given instance.
    The underlying assumption of classical machine learning is that any instance used to train or test the classifier is sampled independently and identically distributed from the same joint probability distribution of features and labels. This, however, is very unlikely in real-world applications, as the distribution of data frequently changes over time. The change in the distribution of data between the time of training the classifier and a future point in the classifier's life cycle (testing, deployment, etc.) is known as dataset shift.
    In this thesis, a novel procedure is presented that improves the performance of a probabilistic classifier experiencing any pattern of shift that causes the class distribution to change, a property most patterns of shift share. This new technique is based on adjustment, the process of matching the probabilistic classifier's expected output to the class distribution of the data. Previous work has shown that adjustment reduces expected loss for mean squared error and KL divergence. These two loss functions belong to a wider family of loss functions called proper scoring rules.
    The proposed procedure is termed general adjustment, since it reduces expected loss for all proper scoring rules. It comes in two varieties, unbounded and bounded. Unbounded general adjustment gives results equivalent to the previously described adjustment procedures for mean squared error and KL divergence. Bounded general adjustment is a further refinement, reducing expected loss as much as or more than its unbounded form. Both are convex minimization tasks and are therefore computationally efficient.
    A series of experiments shows that bounded general adjustment reduces loss in a practical setting, where the exact value of the new class distribution may not be known. Even with moderate error in the estimated class distribution, bounded general adjustment still reduces loss in most cases.
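    One standard way to "match the classifier's expected output to the class distribution" is the classic prior-shift correction: reweight each posterior by the ratio of new to training class priors and renormalize. The thesis's unbounded/bounded general adjustment instead solves a convex minimization, which this sketch does not implement; all values below are hypothetical:

    ```python
    def adjust_posteriors(probs, train_priors, new_priors):
        """Rescale posterior probabilities toward a new class distribution:
        weight each class probability by new_prior / train_prior, then
        renormalize each instance's probabilities to sum to 1."""
        adjusted = []
        for p in probs:
            w = [pi * new_priors[k] / train_priors[k] for k, pi in enumerate(p)]
            z = sum(w)
            adjusted.append([wi / z for wi in w])
        return adjusted

    # Hypothetical 2-class posteriors; training prior 50/50, new prior 80/20:
    out = adjust_posteriors([[0.5, 0.5], [0.9, 0.1]], [0.5, 0.5], [0.8, 0.2])
    print(out[0])  # [0.8, 0.2]
    ```

    An uninformative 50/50 posterior is pulled all the way to the new prior, while a confident posterior moves less, which is the qualitative behavior any adjustment procedure of this kind should exhibit.
    
    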