
    Aggregative quantification for regression

    The final publication is available at Springer via http://dx.doi.org/10.1007/s10618-013-0308-z.

    The problem of estimating the class distribution (or prevalence) of a new unlabelled dataset (drawn from a possibly different distribution) is a very common problem that has been addressed in one way or another over the past decades. It has recently been reconsidered as a new data mining task, renamed quantification, when the estimate is obtained by aggregating (and possibly adjusting) the outputs of a single-instance supervised model (e.g., a classifier). However, the study of quantification has been limited to classification, even though the problem also appears, perhaps even more frequently, in other predictive settings such as regression. In that case, the goal is to determine the distribution, or some aggregated indicator, of the output variable for a new unlabelled dataset. In this paper, we introduce a comprehensive new taxonomy of quantification tasks, distinguishing between the estimation of the whole distribution and the estimation of some indicators (summary statistics), for both classification and regression. This distinction is especially useful for regression, since predictions are numerical values that can be aggregated in many different ways, as in multi-dimensional hierarchical data warehouses. We focus on aggregative quantification for regression and show that the approaches borrowed from classification do not work. We present several techniques based on segmentation that are able to produce accurate estimates of the expected value and of the distribution of the output variable. We show experimentally that these methods excel especially in the relevant scenarios where the training and test distributions differ dramatically.

    We would like to thank the anonymous reviewers for their careful reviews, insightful comments and very useful suggestions. This work was supported by the MEC/MINECO projects CONSOLIDER-INGENIO CSD2007-00022 and TIN2010-21062-C02-02, the GVA project PROMETEO/2008/051, the COST Action IC0801 (European Cooperation in the field of Scientific and Technical Research), and the REFRAME project, granted by the European Coordinated Research on Long-term Challenges in Information and Communication Sciences & Technologies ERA-Net (CHIST-ERA) and funded by the Ministerio de Economía y Competitividad in Spain.

    Bella Sanjuán, A.; Ferri Ramírez, C.; Hernández Orallo, J.; Ramírez Quintana, M. J. (2014). Aggregative quantification for regression. Data Mining and Knowledge Discovery 28(2):475-518. https://doi.org/10.1007/s10618-013-0308-z
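    A minimal sketch of the naive aggregative baseline the paper takes as its starting point (not the authors' segmentation-based methods): train an ordinary regressor on labelled data and estimate the expected value of the output variable on an unlabelled dataset by averaging its predictions. The data generator and model below are illustrative assumptions; the point is only to make the "aggregate a single-instance model" idea concrete.

```python
# Naive aggregative baseline for regression quantification (a sketch,
# not the paper's segmentation-based methods): average the predictions
# of a single-instance regressor over the unlabelled test set to
# estimate the expected value of the output variable there.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=2000, n_features=10, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# Aggregative estimate of the test-set mean of the output variable.
estimated_mean = model.predict(X_test).mean()
true_mean = y_test.mean()  # unknown in a real quantification setting
print(f"estimated E[y]: {estimated_mean:.2f}   true E[y]: {true_mean:.2f}")
```

    The paper argues that this kind of plain aggregation breaks down precisely when the training and test distributions differ, which is the scenario its segmentation-based techniques target.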

    Online Optimization Methods for the Quantification Problem

    The estimation of class prevalence, i.e., the fraction of a population that belongs to a certain class, is a very useful tool in data analytics and learning, and finds applications in many domains such as sentiment analysis and epidemiology. For example, in sentiment analysis the objective is often not to estimate whether a specific text conveys a positive or a negative sentiment, but rather to estimate the overall distribution of positive and negative sentiments during an event window. A popular way of performing this task, often dubbed quantification, is to use supervised learning to train a prevalence estimator from labeled data. The contemporary literature cites several performance measures used to assess the success of such prevalence estimators. In this paper we propose the first online stochastic algorithms for directly optimizing these quantification-specific performance measures. We also provide algorithms that optimize hybrid performance measures seeking to balance quantification and classification performance. Our algorithms represent a significant advancement in the theory of multivariate optimization, and we show, by a rigorous theoretical analysis, that they exhibit optimal convergence. We also report extensive experiments on benchmark and real data sets which demonstrate that our methods significantly outperform existing optimization techniques used for these performance measures.

    Comment: 26 pages, 6 figures. A short version of this manuscript will appear in the proceedings of the 22nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2016).
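    As an illustration of what a quantification-specific performance measure looks like, the sketch below computes the Kullback-Leibler divergence between true and predicted class prevalences, one of the measures commonly used in the quantification literature; the exact set of measures the paper optimizes, and its online algorithms, are described in the paper itself.

```python
# Kullback-Leibler divergence between a true and a predicted prevalence
# vector: a standard way of scoring a prevalence estimator (illustrative
# only; not the paper's optimization algorithms).
import numpy as np

def kld(true_prev, pred_prev, eps=1e-12):
    """KL divergence between two discrete prevalence vectors."""
    p = np.clip(np.asarray(true_prev, dtype=float), eps, 1.0)
    q = np.clip(np.asarray(pred_prev, dtype=float), eps, 1.0)
    p, q = p / p.sum(), q / q.sum()   # renormalize after clipping
    return float(np.sum(p * np.log(p / q)))

# e.g. true positive-class prevalence 0.30 vs an estimator that says 0.25
print(kld([0.30, 0.70], [0.25, 0.75]))
```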

    Multi-Label Quantification

    The work of A. Moreo and F. Sebastiani has been supported by the SoBigData++ project, funded by the European Commission (Grant 871042) under the H2020 Programme INFRAIA-2019-1, by the AI4Media project, funded by the European Commission (Grant 951911) under the H2020 Programme ICT-48-2020, and by the SoBigData.it and FAIR projects, funded by the Italian Ministry of University and Research under the NextGenerationEU program; the authors' opinions do not necessarily reflect those of the funding agencies. The work of M. Francisco has been supported by the FPI 2017 predoctoral programme of the Spanish Ministry of Economy and Competitiveness (MINECO), grant BES-2017-081202.

    Quantification, variously called supervised prevalence estimation or learning to quantify, is the supervised learning task of generating predictors of the relative frequencies (a.k.a. prevalence values) of the classes of interest in unlabelled data samples. While many quantification methods have been proposed in the past for binary problems and, to a lesser extent, single-label multiclass problems, the multi-label setting (i.e., the scenario in which the classes of interest are not mutually exclusive) remains by and large unexplored. A straightforward solution to the multi-label quantification problem could simply consist of recasting the problem as a set of independent binary quantification problems. Such a solution is simple but naïve, since the independence assumption upon which it rests is, in most cases, not satisfied. In these cases, knowing the relative frequency of one class could be of help in determining the prevalence of other related classes. We propose the first truly multi-label quantification methods, i.e., methods for inferring estimators of class prevalence values that strive to leverage the stochastic dependencies among the classes of interest in order to predict their relative frequencies more accurately. We show empirical evidence that natively multi-label solutions outperform the naïve approaches by a large margin. The code to reproduce all our experiments is available online.
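    The naive baseline the abstract refers to can be made concrete with a short sketch: treat each label as an independent binary quantification problem and estimate its prevalence by "classify and count". The dataset and classifier below are illustrative assumptions, not the authors' natively multi-label methods.

```python
# Naive multi-label quantification baseline (a sketch): one binary
# classifier per label, prevalence estimated as the fraction of test
# items predicted positive for that label. The authors' natively
# multi-label methods aim to improve on exactly this.
import numpy as np
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier

X, Y = make_multilabel_classification(n_samples=2000, n_classes=5, random_state=0)
X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.5, random_state=0)

clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X_tr, Y_tr)

pred_prevalence = clf.predict(X_te).mean(axis=0)   # one estimate per label
true_prevalence = Y_te.mean(axis=0)                # unknown at deployment time
print("predicted:", np.round(pred_prevalence, 3))
print("true:     ", np.round(true_prevalence, 3))
```

    Because the per-label estimates above are produced independently, any dependency between labels is ignored, which is the gap the proposed methods exploit.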

    Cliophysics: Socio-political Reliability Theory, Polity Duration and African Political (In)stabilities

    Quantification of historical sociological processes has recently gained attention among theoreticians in the effort to provide a solid theoretical understanding of the behaviors and regularities present in sociopolitical dynamics. Here we present a reliability theory of polity processes, with emphasis on the individual political dynamics of African countries. We find that the structural properties of polity failure rates successfully capture the risk of political vulnerability and instability: 87.50%, 75%, 71.43%, and 0% of the countries with monotonically increasing, unimodal, U-shaped, and monotonically decreasing polity failure rates, respectively, have high state fragility indices. The quasi-U-shaped relationship between average polity duration and regime type corroborates historical precedent and explains the stability of autocracies and democracies.

    Comment: 4 pages, 3 figures, 1 table
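    The reliability-theoretic quantity the abstract builds on, the polity failure (hazard) rate, can be sketched generically as the discrete-time estimate h(t) = (failures at t) / (polities still at risk at t). The durations below are made up for illustration; the paper's data, censoring treatment, and estimator may differ.

```python
# Generic discrete-time empirical hazard (failure) rate from a set of
# polity durations, ignoring censoring: h(t) = failed at t / at risk at t.
# Illustrative sketch only; not the paper's dataset or exact estimator.
import numpy as np

def empirical_hazard(durations):
    """Return time points and h(t) = failures at t / units at risk at t."""
    durations = np.asarray(durations)
    ts = np.arange(1, durations.max() + 1)
    hazard = []
    for t in ts:
        at_risk = np.sum(durations >= t)
        failed = np.sum(durations == t)
        hazard.append(failed / at_risk if at_risk else np.nan)
    return ts, np.array(hazard)

# Hypothetical polity lifetimes in years.
ts, h = empirical_hazard([3, 5, 5, 8, 12, 12, 15, 20, 20, 20])
for t, ht in zip(ts, h):
    if ht > 0:
        print(f"t={t:2d}  hazard={ht:.2f}")
```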