Aggregative quantification for regression
The final publication is available at Springer via http://dx.doi.org/10.1007/s10618-013-0308-z
The problem of estimating the class distribution (or prevalence) for a new unlabelled dataset (from a possibly different distribution) is a very common problem which has been addressed in one way or another in the past decades. This problem has been recently reconsidered as a new task in data mining, renamed quantification when the estimation is performed as an aggregation (and possible adjustment) of a single-instance supervised model (e.g., a classifier). However, the study of quantification has been limited to classification, while it is clear that this problem also appears, perhaps even more frequently, with other predictive problems, such as regression. In this case, the goal is to determine a distribution or an aggregated indicator of the output variable for a new unlabelled dataset. In this paper, we introduce a comprehensive new taxonomy of quantification tasks, distinguishing between the estimation of the whole distribution and the estimation of some indicators (summary statistics), for both classification and regression. This distinction is especially useful for regression, since predictions are numerical values that can be aggregated in many different ways, as in multi-dimensional hierarchical data warehouses. We focus on aggregative quantification for regression and see that the approaches borrowed from classification do not work. We present several techniques based on segmentation which are able to produce accurate estimations of the expected value and the distribution of the output variable. We show experimentally that these methods especially excel for the relevant scenarios where training and test distributions dramatically differ.
We would like to thank the anonymous reviewers for their careful reviews, insightful comments and very useful suggestions.
This work was supported by the MEC/MINECO projects CONSOLIDER-INGENIO CSD2007-00022 and TIN 2010-21062-C02-02, GVA project PROMETEO/2008/051, the COST-European Cooperation in the field of Scientific and Technical Research IC0801 AT, and the REFRAME project granted by the European Coordinated Research on Long-term Challenges in Information and Communication Sciences & Technologies ERA-Net (CHIST-ERA), and funded by the Ministerio de Economia y Competitividad in Spain.
Bella Sanjuán, A.; Ferri Ramírez, C.; Hernández Orallo, J.; Ramírez Quintana, MJ. (2014). Aggregative quantification for regression. Data Mining and Knowledge Discovery. 28(2):475-518. https://doi.org/10.1007/s10618-013-0308-z
Online Optimization Methods for the Quantification Problem
The estimation of class prevalence, i.e., the fraction of a population that
belongs to a certain class, is a very useful tool in data analytics and
learning, and finds applications in many domains such as sentiment analysis,
epidemiology, etc. For example, in sentiment analysis, the objective is often
not to estimate whether a specific text conveys a positive or a negative
sentiment, but rather estimate the overall distribution of positive and
negative sentiments during an event window. A popular way of performing the
above task, often dubbed quantification, is to use supervised learning to train
a prevalence estimator from labeled data.
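As a rough illustration of what a prevalence estimator does, the sketch below shows two standard baselines from the quantification literature, classify-and-count and its adjusted variant. These are not the online algorithms proposed in this paper; the function names, the data, and the true/false positive rates are hypothetical values chosen for the example.

```python
def classify_and_count(predictions):
    """Estimate prevalence as the fraction of items predicted positive (labels are 0/1)."""
    return sum(predictions) / len(predictions)


def adjusted_count(predictions, tpr, fpr):
    """Correct the raw count using the classifier's true and false positive rates.

    Solves cc = p * tpr + (1 - p) * fpr for the prevalence p, then clips to [0, 1].
    """
    cc = classify_and_count(predictions)
    if tpr == fpr:
        # The correction is undefined when the classifier is uninformative.
        return cc
    p = (cc - fpr) / (tpr - fpr)
    return min(1.0, max(0.0, p))


# Hypothetical classifier output on 100 unlabelled texts: 38 predicted positive.
predictions = [1] * 38 + [0] * 62
# Hypothetical rates, assumed to have been measured on held-out labelled data.
tpr, fpr = 0.8, 0.1

raw_estimate = classify_and_count(predictions)
corrected_estimate = adjusted_count(predictions, tpr, fpr)
```

With these made-up numbers the raw count is 0.38, while the corrected estimate is about 0.40, which is the prevalence that would produce 38 predicted positives under the assumed error rates.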
Contemporary literature cites several performance measures used to measure
the success of such prevalence estimators. In this paper we propose the first
online stochastic algorithms for directly optimizing these
quantification-specific performance measures. We also provide algorithms that
optimize hybrid performance measures that seek to balance quantification and
classification performance. Our algorithms present a significant advancement in
the theory of multivariate optimization and we show, by a rigorous theoretical
analysis, that they exhibit optimal convergence. We also report extensive
experiments on benchmark and real data sets which demonstrate that our methods
significantly outperform existing optimization techniques used for these
performance measures.
Comment: 26 pages, 6 figures. A short version of this manuscript will appear
in the proceedings of the 22nd ACM SIGKDD Conference on Knowledge Discovery
and Data Mining, KDD 201
Multi-Label Quantification
The work of A. Moreo and F. Sebastiani has been supported by the SoBigData++ project, funded
by the European Commission (Grant 871042) under the H2020 Programme INFRAIA-2019-1, by
the AI4Media project, funded by the European Commission (Grant 951911) under the H2020
Programme ICT-48-2020, and by the SoBigData.it and FAIR projects funded by the Italian Ministry
of University and Research under the NextGenerationEU program; the authors’ opinions do not
necessarily reflect those of the funding agencies. The work of M. Francisco has been supported by
the FPI 2017 predoctoral programme, from the Spanish Ministry of Economy and Competitiveness
(MINECO), grant BES-2017-081202.
Quantification, variously called supervised prevalence estimation or learning to quantify, is the supervised
learning task of generating predictors of the relative frequencies (a.k.a. prevalence values) of the classes of
interest in unlabelled data samples. While many quantification methods have been proposed in the past
for binary problems and, to a lesser extent, single-label multiclass problems, the multi-label setting (i.e.,
the scenario in which the classes of interest are not mutually exclusive) remains by and large unexplored.
A straightforward solution to the multi-label quantification problem could simply consist of recasting the
problem as a set of independent binary quantification problems. Such a solution is simple but naïve, since
the independence assumption upon which it rests is, in most cases, not satisfied. In these cases, knowing
the relative frequency of one class could be of help in determining the prevalence of other related classes.
We propose the first truly multi-label quantification methods, i.e., methods for inferring estimators of class
prevalence values that strive to leverage the stochastic dependencies among the classes of interest in order
to predict their relative frequencies more accurately. We show empirical evidence that natively multi-label
solutions outperform the naïve approaches by a large margin. The code to reproduce all our experiments is
available online.
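The simple baseline described in this abstract, which treats each class as its own binary quantification problem, can be sketched as follows. This is only the independent-binary approach the authors argue against, not their proposed multi-label methods; the function name, class names, and predicted label sets are hypothetical.

```python
def naive_multilabel_prevalence(predicted_labels, classes):
    """Estimate each class prevalence independently by counting predicted members.

    predicted_labels: one set of class names per item (classes may co-occur).
    Dependencies among classes are ignored entirely.
    """
    n = len(predicted_labels)
    return {
        c: sum(1 for labels in predicted_labels if c in labels) / n
        for c in classes
    }


# Hypothetical predictions of a multi-label classifier on four documents.
predicted = [{"sports"}, {"sports", "politics"}, {"politics"}, set()]
classes = ["sports", "politics", "economy"]

prevalence = naive_multilabel_prevalence(predicted, classes)
```

Because the classes are not mutually exclusive, the estimated prevalence values need not sum to one.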
Systematic analysis of the Hippo pathway organization and oncogenic alteration in evolution.
The Hippo pathway is a central regulator of organ size and a key tumor suppressor that coordinates cell proliferation and death. Initially discovered in Drosophila, the Hippo pathway has been implicated as an evolutionarily conserved pathway in mammals; however, how this pathway evolved from its origin to become functional is still largely unknown. In this study, we traced the Hippo pathway in premetazoan species, characterized the intrinsic functions of its ancestor components, and unveiled the evolutionary history of this key signaling pathway from its unicellular origin. In addition, we elucidated the paralogous gene history for the mammalian Hippo pathway components and characterized their cancer-derived somatic mutations from an evolutionary perspective. Taken together, our findings not only traced the conserved function of the Hippo pathway to its unicellular ancestor components, but also provided novel evolutionary insights into the Hippo pathway organization and oncogenic alteration.
How good are the fits to the experimental velocity profiles in vivo?
This paper was presented at the 3rd Micro and Nano Flows Conference (MNF2011), which was held at the Makedonia Palace Hotel, Thessaloniki in Greece. The conference was organised by Brunel University and supported by the Italian Union of Thermofluiddynamics, Aristotle University of Thessaloniki, University of Thessaly, IPEM, the Process Intensification Network, the Institution of Mechanical Engineers, the Heat Transfer Society, HEXAG - the Heat Exchange Action Group, and the Energy Institute.
A new velocity profile equation for the description of microcirculatory blood flow in vivo was proposed in 2009. However, various recently published papers still assume a parabolic velocity
profile (Poiseuille flow). The purpose of this work was to evaluate the performance of 3 different fitting cases: 1) best parabolic fit, 2) axial fit with the proposed equation and 3) best fit with the proposed equation.
Twelve experimental velocity profiles measured by particle image velocimetry in mouse venules were used to compare the fitting efficiency of the 3 cases on the basis of the velocity relative error (RE) expressed as average ± SE (standard error) at ten different radial segments (REj with 1 ≤ j ≤ 10). The parabolic best fit (case 1) leads to serious deviations from the real velocity distribution (RE10 = -65% ± 2%). The proposed equation axial fit (case 2) slightly overestimates blood velocity distribution near the vessel wall but the
was below +12% and it requires only one experimental value near the vessel axis, measurable using the Doppler effect. The proposed equation best fit (case 3) approximates the experimental data without any serious bias but requires a complete velocity profile data set.
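To make the "best parabolic fit" of case 1 concrete, here is a minimal sketch of a least-squares fit of the Poiseuille profile v(r) = v_max (1 - (r/R)^2), followed by a signed relative error at each radial position. The definition of RE as (fitted - measured) / measured is an assumption, since the abstract does not spell it out, and the radii, velocities, and vessel radius are invented for illustration; the equation proposed in 2009 is not reproduced here.

```python
def fit_parabola(radii, velocities, vessel_radius):
    """Least-squares v_max for the model v(r) = v_max * (1 - (r / R)**2)."""
    basis = [1.0 - (r / vessel_radius) ** 2 for r in radii]
    numerator = sum(b * v for b, v in zip(basis, velocities))
    denominator = sum(b * b for b in basis)
    return numerator / denominator


def relative_errors(radii, velocities, vessel_radius, v_max):
    """Signed relative error (fitted - measured) / measured at each radius (assumed definition)."""
    errors = []
    for r, v in zip(radii, velocities):
        fitted = v_max * (1.0 - (r / vessel_radius) ** 2)
        errors.append((fitted - v) / v)
    return errors


# Hypothetical blunted velocity profile (not data from the paper).
radii = [0.0, 5.0, 10.0, 15.0, 19.0]      # radial position, micrometres
velocities = [2.0, 1.95, 1.8, 1.4, 0.8]   # measured velocity, mm/s
R = 20.0                                  # vessel radius, micrometres

v_max = fit_parabola(radii, velocities, R)
errors = relative_errors(radii, velocities, R, v_max)
```

For this made-up blunted profile, the last entry of `errors` is negative, meaning the fitted parabola falls below the measured velocity at the position nearest the wall.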
Cliophysics: Socio-political Reliability Theory, Polity Duration and African Political (In)stabilities
Quantification of historical sociological processes has recently gained
attention among theoreticians in the effort of providing a solid theoretical
understanding of the behaviors and regularities present in sociopolitical
dynamics. Here we present a reliability theory of polity processes with
emphasis on the individual political dynamics of African countries. We found that
the structural properties of polity failure rates successfully capture the risk
of political vulnerability and instability: 87.50%, 75%, 71.43%, and
0% of the countries with monotonically increasing, unimodal, U-shaped and
monotonically decreasing polity failure rates, respectively, have high
state fragility indices. The quasi-U-shape relationship between average polity
duration and regime types corroborates historical precedents and explains the
stability of the autocracies and democracies.
Comment: 4 pages, 3 figures, 1 table