
    Robust regression with imprecise data

    We consider the problem of regression analysis with imprecise data. By imprecise data we mean imprecise observations of precise quantities in the form of sets of values. In this paper, we explore a recently introduced likelihood-based approach to regression with such data. The approach is very general, since it covers all kinds of imprecise data (i.e., not only intervals) and is not restricted to linear regression. Its result consists of a set of functions, reflecting the entire uncertainty of the regression problem. Here we study in particular a robust special case of likelihood-based imprecise regression, which can be interpreted as a generalization of the method of least median of squares. Moreover, we apply it to data from a social survey and compare it with other approaches to regression with imprecise data. It turns out that the likelihood-based approach is the most generally applicable one and is the only approach accounting for multiple sources of uncertainty at the same time.
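
    For orientation, the classical least median of squares (LMS) criterion mentioned in the abstract can be sketched for precise (non-interval) data as follows. This is not the paper's likelihood-based method: it is a minimal approximation that only tries lines through pairs of observed points, and the function name `lms_line` and the toy data in the usage note are assumptions made for illustration.

```python
from itertools import combinations
from statistics import median


def lms_line(xs, ys):
    """Approximate LMS fit y = a + b*x on precise data.

    Assumption: candidate lines are restricted to those passing through
    two observed points, which approximates (but does not guarantee)
    the exact minimizer of the median squared residual.
    """
    best = None
    for i, j in combinations(range(len(xs)), 2):
        if xs[i] == xs[j]:
            continue  # vertical line, slope undefined
        b = (ys[j] - ys[i]) / (xs[j] - xs[i])
        a = ys[i] - b * xs[i]
        # LMS criterion: median of the squared residuals
        crit = median((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
        if best is None or crit < best[0]:
            best = (crit, a, b)
    if best is None:
        raise ValueError("need at least two distinct x values")
    return best[1], best[2], best[0]
```

    On hypothetical data such as `xs = [0, 1, 2, 3, 4, 5, 6]` and `ys = [1, 3, 5, 7, 9, 11, 100]`, the single outlier does not affect the fitted line, which is the robustness property the abstract refers to.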

    Linear regression for numeric symbolic variables: an ordinary least squares approach based on Wasserstein Distance

    In this paper we present a linear regression model for modal symbolic data. The observed variables are histogram variables, according to the definition given in the framework of Symbolic Data Analysis, and the parameters of the model are estimated using the classic Least Squares method. An appropriate metric is introduced in order to measure the error between the observed and the predicted distributions. In particular, the Wasserstein distance is proposed. Some properties of this metric are exploited to predict the response variable as a direct linear combination of other independent histogram variables. Measures of goodness of fit are discussed. An application on real data corroborates the proposed method.
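
    As an illustration of the kind of metric involved, the squared L2 Wasserstein distance between two one-dimensional distributions can be written as the integral over t in (0, 1) of the squared difference of their quantile functions. The sketch below evaluates it numerically for two histograms. The histogram representation (a list of `(lo, hi, weight)` bins with mass spread uniformly inside each bin), the midpoint-rule integration, and the function names are assumptions made here, not the paper's implementation.

```python
def quantile(hist, t):
    """Quantile function of a histogram at level t in (0, 1).

    hist: list of (lo, hi, weight) bins, sorted by lo, weights summing to 1.
    Assumption: mass is uniformly distributed within each bin.
    """
    cum = 0.0
    for lo, hi, w in hist:
        if w > 0 and t <= cum + w:
            return lo + (hi - lo) * (t - cum) / w
        cum += w
    return hist[-1][1]  # guard against rounding when t is close to 1


def wasserstein2_sq(h1, h2, n=10000):
    """Midpoint-rule approximation of the squared L2 Wasserstein distance."""
    total = 0.0
    for k in range(n):
        t = (k + 0.5) / n
        total += (quantile(h1, t) - quantile(h2, t)) ** 2
    return total / n
```

    For example, `wasserstein2_sq([(0, 1, 1.0)], [(2, 3, 1.0)])` compares a uniform distribution on [0, 1] with the same distribution shifted by 2, so the squared distance is the squared shift.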

    The first analytical expression to estimate photometric redshifts suggested by a machine

    We report the first analytical expression purely constructed by a machine to determine photometric redshifts ($z_{\rm phot}$) of galaxies. A simple and reliable functional form is derived using 41,214 galaxies from the Sloan Digital Sky Survey Data Release 10 (SDSS-DR10) spectroscopic sample. The method automatically dropped the $u$ and $z$ bands, relying only on $g$, $r$ and $i$ for the final solution. Applying this expression to other 1,417,181 SDSS-DR10 galaxies, with measured spectroscopic redshifts ($z_{\rm spec}$), we achieved a mean $\langle (z_{\rm phot} - z_{\rm spec})/(1+z_{\rm spec})\rangle \lesssim 0.0086$ and a scatter $\sigma_{(z_{\rm phot} - z_{\rm spec})/(1+z_{\rm spec})} \lesssim 0.045$ when averaged up to $z \lesssim 1.0$. The method was also applied to the PHAT0 dataset, confirming the competitiveness of our results when faced with other methods from the literature. This is the first use of symbolic regression in cosmology, representing a leap forward in astronomy-data-mining connection. Comment: 6 pages, 4 figures. Accepted for publication in MNRAS Letter
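
    The two quality measures quoted in the abstract, the mean and the scatter of (z_phot - z_spec)/(1 + z_spec), can be computed as in the sketch below. The function name `photoz_metrics` is hypothetical, and the use of the population standard deviation for the scatter is an assumption, since the abstract does not say which estimator was used.

```python
from statistics import fmean, pstdev


def photoz_metrics(z_phot, z_spec):
    """Return (mean, scatter) of the normalized redshift residuals.

    Assumption: scatter is taken as the population standard deviation.
    """
    delta = [(zp - zs) / (1.0 + zs) for zp, zs in zip(z_phot, z_spec)]
    return fmean(delta), pstdev(delta)
```

    With made-up inputs `z_phot = [0.1, 1.0]` and `z_spec = [0.0, 1.0]`, the normalized residuals are 0.1 and 0.0, which gives a mean of 0.05 and a scatter of 0.05.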

    Likelihood-based Imprecise Regression

    We introduce a new approach to regression with imprecisely observed data, combining likelihood inference with ideas from imprecise probability theory, and thereby taking different kinds of uncertainty into account. The approach is very general and applicable to various kinds of imprecise data, not only to intervals. In the present paper, we propose a regression method based on this approach, where no parametric distributional assumption is needed and interval estimates of quantiles of the error distribution are used to identify plausible descriptions of the relationship of interest. Therefore, the proposed regression method is very robust. We apply our robust regression method to an interesting question in the social sciences. The analysis, based on survey data, yields a relatively imprecise result, reflecting the high amount of uncertainty inherent in the analyzed data set.