19,623 research outputs found

    Scalable Bayesian nonparametric regression via a Plackett-Luce model for conditional ranks

    Full text link
    We present a novel Bayesian nonparametric regression model for covariates X and continuous, real response variable Y. The model is parametrized in terms of marginal distributions for Y and X and a regression function which tunes the stochastic ordering of the conditional distributions F(y|x). By adopting an approximate composite likelihood approach, we show that the resulting posterior inference can be decoupled for the separate components of the model. This procedure can scale to very large datasets and allows for the use of standard, existing, software from Bayesian nonparametric density estimation and Plackett-Luce ranking estimation to be applied. As an illustration, we show an application of our approach to a US Census dataset, with over 1,300,000 data points and more than 100 covariates

    Nonparametric Methods in Astronomy: Think, Regress, Observe -- Pick Any Three

    Get PDF
    Telescopes are much more expensive than astronomers, so it is essential to minimize required sample sizes by using the most data-efficient statistical methods possible. However, the most commonly used model-independent techniques for finding the relationship between two variables in astronomy are flawed. In the worst case they can lead without warning to subtly yet catastrophically wrong results, and even in the best case they require more data than necessary. Unfortunately, there is no single best technique for nonparametric regression. Instead, we provide a guide for how astronomers can choose the best method for their specific problem and provide a python library with both wrappers for the most useful existing algorithms and implementations of two new algorithms developed here.Comment: 19 pages, PAS

    An investigation into machine learning approaches for forecasting spatio-temporal demand in ride-hailing service

    Full text link
    In this paper, we present machine learning approaches for characterizing and forecasting the short-term demand for on-demand ride-hailing services. We propose the spatio-temporal estimation of the demand that is a function of variable effects related to traffic, pricing and weather conditions. With respect to the methodology, a single decision tree, bootstrap-aggregated (bagged) decision trees, random forest, boosted decision trees, and artificial neural network for regression have been adapted and systematically compared using various statistics, e.g. R-square, Root Mean Square Error (RMSE), and slope. To better assess the quality of the models, they have been tested on a real case study using the data of DiDi Chuxing, the main on-demand ride hailing service provider in China. In the current study, 199,584 time-slots describing the spatio-temporal ride-hailing demand has been extracted with an aggregated-time interval of 10 mins. All the methods are trained and validated on the basis of two independent samples from this dataset. The results revealed that boosted decision trees provide the best prediction accuracy (RMSE=16.41), while avoiding the risk of over-fitting, followed by artificial neural network (20.09), random forest (23.50), bagged decision trees (24.29) and single decision tree (33.55).Comment: Currently under review for journal publicatio

    Spatial Smoothing Techniques for the Assessment of Habitat Suitability

    Get PDF
    Precise knowledge about factors influencing the habitat suitability of a certain species forms the basis for the implementation of effective programs to conserve biological diversity. Such knowledge is frequently gathered from studies relating abundance data to a set of influential variables in a regression setup. In particular, generalised linear models are used to analyse binary presence/absence data or counts of a certain species at locations within an observation area. However, one of the key assumptions of generalised linear models, the independence of the observations is often violated in practice since the points at which the observations are collected are spatially aligned. While several approaches have been developed to analyse and account for spatial correlation in regression models with normally distributed responses, far less work has been done in the context of generalised linear models. In this paper, we describe a general framework for semiparametric spatial generalised linear models that allows for the routine analysis of non-normal spatially aligned regression data. The approach is utilised for the analysis of a data set of synthetic bird species in beech forests, revealing that ignorance of spatial dependence actually may lead to false conclusions in a number of situations
    corecore