7,218 research outputs found

    Advocating better habitat use and selection models in bird ecology

    Studies on habitat use and habitat selection are a basic aspect of bird ecology because of their importance for natural history, distribution, response to environmental change, management and conservation. Essentially, the aim is to find a statistical model that identifies the environmental variables linked to a species' presence. A wide array of analytical methods can identify important explanatory variables within a model, with higher explanatory and predictive power than classical regression approaches. However, some of these powerful models are not widespread in ornithological studies, partly because of their complex theory and, in some cases, difficulties in their implementation and interpretation. Here, I describe generalized linear models and five other statistical models for the analysis of bird habitat use and selection that outperform classical approaches: generalized additive models, mixed effects models, occupancy models, binomial N-mixture models and decision trees (classification and regression trees, bagging, random forests and boosting). Each of these models has its benefits and drawbacks, but major advantages include dealing with non-normal distributions (presence-absence and abundance data typically found in habitat use and selection studies), heterogeneous variances, non-linear and complex relationships among variables, lack of statistical independence and imperfect detection. To help ornithologists use the methods described, a readable description of each method is provided, together with a flowchart and some recommendations for choosing the most appropriate analysis. The use of these models in ornithological studies is encouraged, given their huge potential as statistical tools in bird ecology.
    Affiliation: Palacio, Facundo Xavier. Consejo Nacional de Investigaciones Científicas y Técnicas; Argentina. Universidad Nacional de La Plata. Facultad de Ciencias Naturales y Museo. División Zoología de Vertebrados. Sección Ornitología; Argentin

    Detecting multivariate interactions in spatial point patterns with Gibbs models and variable selection

    We propose a method for detecting significant interactions in very large multivariate spatial point patterns. This methodology advances the understanding of high-dimensional data in the point process setting. The method is based on modelling the patterns using a flexible Gibbs point process model to directly characterise point-to-point interactions at different spatial scales. By using the Gibbs framework, significant interactions can also be captured at small scales. The Gibbs point process is then fitted using a pseudo-likelihood approximation, and we select significant interactions automatically using the group lasso penalty with this likelihood approximation. Thus we estimate the multivariate interactions stably even in this setting. We demonstrate the feasibility of the method with a simulation study and show its power by applying it to a large and complex rainforest plant population data set of 83 species.

    Distributed multinomial regression

    This article introduces a model-based approach to distributed computing for multinomial logistic (softmax) regression. We treat counts for each response category as independent Poisson regressions via plug-in estimates for fixed effects shared across categories. The work is driven by the high-dimensional-response multinomial models that are used in analysis of a large number of random counts. Our motivating applications are in text analysis, where documents are tokenized and the token counts are modeled as arising from a multinomial dependent upon document attributes. We estimate such models for a publicly available data set of reviews from Yelp, with text regressed onto a large set of explanatory variables (user, business, and rating information). The fitted models serve as a basis for exploring the connection between words and variables of interest, for reducing dimension into supervised factor scores, and for prediction. We argue that the approach herein provides an attractive option for social scientists and other text analysts who wish to bring familiar regression tools to bear on text data.
    Comment: Published at http://dx.doi.org/10.1214/15-AOAS831 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org
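The Poisson treatment described in this abstract rests on a standard identity: independent Poisson counts, conditioned on their total, follow a multinomial whose probabilities are the softmax of the category linear predictors. The Python sketch below illustrates that identity only. The function names, the toy predictors `eta`, and the use of `log(total)` as the plug-in for the shared fixed effect are assumptions made for illustration and are not taken from the article.

```python
import math

def softmax(eta):
    """Multinomial logistic probabilities for linear predictors eta."""
    m = max(eta)  # subtract the max for numerical stability
    exps = [math.exp(e - m) for e in eta]
    s = sum(exps)
    return [e / s for e in exps]

def poisson_rates(eta, total):
    """Per-category Poisson rates with a shared effect mu.

    Assumption: mu is plugged in as log(total count) for the observation.
    """
    mu = math.log(total)
    return [math.exp(mu + e) for e in eta]

def implied_multinomial_probs(rates):
    """Probabilities of the multinomial obtained by conditioning
    independent Poisson counts on their sum."""
    s = sum(rates)
    return [r / s for r in rates]

# Toy example: four response categories, one observation with 37 tokens.
eta = [0.2, -1.0, 0.5, 0.0]
total = 37
rates = poisson_rates(eta, total)
probs_from_poisson = implied_multinomial_probs(rates)
probs_softmax = softmax(eta)
```

The shared term cancels when the rates are normalised, so the two probability vectors agree. This is what would allow each category's regression to be fitted separately, for example on different machines, once the shared effect is fixed.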

    Boosting insights in insurance tariff plans with tree-based machine learning methods

    Pricing actuaries typically operate within the framework of generalized linear models (GLMs). With the upswing of data analytics, our study focuses on machine learning methods to develop full tariff plans built from both the frequency and severity of claims. We adapt the loss functions used in the algorithms such that the specific characteristics of insurance data are carefully incorporated: highly unbalanced count data with excess zeros and varying exposure on the frequency side combined with scarce, but potentially long-tailed data on the severity side. A key requirement is the need for transparent and interpretable pricing models which are easily explainable to all stakeholders. We therefore focus on machine learning with decision trees: starting from simple regression trees, we work towards more advanced ensembles such as random forests and boosted trees. We show how to choose the optimal tuning parameters for these models in an elaborate cross-validation scheme, present visualization tools to obtain insights from the resulting models, and evaluate the economic value of these new modeling approaches. Boosted trees outperform the classical GLMs, allowing the insurer to form profitable portfolios and to guard against potential adverse risk selection.

    Private Trees as Household Assets and Determinants of Tree-Growing Behavior in Rural Ethiopia

    This study looked into tree-growing behavior of rural households in Ethiopia. With data collected at household and parcel levels from the four major regions of Ethiopia, we analyzed the decision to grow trees and the number of trees grown, using such econometric strategies as a zero-inflated negative binomial model, Heckman’s two-step procedure, and panel data techniques. Our findings show the importance of analysis at the parcel level in addition to the more common household level. Moreover, the empirical analysis indicates that the determinants of the decision to grow trees are not necessarily the same as those involved in deciding the number of trees grown. Land certification, as an indicator of tenure security, increases the likelihood that households will grow trees, but is not a significant determinant of the number of trees grown. Other variables, such as risk aversion, land size, adult male labor, and education of household head, also influence the number of trees grown. In general, the results suggest the need to use education and/or awareness of the role and importance of trees and point out the importance of household endowments and behavior, such as land, labor, and risk aversion, for tree growing. Finally, we observed that, while tree planting is practiced in all four regions covered, there are variations across regions.
    Keywords: trees as assets, tree growing, Ethiopia

    Variable Selection for Nonparametric Gaussian Process Priors: Models and Computational Strategies

    This paper presents a unified treatment of Gaussian process models that extends to data from the exponential dispersion family and to survival data. Our specific interest is in the analysis of data sets with predictors that have an a priori unknown form of possibly nonlinear associations to the response. The modeling approach we describe incorporates Gaussian processes in a generalized linear model framework to obtain a class of nonparametric regression models where the covariance matrix depends on the predictors. We consider, in particular, continuous, categorical and count responses. We also look into models that account for survival outcomes. We explore alternative covariance formulations for the Gaussian process prior and demonstrate the flexibility of the construction. Next, we focus on the important problem of selecting variables from the set of possible predictors and describe a general framework that employs mixture priors. We compare alternative MCMC strategies for posterior inference and achieve a computationally efficient and practical approach. We demonstrate performances on simulated and benchmark data sets.
    Comment: Published at http://dx.doi.org/10.1214/11-STS354 in Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org

    Estimation and Regularization Techniques for Regression Models with Multidimensional Prediction Functions

    Boosting is one of the most important methods for fitting regression models and building prediction rules from high-dimensional data. A notable feature of boosting is that the technique has a built-in mechanism for shrinking coefficient estimates and for variable selection. This regularization mechanism makes boosting a suitable method for analyzing data characterized by small sample sizes and large numbers of predictors. We extend the existing methodology by developing a boosting method for prediction functions with multiple components. Such multidimensional functions occur in many types of statistical models, for example in count data models and in models involving outcome variables with a mixture distribution. As will be demonstrated, the new algorithm is suitable for both the estimation of the prediction function and regularization of the estimates. In addition, nuisance parameters can be estimated simultaneously with the prediction function.
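The built-in shrinkage and variable selection mentioned in this abstract can be illustrated with ordinary componentwise L2 boosting, in which each step updates only the single predictor that best fits the current residuals. The Python sketch below shows that one-dimensional baseline, not the multidimensional extension the paper develops. The function name, the step length `nu`, the number of steps and the simulated data are all assumptions chosen for illustration.

```python
import random

def componentwise_l2_boost(X, y, n_steps=300, nu=0.1):
    """Componentwise least-squares boosting.

    X is a list of rows whose columns are assumed to be centred.
    Returns (intercept, coef). Predictors never selected keep coef 0.
    """
    n, p = len(X), len(X[0])
    intercept = sum(y) / n
    resid = [yi - intercept for yi in y]
    cols = [[row[j] for row in X] for j in range(p)]
    col_ss = [sum(v * v for v in col) for col in cols]
    coef = [0.0] * p
    for _ in range(n_steps):
        # Least-squares slope of the residuals on each predictor alone.
        slopes = [sum(c * r for c, r in zip(cols[j], resid)) / col_ss[j]
                  for j in range(p)]
        # Select the predictor that reduces the residual sum of squares most.
        best = max(range(p), key=lambda j: slopes[j] ** 2 * col_ss[j])
        # Shrunken update: move only a fraction nu of the way.
        step = nu * slopes[best]
        coef[best] += step
        resid = [r - step * c for r, c in zip(resid, cols[best])]
    return intercept, coef

# Simulated data: only predictors 0 and 3 carry signal.
random.seed(0)
n, p = 100, 10
X = [[random.gauss(0, 1) for _ in range(p)] for _ in range(n)]
means = [sum(row[j] for row in X) / n for j in range(p)]
X = [[row[j] - means[j] for j in range(p)] for row in X]  # centre columns
y = [2.0 * row[0] - 1.0 * row[3] + random.gauss(0, 0.1) for row in X]

intercept, coef = componentwise_l2_boost(X, y)
top_two = sorted(range(p), key=lambda j: abs(coef[j]), reverse=True)[:2]
```

Stopping the loop early leaves the coefficients shrunken towards zero, so in this sketch the number of steps acts as the regularization parameter.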

    VALUING IDAHO WINERIES WITH A TRAVEL COST MODEL

    Many commercial wineries produce a dual product: commercial wine and wine tourism. Growth of wine tourism throughout the US has been phenomenal. In contrast to the price of wine, which is reflected in the market, the demand for wine tourism can only be ascertained with a shadow price for winery visitation. The demand for wine tourism visits for Canyon County in southern Idaho was estimated using the Travel Cost Method. The value of wine tourism in Canyon County was estimated to be $5.40 per person per trip and trip demand was highly inelastic at 0.5. Elasticities of other trip demand function variables were estimated and analyzed, with a view to informing the marketing of Idaho's emerging wine tourism industry.
    Keywords: Community/Rural/Urban Development, Crop Production/Industries,

    Risk factor analysis and spatiotemporal CART model of cryptosporidiosis in Queensland, Australia

    Background: It remains unclear whether it is possible to develop a spatiotemporal epidemic prediction model for cryptosporidiosis. This paper examined the impact of social economic and weather factors on cryptosporidiosis and explored the possibility of developing such a model using social economic and weather data in Queensland, Australia.
    Methods: Data on weather variables, notified cryptosporidiosis cases and social economic factors in Queensland were supplied by the Australian Bureau of Meteorology, Queensland Department of Health, and Australian Bureau of Statistics, respectively. Three-stage spatiotemporal classification and regression tree (CART) models were developed to examine the association between social economic and weather factors and monthly incidence of cryptosporidiosis in Queensland, Australia. The spatiotemporal CART model was used for predicting the outbreak of cryptosporidiosis in Queensland, Australia.
    Results: The results of the classification tree model (with incidence rates defined as binary presence/absence) showed that there was an 87% chance of an occurrence of cryptosporidiosis in a local government area (LGA) if the socio-economic index for the area (SEIFA) exceeded 1021, while the results of the regression tree model (based on non-zero incidence rates) showed that when SEIFA was between 892 and 945 and temperature exceeded 32°C, the relative risk (RR) of cryptosporidiosis was 3.9 (mean morbidity: 390.6/100,000, standard deviation (SD): 310.5), compared to the monthly average incidence of cryptosporidiosis. When SEIFA was less than 892, the RR of cryptosporidiosis was 4.3 (mean morbidity: 426.8/100,000, SD: 319.2). A prediction map for the cryptosporidiosis outbreak was made according to the outputs of spatiotemporal CART models.
    Conclusions: The results of this study suggest that spatiotemporal CART models based on social economic and weather variables can be used for predicting the outbreak of cryptosporidiosis in Queensland, Australia.
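The two regression-tree leaves reported in the Results can be written as a small lookup, and the reported relative risks can be checked against the quoted mean morbidity figures. The Python sketch below uses only the numbers given in the abstract. The function names, the treatment of the 892 to 945 interval as inclusive, the `None` return for combinations the abstract does not report, and the reading of RR as leaf mean divided by a common average incidence are assumptions.

```python
def reported_relative_risk(seifa, temperature_c):
    """Relative risk for the two leaves quoted in the abstract.

    Returns None for covariate combinations the abstract does not report.
    Interval bounds are treated as inclusive (an assumption).
    """
    if seifa < 892:
        return 4.3
    if 892 <= seifa <= 945 and temperature_c > 32:
        return 3.9
    return None

def implied_baseline(mean_morbidity, relative_risk):
    """Average incidence per 100,000 implied by a leaf, assuming
    RR = leaf mean morbidity / average incidence."""
    return mean_morbidity / relative_risk

# Baselines implied by the two reported leaves (per 100,000).
baseline_mid_seifa = implied_baseline(390.6, 3.9)
baseline_low_seifa = implied_baseline(426.8, 4.3)
```

Under that reading, both leaves imply an average incidence of roughly 100 cases per 100,000, so the two reported relative risks are mutually consistent.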