
    Analyzing establishment nonresponse using an interpretable regression tree model with linked administrative data

    To gain insight into how an establishment's characteristics are associated with nonresponse, a recursive partitioning algorithm is applied to the Occupational Employment Statistics (OES) May 2006 survey data to build a regression tree. The tree models an establishment's propensity to respond to the survey given certain establishment characteristics. It partitions establishments into mutually exclusive cells, defined by those characteristics, with homogeneous response propensities. This makes it easy to identify interpretable associations between the characteristic variables and an establishment's propensity to respond, something not easily done using a logistic regression propensity model. We test the model obtained using the May data against data from the November 2006 OES survey. Testing the model on a disjoint set of establishment data with a very large sample size (n=179,360) offers evidence that the regression tree model accurately describes the association between the establishment characteristics and the response propensity for the OES survey. The accuracy of this modeling approach is compared to that of logistic regression through simulation. The tree representation is then used, along with frame-level administrative wage data linked to sample data, to investigate the possibility of nonresponse bias. We show that without proper adjustments the nonresponse does pose a risk of bias and is possibly nonignorable. Comment: Published at http://dx.doi.org/10.1214/11-AOAS521 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)
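The idea of cells with homogeneous response propensities can be illustrated with a single least-squares split. This is only a minimal sketch, not the authors' procedure: the data, the single "employment size" characteristic, and the helper names are hypothetical, and a real tree would split recursively on several characteristics.

```python
# Minimal sketch: one least-squares split of a 0/1 response indicator
# on a single numeric characteristic. All data below are made up.

def mean(values):
    return sum(values) / len(values)

def sse(values):
    """Sum of squared deviations from the mean (within-cell heterogeneity)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)

def best_split(x, y):
    """Return (threshold, left_propensity, right_propensity) minimizing total SSE.

    Assumes x contains at least two distinct values.
    """
    pairs = sorted(zip(x, y))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical values
        left = [r for _, r in pairs[:i]]
        right = [r for _, r in pairs[i:]]
        cost = sse(left) + sse(right)
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or cost < best[0]:
            best = (cost, threshold, mean(left), mean(right))
    return best[1:]

# Hypothetical establishments: employment size and whether they responded (1) or not (0).
size = [5, 8, 12, 20, 150, 300, 500, 900]
responded = [1, 1, 1, 1, 0, 1, 0, 0]

threshold, p_small, p_large = best_split(size, responded)
print(f"size < {threshold}: propensity {p_small:.2f}; otherwise {p_large:.2f}")
```

Each side of the split is one cell, and the cell mean of the 0/1 indicator is its estimated response propensity, which is what makes the resulting model easy to read.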

    Assessment of hydrological and seasonal controls over the nitrate flushing from a forested watershed using a data mining technique

    The M5 regression tree algorithm, a data mining technique, was used to examine how hydrological and seasonal settings jointly control streamwater nitrate flushing during hydrological events in a forested watershed in southwestern Slovenia, which is characterized by a distinctive flushing, almost torrential, hydrological regime. The basis for the research was an extensive dataset of continuous, high-frequency measurements of seasonal meteorological conditions, watershed hydrological responses and streamwater nitrate concentrations. The dataset contained 16 recorded hydrographs occurring in different seasonal and hydrological conditions. Based on predefined regression tree pruning criteria, a regression tree model was obtained that is comprehensible in terms of domain knowledge and adequately describes most of the variation in streamwater nitrate concentration (RMSE = 1.02 mg/l-N; r = 0.91). The attributes found to be most descriptive of streamwater nitrate concentrations were the antecedent precipitation index (API) and air temperatures in the preceding periods. The model was most successful in describing streamwater concentrations in the range 1-4 mg/l-N, which covers a large proportion of the dataset. Model performance was slightly worse during the high nitrate concentration peaks of the summer hydrographs (up to 7 mg/l-N) and poor during the autumn hydrograph (up to 14 mg/l-N), which was related to highly variable hydrological conditions and would require a less robust regression tree model based on an extended dataset.
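The two fit statistics quoted in this abstract, RMSE and the correlation coefficient r, can be computed as in the sketch below. The observed and predicted values are invented for illustration and are not taken from the study; r is assumed here to be the Pearson correlation.

```python
import math

def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))

def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical nitrate concentrations in mg/l-N.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]

error = rmse(observed, predicted)
r = pearson_r(observed, predicted)
print(f"RMSE = {error:.3f} mg/l-N, r = {r:.3f}")
```

RMSE is in the units of the target, so it can be read directly against the 1-4 mg/l-N range mentioned above, while r only measures how well predictions track the observations linearly.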

    BAYESIAN ADDITIVE REGRESSION TREE APPLICATION FOR PREDICTING MATERNITY RECOVERY RATE OF GROUP LONG-TERM DISABILITY INSURANCE

    Bayesian Additive Regression Tree (BART) is a sum-of-trees model for classification and regression problems. The main idea of the method is to use a prior distribution that keeps each tree small and a likelihood from the data to obtain the posterior. Because each tree is kept as small as possible, each tree contributes only a small part of the overall fit, which is the sum of the outputs of all the trees used. The BART method is used to predict the maternity recovery rate in group long-term disability insurance data from the Society of Actuaries (SOA). Decision tree-based models, namely Gradient Boosting Machine, Random Forest, Decision Tree, and BART, are compared on mean squared error and program runtime to find the best model. In this comparison, the BART model gives the best prediction, with smaller root mean squared error values and a relatively short runtime.
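For reference, the sum-of-trees structure described in this abstract is usually written as follows for a continuous response. The notation is the standard one from the BART literature and is an assumption here, since the abstract itself introduces no symbols: $T_j$ is the $j$-th tree, $M_j$ its set of leaf values, $g$ returns the leaf value that $x$ falls into, and $m$ is the number of trees.

```latex
Y = \sum_{j=1}^{m} g(x;\, T_j, M_j) + \varepsilon,
\qquad \varepsilon \sim \mathcal{N}(0, \sigma^2)
```

The prior on each pair $(T_j, M_j)$ favors shallow trees and shrinks the leaf values toward zero, so no single term dominates the sum.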

    A Bayesian regression tree approach to identify the effect of nanoparticles' properties on toxicity profiles

    We introduce a Bayesian multiple regression tree model to characterize relationships between physico-chemical properties of nanoparticles and their in-vitro toxicity over multiple doses and times of exposure. Unlike conventional models that rely on data summaries, our model solves the low sample size issue and avoids arbitrary loss of information by combining all measurements from a general exposure experiment across doses, times of exposure, and replicates. The proposed technique integrates Bayesian trees for modeling threshold effects and interactions, and penalized B-splines for dose- and time-response surface smoothing. The resulting posterior distribution is sampled by Markov Chain Monte Carlo. This method allows for inference on a number of quantities of potential interest to substantive nanotoxicology, such as the importance of physico-chemical properties and their marginal effect on toxicity. We illustrate the application of our method to the analysis of a library of 24 nano metal oxides. Comment: Published at http://dx.doi.org/10.1214/14-AOAS797 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

    Evaluation of Flashover Voltage Levels of Contaminated Hydrophobic Polymer Insulators Using Regression Trees, Neural Networks, and Adaptive Neuro-Fuzzy

    The pollution of high-voltage insulators has acquired considerable importance with the rise in transmission line voltages. Contamination may lead to flashover, which in turn can cause service outages and negatively affect the reliability of the power system. This paper presents a dynamic model of AC 50 Hz flashover voltages of polluted hydrophobic polymer insulators. The models are constructed using the regression tree method, an artificial neural network (ANN), and an adaptive neuro-fuzzy inference system (ANFIS). For this purpose, more than 2000 different experimental testing conditions were used to generate a training set. The AC flashover voltage is studied as a function of the percentage of silicone rubber (SiR) in ethylene propylene diene monomer (EPDM) rubber. In addition, water conductivity (μS/cm), the number of droplets on the surface, and the volume of the water droplets (ml) are considered. The regression tree model is obtained and its performance is compared with that of the other intelligence methods. It can be concluded that the least squares regression tree model outperforms the other intelligence methods, which gives the proposed model better generalization ability.

    Impervious surface estimation using remote sensing images and GIS: how accurate is the estimate at subdivision level?

    Impervious surface has long been accepted as a key environmental indicator linking development to its impacts on water. Many have suggested that there is a direct correlation between the degree of imperviousness and both the quantity and quality of water. Quantifying the amount of impervious surface, however, remains difficult and tedious, especially in urban areas. Recently, more effort has focused on applying remote sensing and GIS technologies to assess the amount of impervious surface, and many have reported promising results at various pixel levels. This paper discusses an attempt at estimating the amount of impervious surface at the subdivision level using remote sensing images and GIS techniques. Using Landsat ETM+ images and GIS techniques, a regression tree model is first developed for estimating pixel imperviousness. GIS zonal functions are then used to estimate the amount of impervious surface for a sample of subdivisions. The accuracy of the model is evaluated by comparing the model-predicted imperviousness to digitized imperviousness at the subdivision level. The paper concludes with a discussion of the convenience and accuracy of using the method to estimate imperviousness for large areas.
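The zonal step described here, aggregating per-pixel imperviousness predictions to subdivisions, amounts to a grouped mean. The sketch below assumes each pixel has already been assigned a subdivision identifier; the identifiers, values, and the `zonal_mean` helper are hypothetical and stand in for whatever GIS zonal function the authors used.

```python
from collections import defaultdict

def zonal_mean(zone_ids, values):
    """Average the per-pixel values within each zone (e.g. subdivision)."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for zone, value in zip(zone_ids, values):
        totals[zone] += value
        counts[zone] += 1
    return {zone: totals[zone] / counts[zone] for zone in totals}

# Hypothetical pixels: subdivision label and predicted percent imperviousness.
zones = ["A", "A", "B", "B", "B"]
imperviousness = [20.0, 40.0, 10.0, 50.0, 90.0]

result = zonal_mean(zones, imperviousness)
print(result)
```

Comparing these zone-level means with digitized imperviousness for the same subdivisions is the accuracy check the abstract describes.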