473 research outputs found

    The Influence Function of Penalized Regression Estimators

    To perform regression analysis in high dimensions, lasso or ridge estimation is a common choice. However, it has been shown that these methods are not robust to outliers. Therefore, alternatives such as penalized M-estimation or the sparse least trimmed squares (LTS) estimator have been proposed. The robustness of these regression methods can be measured with the influence function, which quantifies the effect of infinitesimal perturbations in the data. Furthermore, it can be used to compute the asymptotic variance and the mean squared error. In this paper we compute the influence function, the asymptotic variance and the mean squared error for penalized M-estimators and the sparse LTS estimator. The asymptotic bias of the estimators makes the calculations nonstandard. We show that only M-estimators with a loss function with a bounded derivative are robust against regression outliers. In particular, the lasso has an unbounded influence function. Comment: appears in Statistics: A Journal of Theoretical and Applied Statistics, 201

    The shooting S-estimator for robust regression

    To perform multiple regression, the least squares estimator is commonly used. However, this estimator is not robust to outliers. Therefore, robust methods such as S-estimation have been proposed. These estimators flag any observation with a large residual as an outlier and downweight it in the further procedure. However, a large residual may be caused by an outlier in just a single predictor variable, and downweighting the complete observation results in a loss of information. Therefore, we propose the shooting S-estimator, a regression estimator that is especially designed for situations where a large number of observations suffer from contamination in a small number of predictor variables. The shooting S-estimator combines the ideas of the coordinate descent algorithm with simple S-regression, which makes it robust against componentwise contamination, at the cost of failing the regression equivariance property.

    robustHD: An R package for robust regression with high-dimensional data

    An Object-Oriented Framework for Statistical Simulation: The R Package simFrame

    Simulation studies are widely used by statisticians to gain insight into the quality of developed methods. Usually some guidelines regarding, e.g., simulation designs, contamination, missing data models or evaluation criteria are necessary in order to draw meaningful conclusions. The R package simFrame is an object-oriented framework for statistical simulation, which allows researchers to make use of a wide range of simulation designs with minimal programming effort. Its object-oriented implementation provides clear interfaces for extensions by the user. Since statistical simulation is an embarrassingly parallel process, the framework supports parallel computing to increase computational performance. Furthermore, an appropriate plot method is selected automatically depending on the structure of the simulation results. In this paper, the implementation of simFrame is discussed in great detail and the functionality of the framework is demonstrated in examples for different simulation designs.

    Generating a Close-to-Reality Synthetic Population of Ghana

    The purpose of this research is to generate a close-to-reality synthetic human population for use in a geosimulation of urban dynamics. Two commonly accepted approaches to generating synthetic human populations are Iterative Proportional Fitting (IPF) and Resampling with Replacement. While these methods are effective at reproducing one instance of the probability model describing the survey, it is an instance with extremely small variability amongst subgroups and is very unlikely to be the real population. IPF and Resampling with Replacement also rely on pure replication of units from the underlying sample, which can increase unrealistic model behavior. In this work we present a sequential logic for estimating variables using multinomial logistic regressions and the conditional probabilities amongst each variable in order to generate combinations which were not represented in the original survey but are likely to occur in the real population. We also present a model-based approach to imputing missing observation responses and apply the methodology to the Ghana Living Standard Survey 5 (GLSS5) in order to generate a comprehensive synthetic population for the Republic of Ghana, including such household and person variables as household size, tribal affiliation, educational attainment and annual income, amongst others. The R language and environment for statistical computing was used, as well as the packages VIM and simPopulation, in developing and executing the code. Contingency coefficients, cumulative distributions, mosaic plots, and box plots are presented for evaluation in order to demonstrate the effectiveness of the new method in its application to Ghana.

    Cost Efficient Tillage and Rotation Options for Mitigating GHG Emissions from Agriculture in Eastern Canada

    The economic efficiency of cropping options to mitigate GHG emissions from agriculture in Eastern Canada was analyzed. Data on yield response to tillage (moldboard plow and chisel plow) and six corn-based rotations were obtained from a 20-year field experiment in Ontario. Budgets were constructed for each cropping system, while GHG emissions were measured for soil carbon and were estimated for nitrous oxide according to IPCC methodology. Complex crop rotations with legumes, such as corn-corn-soybeans-wheat with red clover underseeded, have higher net returns and substantially (more than 1 Mg ha⁻¹ year⁻¹) lower GHG emissions than continuous corn. Reduced tillage reduces GHG emissions due to lower input use, but no sequestration effect could be found in the soil from tillage. Rotation had a much bigger effect on the mitigation potential of GHG emissions than tillage. However, opportunity costs of more than $200 per Mg CO2 eq ha⁻¹ year⁻¹ indicate the limits to increasing the mitigation potential beyond the level of the economically best cropping system. Environmental Economics and Policy,

    Simulation a Close-to-Reality Synthetic Population of the Greater Accra Region

    The purpose of this research is to simulate a synthetic population of the Greater Accra Metropolitan Region (GAMA) from the 2005 Ghana Living Standards Survey (GLSS5) for use in the Greater Accra Urban Simulation System (GAUSS). A primary goal in simulating the synthetic population of GAMA is to employ a method which generates close-to-reality population data rather than repeatedly drawing samples. In order to generate close-to-reality synthetic data, combinations which were not represented in the original household survey but are likely to occur in the true population must occur in the synthetically generated data. The author estimates the conditional distributions with multinomial logistic regression models in order to simulate categorical and continuous variables. Random zeros, as opposed to structural zeros, are also reflected in the synthetically generated Greater Accra population. One of the main reasons for avoiding pure replication of units from the underlying sample is that it generally leads to small variability of units within smaller subgroups, which results in an increase in unrealistic model behavior when population data is used as input for agent-based simulations of urban dynamics.

    RadixSpline: A Single-Pass Learned Index

    Recent research has shown that learned models can outperform state-of-the-art index structures in size and lookup performance. While this is a very promising result, existing learned structures are often cumbersome to implement and are slow to build. In fact, most approaches that we are aware of require multiple training passes over the data. We introduce RadixSpline (RS), a learned index that can be built in a single pass over the data and is competitive with state-of-the-art learned index models, like RMI, in size and lookup performance. We evaluate RS using the SOSD benchmark and show that it achieves competitive results on all datasets, despite the fact that it only has two parameters. Comment: Third International Workshop on Exploiting Artificial Intelligence Techniques for Data Management (aiDM 2020