    Effect of Proportion of Missing Data on Application of Data Imputation in Pavement Management Systems

    Missing data are commonly found in pavement condition/performance databases. A common practice today is to apply statistical imputation methods to replace the missing data with imputed values. It is thus important for pavement management decision makers to know the uncertainty and errors involved in using datasets with imputed values in their analyses. Equally important in practice is the maximum allowable proportion of missing data (i.e. the level of data missingness in the pavement condition/performance records) that still produces results with an acceptable magnitude of error or risk when imputed data are used. This paper proposes a procedure for determining this information. A numerical example analyzing pavement roughness data demonstrates the procedure by evaluating the error and reliability characteristics of imputed data. The roughness data of three road sections were obtained from the LTPP database. From these records, datasets with different proportions of missing data were randomly generated to study the effect of the level of data missingness. The analysis shows that the errors of imputed data increase with the level of data missingness, and that their magnitudes are significantly affected by pavement rehabilitation. For the application of data imputation in pavement management systems (PMS), the study suggests that, at the 95% confidence level, 25% missing data is a reasonable maximum allowable limit when analyzing pavement roughness time series that involve no rehabilitation within the analysis period. When pavement rehabilitation occurs within the analysis period, the maximum proportion of imputed data should be limited to 15%.
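    A minimal sketch of the kind of experiment the paper describes: starting from a complete (here synthetic) roughness series, delete increasing proportions of observations at random, impute, and track the resulting error. Linear interpolation stands in for whatever imputation method the paper uses; the series, sample size, and 95th-percentile error summary are illustrative assumptions, not the paper's data or procedure.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical monthly roughness (IRI) series standing in for an LTPP record.
n = 120
iri = 1.0 + 0.01 * np.arange(n) + rng.normal(0, 0.05, n)

def impute_error(series, missing_fraction, rng):
    """Randomly delete a fraction of observations, impute by linear
    interpolation, and return the RMSE on the deleted points."""
    s = pd.Series(series.copy())
    idx = rng.choice(np.arange(1, len(s) - 1),            # keep endpoints
                     size=int(missing_fraction * len(s)), replace=False)
    truth = s[idx].to_numpy()
    s[idx] = np.nan
    imputed = s.interpolate(method="linear")
    return np.sqrt(np.mean((imputed[idx].to_numpy() - truth) ** 2))

for frac in (0.05, 0.15, 0.25, 0.35):
    errs = [impute_error(iri, frac, rng) for _ in range(200)]
    # 95th percentile of RMSE as an error bound at the 95% confidence level
    print(f"missing {frac:.0%}: RMSE p95 = {np.percentile(errs, 95):.4f}")
```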

    Order selection tests with multiply-imputed data.

    We develop nonparametric tests of the null hypothesis that a function has a prescribed form, applicable to data sets with missing observations. Omnibus nonparametric tests do not need to specify a particular alternative parametric form and have power against a large range of alternatives; the order selection tests that we study are one example. We extend such order selection tests to the context of missing data. In particular, we consider likelihood-based order selection tests for multiply imputed data. A simulation study and a data analysis illustrate the performance of the tests. A model selection method in the style of Akaike's information criterion for multiply imputed datasets follows along the same lines.
    Keywords: Akaike information criterion; hypothesis test; multiple imputation; lack-of-fit test; missing data; omnibus test; order selection
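    A rough illustration of an order selection statistic applied to multiply imputed data. The Eubank-Hart style cosine-basis statistic below and the naive averaging across imputations are stand-ins: the paper develops proper likelihood-based tests and combination rules, which this sketch does not reproduce.

```python
import numpy as np

def order_selection_stat(x, y, K=10):
    """Eubank-Hart style order-selection statistic for H0: E[y|x] constant,
    using cosine-basis sample coefficients on x rescaled to [0, 1]."""
    n = len(y)
    u = (x - x.min()) / (x.max() - x.min())
    e = y - y.mean()
    sigma2 = np.var(e, ddof=1)
    phi = np.array([np.mean(e * np.sqrt(2) * np.cos(j * np.pi * u))
                    for j in range(1, K + 1)])
    cum = np.cumsum(n * phi ** 2 / sigma2)
    return max(cum[k - 1] / k for k in range(1, K + 1))

rng = np.random.default_rng(0)
n, M = 200, 5
x = rng.uniform(0, 1, n)
y = 0.5 * np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)  # H0 is false here
miss = rng.random(n) < 0.2                               # ~20% of y missing

stats = []
for _ in range(M):
    y_imp = y.copy()
    # hypothetical imputation model: draws around the observed-data mean
    y_imp[miss] = rng.normal(y[~miss].mean(), y[~miss].std(), miss.sum())
    stats.append(order_selection_stat(x, y_imp))
# naive combination: average the statistic over the M imputations
print("averaged order-selection statistic:", np.mean(stats))
```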

    Stochastic production frontiers with multiply imputed German establishment data

    "In this paper, stochastic production frontier models are estimated with IAB establishment data from waves 2002 and 2003 to analyze productivity and inefficiency. The data suffer from nonresponse in the most important variables (output, capital and labor) leading to the loss of 25 % of the observations and possibly imprecise estimates and invalid test statistics. Therefore the missing values are multiply imputed. The analysis of the estimation results shows that, particularly in the inefficiency submodel, working with multiply imputed data reveals some interesting and plausible results which are not available when missing observations are ignored." (Author's abstract, IAB-Doku) ((en))IAB-Betriebspanel, Schätzung, Fehler, Datenaufbereitung, angewandte Statistik

    XSim: Simulation of Descendants from Ancestors with Sequence Data.

    Real or imputed high-density SNP genotypes are routinely used for genomic prediction and genome-wide association studies, and many researchers are moving toward the use of actual or imputed next-generation sequence data in whole-genome analyses. Simulation studies are useful to mimic complex scenarios and to test different analytical methods. We have developed the software tool XSim to efficiently simulate sequence data for the descendants in arbitrary pedigrees. To simulate sequence data and to accommodate complicated pedigree structures across multiple generations, the software implements a strategy that drops down the origins and positions of chromosomal segments rather than every allele state. Both C++ and Julia versions of XSim have been developed.
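    A toy illustration of the segment drop-down idea (XSim itself is written in C++ and Julia; this Python sketch only mirrors the strategy, not its implementation). Each haplotype is a sorted list of (start position, founder origin) segments, so meiosis splices segments at crossover points instead of copying every allele; allele states are looked up from founder sequences only when genotypes are actually needed.

```python
import numpy as np

rng = np.random.default_rng(7)
CHR_LEN = 1.0  # chromosome length in Morgans

def segments_between(hap, lo, hi):
    """Restrict a haplotype (sorted list of (start, founder_origin)
    segments) to the interval [lo, hi)."""
    out = []
    for i, (s, origin) in enumerate(hap):
        e = hap[i + 1][0] if i + 1 < len(hap) else CHR_LEN
        if e > lo and s < hi:
            out.append((max(s, lo), origin))
    return out

def meiosis(hap1, hap2, rng):
    """Form a gamete by alternating between the two parental haplotypes
    at Poisson-distributed crossover points."""
    xs = sorted(rng.uniform(0, CHR_LEN, rng.poisson(CHR_LEN)))
    bounds = [0.0] + xs + [CHR_LEN]
    cur = int(rng.integers(2))          # which parental haplotype starts
    gamete = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        gamete += segments_between((hap1, hap2)[cur], lo, hi)
        cur = 1 - cur
    return gamete

# Founders carry single segments labeled by founder haplotype id.
sire = ([(0.0, 0)], [(0.0, 1)])
dam = ([(0.0, 2)], [(0.0, 3)])
child = (meiosis(*sire, rng), meiosis(*dam, rng))
print("child's paternal haplotype:", child[0])
```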

    Integration of survey data and big observational data for finite population inference using mass imputation

    Multiple data sources are becoming increasingly available for statistical analyses in the era of big data. As an important example in finite-population inference, we consider an imputation approach to combining a probability sample with big observational data. Unlike the usual imputation for missing data analysis, we create imputed values for all elements of the probability sample. Such mass imputation is attractive in the context of survey data integration (Kim and Rao, 2012). We extend mass imputation as a tool for integrating survey data with big non-survey data. The mass imputation methods and their statistical properties are presented. The matching estimator of Rivers (2007) is also covered as a special case. Variance estimation with mass-imputed data is discussed. Simulation results demonstrate that the proposed estimators outperform existing competitors in terms of robustness and efficiency.
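    A minimal sketch of mass imputation, assuming a hypothetical setup where the big observational data contain both x and y, the probability sample contains only x and design weights, and a linear outcome model is adequate; the paper's theory, the matching estimator of Rivers (2007), and its variance estimation go well beyond this.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)

# Big non-probability dataset: x and y both observed (synthetic).
Nb = 50_000
xb = rng.normal(0, 1, (Nb, 1))
yb = 2.0 + 1.5 * xb[:, 0] + rng.normal(0, 1, Nb)

# Probability sample: x and design weights observed, y not observed.
ns = 500
xs = rng.normal(0, 1, (ns, 1))
w = np.full(ns, 100.0)                # hypothetical equal design weights

# Mass imputation: fit an outcome model on the big data, then impute
# y for every element of the probability sample.
model = LinearRegression().fit(xb, yb)
y_imp = model.predict(xs)

# Weighted (Horvitz-Thompson style) estimator of the population mean.
print("mass-imputation estimate:", np.sum(w * y_imp) / np.sum(w))
```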

    Analyzing imputed financial data: a new approach to cluster analysis

    The authors introduce a novel statistical modeling technique to cluster analysis and apply it to financial data. Their two main goals are to handle missing data and to find homogeneous groups within the data. Their approach is flexible and handles large and complex data structures with missing observations and with both quantitative and qualitative measurements. The authors achieve this by mapping the data to a new structure that is free of distributional assumptions when choosing homogeneous groups of observations. The new method also provides insight into the number of categories needed to classify the data. The authors use this approach to partition a matched sample of stocks, where one group offers dividend reinvestment plans and the other does not. Their method partitions this sample with almost 97 percent accuracy, even when using only easily available financial variables. One interpretation of this result is that the misclassified companies are the best candidates either to adopt a dividend reinvestment plan (if they have none) or to abandon one (if they currently offer one). The authors offer further suggestions for applications in the field of finance.
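    For orientation only, a generic impute-then-cluster baseline on synthetic data; it does not reproduce the authors' distribution-free mapping technique, and the variables, missingness rate, and choice of k-means are all assumptions made for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)

# Hypothetical financial ratios for 200 firms, with ~10% of values missing.
X = rng.normal(0, 1, (200, 6))
X[rng.random(X.shape) < 0.10] = np.nan

# Baseline pipeline: impute, standardize, then cluster into two groups
# (e.g., firms with and without a dividend reinvestment plan).
X_imp = SimpleImputer(strategy="median").fit_transform(X)
X_std = StandardScaler().fit_transform(X_imp)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_std)
print("cluster sizes:", np.bincount(labels))
```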