
    The Data Quality Concept of Accuracy in the Context of Public Use Data Sets

    Like other data quality dimensions, the concept of accuracy is often adopted to characterise a particular data set. However, its common specification basically refers to statistical properties of estimators, which can hardly be verified by means of a single survey at hand. This ambiguity can be resolved by assigning 'accuracy' to the survey processes that are known to affect these properties. In this contribution, we consider the sub-process of imputation as one important step in setting up a data set and argue that the so-called 'hit-rate' criterion, which is intended to measure the accuracy of a data set by some distance function between 'true' but unobserved values and imputed values, is neither required nor desirable. In contrast, the so-called 'inference' criterion allows for valid inferences based on a suitably completed data set under rather general conditions. The underlying theoretical concepts are illustrated by means of a simulation study. It is emphasised that the same principal arguments apply to other survey processes that introduce uncertainty into an edited data set.
    Keywords: Survey Quality, Survey Processes, Accuracy, Assessment of Imputation Methods, Multiple Imputation
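
    As a rough illustration of the contrast drawn above, the sketch below scores a naive 'hit-rate' distance and, separately, pools a scalar estimate over m completed data sets with Rubin's combining rules (the 'inference' criterion). This is a minimal Python sketch with hypothetical names and data, not the authors' code.

    import numpy as np

    def hit_rate_criterion(true_vals, imputed_vals):
        # Distance between 'true' but unobserved and imputed values:
        # the criterion the paper argues is neither required nor desirable.
        return np.mean(np.abs(np.asarray(true_vals) - np.asarray(imputed_vals)))

    def rubin_pool(estimates, variances):
        # Pool a scalar estimate over m completed data sets; valid pooled
        # inference is what the 'inference' criterion asks for.
        estimates = np.asarray(estimates, dtype=float)
        variances = np.asarray(variances, dtype=float)
        m = len(estimates)
        q_bar = estimates.mean()            # pooled point estimate
        w_bar = variances.mean()            # within-imputation variance
        b = estimates.var(ddof=1)           # between-imputation variance
        t = w_bar + (1.0 + 1.0 / m) * b     # total variance of q_bar
        return q_bar, t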

    Representative Wealth Data for Germany from the German SOEP: The Impact of Methodological Decisions around Imputation and the Choice of the Aggregation Unit

    The definition and operationalization of wealth information in population surveys and the corresponding microdata require a wide range of more or less normative assumptions. However, the decisions made in both the pre- and post-data-collection stages may interfere considerably with the substantive research question. Looking at wealth data from the German SOEP, this paper focuses on the impact of collecting information at the individual rather than the household level, and on "imputation and editing" as a means of dealing with measurement error. First, we assess how the choice of the unit of aggregation or unit of analysis affects wealth distribution and inequality analysis. When measured in "per capita household" terms, wealth is less unequally distributed than at the individual level. This is the result of significant redistribution within households, and also provides evidence of a significant persisting gender wealth gap. Secondly, we find multiple imputation to be an effective means of coping with selective non-response. There is a significant impact of imputation on the share of wealth holders (increasing on average by 15%) and also on aggregate wealth (plus 30%). However, with respect to inequality, the results are ambiguous. Looking at the major outcome variable for the whole population (net worth), the Gini coefficient decreases, whereas a top-sensitive measure doubles. The non-random selectivity built into the missing process, and the consideration of this selectivity in the imputation process, clearly contribute to this finding. The treatment of measurement errors after data collection, especially with respect to the imputation of missing values, thus affects cross-national comparability and may require some cross-national harmonization of the imputation strategies applied to the various national datasets.
    Keywords: Wealth, Item non-response, Multiple imputation, SOEP

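    To illustrate the kind of sensitivity to imputation described above, the following sketch computes a Gini coefficient on complete cases only and again after appending imputed non-respondents. The numbers are purely illustrative, not SOEP data, and the simple rank formula assumes non-negative values (net worth can be negative in practice).

    import numpy as np

    def gini(x):
        # Gini coefficient via the sorted-rank formula; assumes x >= 0.
        x = np.sort(np.asarray(x, dtype=float))
        n = x.size
        ranks = np.arange(1, n + 1)
        return 2.0 * np.sum(ranks * x) / (n * x.sum()) - (n + 1.0) / n

    rng = np.random.default_rng(0)
    observed = rng.lognormal(10.0, 1.5, size=900)     # respondents
    imputed = rng.lognormal(11.0, 2.0, size=100)      # imputed non-respondents
    print(gini(observed))                             # complete cases only
    print(gini(np.concatenate([observed, imputed])))  # after imputation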

    Calibrated imputation of numerical data under linear edit restrictions

    A common problem faced by statistical offices is that data may be missing from collected data sets. The typical way to overcome this problem is to impute the missing data. Imputation is complicated by the fact that statistical data often have to satisfy certain edit rules and that the values of some variables have to sum to known totals. Standard imputation methods for numerical data described in the literature generally do not take such edit rules and totals into account. In this paper we describe algorithms for the imputation of missing numerical data that do take edit restrictions into account and that ensure that sums are calibrated to known totals. The methods impute the missing data sequentially, i.e. the variables with missing values are imputed one by one. To assess the performance of the imputation methods, a simulation study is carried out, as well as an evaluation study based on a real data set.
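
    A minimal sketch of the general idea, not the paper's algorithms: one variable is imputed preliminarily, then the imputed cells are calibrated so that the completed column sums to a known total, while a simple non-negativity edit is enforced. All names are hypothetical.

    import numpy as np

    def calibrated_impute(col, known_total):
        # Impute missing entries of one variable, then calibrate the imputed
        # cells so the completed column sums to the known total.
        col = np.array(col, dtype=float)
        miss = np.isnan(col)
        col[miss] = np.nanmean(col)                  # preliminary imputation
        target = known_total - col[~miss].sum()      # sum the imputed cells must reach
        if miss.any() and col[miss].sum() > 0:
            col[miss] *= target / col[miss].sum()    # calibrate imputed cells
        # Simple linear edit x >= 0; note that clipping can perturb the total,
        # so a production method would re-calibrate after enforcing edits.
        return np.clip(col, 0.0, None)

    completed = calibrated_impute([10.0, np.nan, 30.0, np.nan], known_total=100.0)
    print(completed, completed.sum())                # completed column sums to 100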

    Use of partial least squares regression to impute SNP genotypes in Italian cattle breeds

    Background: The objective of the present study was to test the ability of the partial least squares regression technique to impute genotypes from low-density single nucleotide polymorphism (SNP) panels, i.e. 3K or 7K, to a high-density panel with 50K SNP. No pedigree information was used.
    Methods: Data consisted of 2093 Holstein, 749 Brown Swiss and 479 Simmental bulls genotyped with the Illumina 50K Beadchip. First, a single-breed approach was applied using only data from Holstein animals. Then, to enlarge the training population, data from the three breeds were combined and a multi-breed analysis was performed. Accuracies of genotypes imputed using the partial least squares regression method were compared with those obtained using the Beagle software. The impact of genotype imputation on breeding value prediction was evaluated for milk yield, fat content and protein content.
    Results: In the single-breed approach, the accuracy of imputation using partial least squares regression was around 90% and 94% for the 3K and 7K platforms, respectively; the corresponding accuracies obtained with Beagle were around 85% and 90%. Moreover, the computing time required by the partial least squares regression method was on average around 10 times lower than that required by Beagle. Using the partial least squares regression method in the multi-breed analysis resulted in lower imputation accuracies than using single-breed data. The impact of SNP-genotype imputation on the accuracy of direct genomic breeding values was small. The correlation between estimates of genetic merit obtained using imputed versus actual genotypes was around 0.96 for the 7K chip.
    Conclusions: The results of the present work suggest that the partial least squares regression imputation method could be useful for imputing SNP genotypes when pedigree information is not available.
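
    A minimal sketch of the approach, assuming genotypes coded 0/1/2 and using scikit-learn's PLSRegression; the synthetic matrices below stand in for the real low-density and 50K panels, so this is not the authors' pipeline.

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    rng = np.random.default_rng(1)
    # X: low-density panel (animals x SNPs), Y: high-density targets.
    X_train = rng.integers(0, 3, size=(500, 300)).astype(float)
    Y_train = rng.integers(0, 3, size=(500, 1000)).astype(float)
    X_new = rng.integers(0, 3, size=(50, 300)).astype(float)

    pls = PLSRegression(n_components=50)   # number of latent components is a tuning choice
    pls.fit(X_train, Y_train)
    # Round predictions to the nearest valid genotype code in {0, 1, 2}.
    Y_hat = np.clip(np.rint(pls.predict(X_new)), 0, 2)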

    Editing and multiply imputing German establishment panel data to estimate stochastic production frontier models

    "This paper illustrates the effects of item-nonresponse in surveys on the results of multivariate statistical analysis when estimation of productivity is the task. To multiply impute the missing data a data augmentation algorithm based on a normal/Wishart model is applied. Data of the German IAB Establishment Panel from waves 2000 and 2001 are used to estimate the establishment's productivity. The processes of constructing, editing, and transforming the variables needed for the analyst's as well as the imputer's models are described. It is shown that standard multiple imputation techniques can be used to estimate sophisticated econometric models from large-scale panel data exposed to item-nonresponse. Basis of the empirical analysis is a stochastic production frontier model with labour and capital as input factors. The results show that a model of technical inefficiency is favoured compared to a case where we assume different production functions in East and West Germany. Also we see that the effect of regional setting on technical inefficiency increases when inference is based on multiply imputed data sets. These results may stimulate future research and could have influence on the economic and regional policies in Germany. " (Author's abstract, IAB-Doku) ((en))IAB-Betriebspanel, Befragung, Antwortverhalten, ProduktivitÀt, Datenanalyse, SchÀtzung, betriebliche Kennzahlen, Imputationsverfahren

    Electrostatic Field Classifier for Deficient Data

    This paper investigates the suitability of recently developed models based on physical field phenomena for classification problems with incomplete datasets. An original approach to exploiting incomplete training data with missing features and labels, involving extensive use of an electrostatic charge analogy, is proposed. Classification of incomplete patterns is investigated using a local dimensionality reduction technique, which aims at exploiting all available information rather than trying to estimate the missing values. The performance of all proposed methods is tested on a number of benchmark datasets for a wide range of missing data scenarios and compared with the performance of some standard techniques. Several modifications of the original electrostatic field classifier, aiming at improving speed and robustness in higher-dimensional spaces, are also discussed.
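
    A minimal sketch of the charge analogy as we read it (an interpretation, not the paper's exact model): each training point attracts a test point with an inverse-square 'field' for its class, and missing test features are handled by measuring distance only over the observed dimensions rather than estimating the missing values. Complete training data are assumed here.

    import numpy as np

    def field_classify(X_train, y_train, x, eps=1e-9):
        # Assign x (which may contain NaNs) to the class whose training
        # points exert the strongest summed Coulomb-like attraction.
        obs = ~np.isnan(x)                    # observed dimensions of x only
        strengths = {}
        for c in np.unique(y_train):
            diff = X_train[y_train == c][:, obs] - x[obs]
            d2 = np.sum(diff ** 2, axis=1)    # squared distances
            strengths[c] = np.sum(1.0 / (d2 + eps))   # 1/r^2 attraction
        return max(strengths, key=strengths.get)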