
    Ratings and rankings: Voodoo or Science?

    Full text link
    Composite indicators aggregate a set of variables using weights that are understood to reflect the variables' importance in the index. In this paper we propose to measure the importance of a given variable within existing composite indicators via Karl Pearson's `correlation ratio'; we call this measure the `main effect'. Because socio-economic variables are heteroskedastic and correlated, (relative) nominal weights hardly ever match (relative) main effects; we propose to summarize their discrepancy with a divergence measure. We further discuss to what extent the mapping from nominal weights to main effects can be inverted. This analysis is applied to five composite indicators, including the Human Development Index and two popular league tables of university performance. We find that in many cases the declared importance of single indicators and their main effects are very different, and that the data correlation structure often prevents developers from obtaining the stated importance, even when modifying the nominal weights within the set of nonnegative numbers with unit sum.
    Comment: 28 pages, 7 figures
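The gap between nominal weights and main effects that this abstract describes can be illustrated numerically. The sketch below (all data and weight values are hypothetical, and it is not the authors' code) builds a composite index from three correlated, heteroskedastic indicators and compares the nominal weights with normalized squared correlations between each indicator and the index, a linear-case stand-in for Pearson's correlation ratio:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: three heteroskedastic, correlated indicators (hypothetical).
n = 10_000
x1 = rng.normal(0, 1.0, n)
x2 = 0.6 * x1 + rng.normal(0, 2.0, n)   # correlated with x1, larger variance
x3 = rng.normal(0, 0.5, n)
X = np.column_stack([x1, x2, x3])

# Nominal weights (declared importance): nonnegative with unit sum.
w = np.array([0.5, 0.3, 0.2])
index = X @ w

# "Main effect" of each variable, here taken as its squared Pearson
# correlation with the index (a linear stand-in for the correlation ratio).
main = np.array([np.corrcoef(X[:, j], index)[0, 1] ** 2 for j in range(3)])
rel_main = main / main.sum()

print("nominal weights      :", w)
print("relative main effects:", np.round(rel_main, 3))
```

Because the second indicator has the largest variance, its relative main effect exceeds its nominal weight: the declared and the effective importance diverge, as the paper argues.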

    Consistent distribution-free K-sample and independence tests for univariate random variables

    Full text link
    A popular approach for testing whether two univariate random variables are statistically independent consists of partitioning the sample space into bins and evaluating a test statistic on the binned data. The partition size matters, and the optimal partition size is data dependent. While coarse partitions may be best for detecting simple relationships, a great gain in power can be achieved for complex relationships by considering finer partitions. We suggest novel consistent distribution-free tests that are based on summation or maximization aggregation of scores over all partitions of a fixed size. We show that our test statistics based on summation can serve as good estimators of the mutual information. Moreover, we suggest regularized tests that aggregate over all partition sizes, and prove that those are consistent too. We provide polynomial-time algorithms, which are critical for computing the suggested test statistics efficiently. We show that the power of the regularized tests is excellent compared to existing tests, and almost as powerful as the tests based on the optimal (yet unknown in practice) partition size, in simulations as well as on a real data example.
    Comment: arXiv admin note: substantial text overlap with arXiv:1308.155
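The abstract's point that finer partitions are needed to detect complex relationships can be sketched with a plug-in mutual-information estimate on an equal-count partition. This is only an illustration of the binning idea, not the authors' test statistic; the data and partition sizes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

def binned_mi(x, y, m):
    """Plug-in mutual information over an m-by-m equal-count partition."""
    # Rank-based (quantile) bin edges make the partition distribution-free.
    rx = np.searchsorted(np.quantile(x, np.linspace(0, 1, m + 1)[1:-1]), x)
    ry = np.searchsorted(np.quantile(y, np.linspace(0, 1, m + 1)[1:-1]), y)
    counts = np.zeros((m, m))
    np.add.at(counts, (rx, ry), 1)
    p = counts / counts.sum()
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / (px @ py)[nz])).sum())

n = 5_000
x = rng.normal(size=n)
y_dep = x ** 2 + 0.5 * rng.normal(size=n)   # non-monotone dependence
y_ind = rng.normal(size=n)                  # independent of x

for m in (2, 4, 8):
    print(m, round(binned_mi(x, y_dep, m), 3), round(binned_mi(x, y_ind, m), 3))
```

With a 2-by-2 partition the symmetric, non-monotone relationship is nearly invisible (the median split of x carries almost no information about x squared), while finer partitions separate the dependent from the independent case clearly.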

    Computing Multi-Relational Sufficient Statistics for Large Databases

    Full text link
    Databases contain information about which relationships do and do not hold among entities. Making this information accessible for statistical analysis requires computing sufficient statistics that combine information from different database tables. Such statistics may involve any number of positive and negative relationships. With a naive enumeration approach, computing sufficient statistics for negative relationships is feasible only for small databases. We solve this problem with a new dynamic programming algorithm that performs a virtual join, where the requisite counts are computed without materializing join tables. Contingency table algebra is a new extension of relational algebra that facilitates the efficient implementation of this Möbius virtual join operation. The Möbius Join scales to large datasets (over 1M tuples) with complex schemas. Empirical evaluation with seven benchmark datasets showed that information about the presence and absence of links can be exploited in feature selection, association rule mining, and Bayesian network learning.
    Comment: 11 pages, 8 figures, 8 tables, CIKM'14, November 3--7, 2014, Shanghai, China
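The core trick behind counting negative relationships without materializing a complement table is inclusion-exclusion: the count of entity pairs where a link is absent equals the count of all pairs minus the count where it is present. A minimal sketch (the schema, table names, and data are all made up, and this is far simpler than the paper's Möbius Join):

```python
# Toy schema (hypothetical): students, courses, and a Registered link table.
students = ["s1", "s2", "s3"]
courses = ["c1", "c2"]
registered = {("s1", "c1"), ("s2", "c1"), ("s2", "c2")}

# Counts for the POSITIVE relationship, grouped by course, via a normal scan.
pos = {c: sum(1 for s, cc in registered if cc == c) for c in courses}

# Counts for the NEGATIVE relationship (student NOT registered in a course)
# by Moebius-style subtraction: |students| minus |registered in c|.
# No complement ("not registered") table is ever materialized.
neg = {c: len(students) - pos[c] for c in courses}

print(pos)  # {'c1': 2, 'c2': 1}
print(neg)  # {'c1': 1, 'c2': 2}
```

The paper's dynamic programming algorithm applies this subtraction recursively across any number of relationships and grouping attributes; the sketch shows only the one-relationship base case.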

    Geographically intelligent disclosure control for flexible aggregation of census data

    No full text
    This paper describes a geographically intelligent approach to disclosure control for protecting flexibly aggregated census data. Increased analytical power has stimulated user demand for more detailed information for smaller geographical areas and customized boundaries. Consequently, it is vital that improved methods of statistical disclosure control are developed to protect against the increased disclosure risk. Traditionally, methods of statistical disclosure control have been aspatial in nature. Here we present a geographically intelligent approach that takes into account the spatial distribution of risk. We describe empirical work illustrating how the flexibility of this new method, called local density swapping, is an improved alternative to random record swapping in terms of the risk-utility trade-off.
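Random record swapping, the aspatial baseline the paper improves on, can be sketched in a few lines: geographic codes are exchanged between randomly chosen record pairs, perturbing locations while leaving marginal totals intact. The data, swap rate, and field names below are hypothetical, and this is the baseline method, not the paper's local density swapping:

```python
import random

random.seed(0)

# Toy microdata: (record_id, area_code), all values hypothetical.
records = [(i, f"area{j}") for i, j in enumerate([1, 1, 2, 2, 3, 3, 4, 4])]

def random_record_swap(recs, rate):
    """Baseline random record swapping: exchange area codes between pairs."""
    recs = list(recs)
    n_swaps = int(len(recs) * rate / 2)
    idx = random.sample(range(len(recs)), 2 * n_swaps)
    for a, b in zip(idx[::2], idx[1::2]):
        (ia, ga), (ib, gb) = recs[a], recs[b]
        recs[a], recs[b] = (ia, gb), (ib, ga)
    return recs

swapped = random_record_swap(records, rate=0.25)
print(swapped)
```

Because pairs are chosen uniformly, swaps ignore where disclosure risk is concentrated; local density swapping instead targets the swap effort according to the spatial distribution of risk.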

    Measuring Confidentiality Risks in Census Data

    Get PDF
    Two trends have been on a collision course over the recent past. The first is the increasing demand by researchers for greater detail and flexibility in outputs from the decennial Census of Population. The second is the need felt by the Census Offices to demonstrate more clearly that Census data have been explicitly protected from the risk of disclosure of information about individuals. To reconcile these competing trends, the authors propose a statistical measure of the risks of disclosure implicit in the release of aggregate census data. The ideas of risk measurement are first developed for microdata, where there is prior experience, and then modified to measure risk in tables of counts. To make sure that the theoretical ideas are fully expounded, the authors develop a small worked example. The risk measure proposed here is currently being tested with synthetic and real Census microdata. It is hoped that this approach will both refocus the census confidentiality debate and contribute to the safe use of user-defined flexible census output geographies.
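One simple way to see what "measuring risk in tables of counts" means in practice is to summarize how many cells of an aggregate table are small: a cell of 1 corresponds to a population-unique individual. The table below is invented and the summary is a crude stand-in for illustration, not the authors' proposed measure:

```python
import numpy as np

# Toy aggregate census table (hypothetical): rows = areas, cols = categories.
table = np.array([
    [12,  5,  1, 0],
    [ 8,  1,  2, 3],
    [ 1,  0,  6, 9],
])

# Share of small nonzero cells as a disclosure-risk style summary:
# a cell of 1 identifies a unique individual; a cell of 2 nearly so.
nonzero = table[table > 0]
risk_unique = float((nonzero == 1).mean())   # proportion of unit cells
risk_small = float((nonzero <= 2).mean())    # proportion of cells of 1 or 2

print(risk_unique, risk_small)
```

A flexible-geography release is risky precisely because user-defined boundaries can be chosen (or differenced) to produce many such small cells, which is why the risk measure must be evaluated per output geography.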

    Inconsistencies in Reported Employment Characteristics among Employed Stayers

    Get PDF
    The paper deals with measurement error, and its potentially distorting role, in information on industry and professional status collected by labour force surveys. The focus of our analyses is on inconsistent information on these employment characteristics resulting from yearly transition matrices for workers who were continuously employed over the year and who did not change job. As a case-study we use yearly panel data for the period from April 1993 to April 2003 collected by the Italian Quarterly Labour Force Survey. The analysis goes through four steps: (i) descriptive indicators of (dis)agreement; (ii) testing whether the consistency of repeated information significantly increases when the number of categories is collapsed; (iii) examination of the pattern of inconsistencies among response categories by means of Goodman's quasi-independence model; (iv) comparisons of alternative classifications. Results document sizable measurement error, which is only moderately reduced by more aggregated classifications. They suggest that even cross-section estimates of employment by industry and/or professional status are affected by non-random measurement error.
    Keywords: industry, professional status, measurement errors, survey data
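Steps (i) and (ii) of the abstract's analysis can be sketched from a single transition matrix: measure agreement as the diagonal share, then collapse categories and re-measure. The matrix and the grouping below are entirely made up for illustration:

```python
import numpy as np

# Hypothetical transition matrix of reported industry for job stayers:
# rows = category reported at t, cols = category reported at t+1 (counts).
T = np.array([
    [80, 10,  5,  5],
    [12, 70,  8, 10],
    [ 6,  9, 75, 10],
    [ 4, 11,  9, 76],
])

def agreement(m):
    """Share of workers whose repeated reports agree (diagonal mass)."""
    return float(np.trace(m) / m.sum())

def collapse(m, groups):
    """Aggregate the matrix into broader categories given index groups."""
    k = len(groups)
    out = np.zeros((k, k))
    for i, gi in enumerate(groups):
        for j, gj in enumerate(groups):
            out[i, j] = m[np.ix_(gi, gj)].sum()
    return out

# Step (ii): collapse the 4 categories into 2 broader ones (made-up grouping).
T2 = collapse(T, [[0, 1], [2, 3]])
print(round(agreement(T), 4), round(agreement(T2), 4))
```

Collapsing raises agreement because off-diagonal mass within a group becomes diagonal, but, as the paper finds for its real data, the gain from aggregation can remain modest relative to the remaining inconsistency.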