Ratings and rankings: Voodoo or Science?
Composite indicators aggregate a set of variables using weights which are
understood to reflect the variables' importance in the index. In this paper we
propose to measure the importance of a given variable within existing composite
indicators via Karl Pearson's 'correlation ratio'; we call this measure 'main
effect'. Because socio-economic variables are heteroskedastic and correlated,
(relative) nominal weights are hardly ever found to match (relative) main
effects; we propose to summarize their discrepancy with a divergence measure.
We further discuss to what extent the mapping from nominal weights to main
effects can be inverted. This analysis is applied to five composite indicators,
including the Human Development Index and two popular league tables of
university performance. It is found that in many cases the declared importance
of single indicators and their main effect are very different, and that the
data correlation structure often prevents developers from obtaining the stated
importance, even when the nominal weights are varied over all nonnegative
values summing to one.
Comment: 28 pages, 7 figures
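The main effect is straightforward to estimate from data. As a minimal sketch (Python with NumPy is our choice here; the function and the toy composite below are ours, not the paper's), the correlation ratio of a composite score with respect to one input can be approximated by binning that input and comparing the variance of the bin-wise conditional means with the overall variance:

```python
import numpy as np

def correlation_ratio(x, y, n_bins=20):
    """Plug-in estimate of Pearson's correlation ratio
    eta^2 = Var(E[y|x]) / Var(y), using equal-frequency bins of x."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    idx = np.digitize(x, edges[1:-1])          # bin index in 0..n_bins-1
    between = sum(
        (idx == b).mean() * (y[idx == b].mean() - y.mean()) ** 2
        for b in range(n_bins) if (idx == b).any()
    )                                          # variance of conditional means
    return between / y.var()

# Toy composite with heteroskedastic, correlated inputs
rng = np.random.default_rng(0)
x1 = rng.normal(0, 1, 10_000)
x2 = 0.8 * x1 + rng.normal(0, 2, 10_000)       # correlated, larger variance
score = 0.5 * x1 + 0.5 * x2                    # equal nominal weights

print([correlation_ratio(x, score) for x in (x1, x2)])
```

On this toy composite the noisier, correlated input x2 captures most of the main effect even though the nominal weights are equal, which is exactly the kind of nominal-weight/main-effect discrepancy the paper quantifies.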
FAM222A encodes a protein which accumulates in plaques in Alzheimer's disease.
Alzheimer's disease (AD) is characterized by amyloid plaques and progressive cerebral atrophy. Here, we report FAM222A as a putative brain atrophy susceptibility gene. Our cross-phenotype association analysis of imaging genetics indicates a potential link between FAM222A and AD-related regional brain atrophy. The protein encoded by FAM222A is predominantly expressed in the CNS and is increased in brains of patients with AD and in an AD mouse model. It accumulates within amyloid deposits, physically interacts with amyloid-β (Aβ) via its N-terminal Aβ binding domain, and facilitates Aβ aggregation. Intracerebroventricular infusion or forced expression of this protein exacerbates neuroinflammation and cognitive dysfunction in an AD mouse model, whereas ablation of this protein suppresses the formation of amyloid deposits, neuroinflammation and cognitive deficits in the AD mouse model. Our data support the pathological relevance of the protein encoded by FAM222A in AD.
Consistent distribution-free k-sample and independence tests for univariate random variables
A popular approach for testing if two univariate random variables are
statistically independent consists of partitioning the sample space into bins,
and evaluating a test statistic on the binned data. The partition size matters,
and the optimal partition size is data dependent. While for detecting simple
relationships coarse partitions may be best, for detecting complex
relationships a great gain in power can be achieved by considering finer
partitions. We suggest novel consistent distribution-free tests that are based
on summation or maximization aggregation of scores over all partitions of a
fixed size. We show that our test statistics based on summation can serve as
good estimators of the mutual information. Moreover, we suggest regularized
tests that aggregate over all partition sizes, and prove those are consistent
too. We provide polynomial-time algorithms, which are critical for computing
the suggested test statistics efficiently. We show that the power of the
regularized tests is excellent compared to existing tests, and almost as
powerful as the tests based on the optimal (yet unknown in practice) partition
size, in simulations as well as on a real data example.
Comment: arXiv admin note: substantial text overlap with arXiv:1308.155
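For intuition about the summation aggregation, here is a deliberately naive sketch (Python with NumPy/SciPy; the names are ours, and this cubic-time loop is for illustration only, since the paper's contribution is computing such sums in polynomial time for general partition sizes). It sums likelihood-ratio scores over all 2x2 partitions of the rank-rank plane; p-values would come from a permutation null:

```python
import numpy as np
from scipy.stats import rankdata

def sum_lr_2x2(x, y):
    """Sum of likelihood-ratio scores over all 2x2 partitions of the
    rank-rank plane, one partition per pair of cutpoints (no ties assumed)."""
    n = len(x)
    rx = rankdata(x, method="ordinal")
    ry = rankdata(y, method="ordinal")
    total = 0.0
    for cx in range(1, n):                 # cut after rank cx on the x axis
        for cy in range(1, n):             # cut after rank cy on the y axis
            a = np.sum((rx <= cx) & (ry <= cy))          # low-low cell count
            o = np.array([a, cx - a, cy - a, n - cx - cy + a], float)
            e = np.array([cx * cy, cx * (n - cy),
                          (n - cx) * cy, (n - cx) * (n - cy)], float) / n
            nz = o > 0                     # 0 * log 0 = 0 convention
            total += 2.0 * np.sum(o[nz] * np.log(o[nz] / e[nz]))
    return total
```

Loosely speaking, normalizing such a summed likelihood-ratio score by the sample size and the number of partitions yields the kind of mutual-information estimate the abstract alludes to.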
Geographically intelligent disclosure control for flexible aggregation of census data
This paper describes a geographically intelligent approach to disclosure control for protecting flexibly aggregated census data. Increased analytical power has stimulated user demand for more detailed information for smaller geographical areas and customized boundaries. Consequently, it is vital that improved methods of statistical disclosure control are developed to protect against the increased disclosure risk. Traditionally, methods of statistical disclosure control have been aspatial in nature. Here we present a geographically intelligent approach that takes into account the spatial distribution of risk. We describe empirical work illustrating how the flexibility of this new method, called local density swapping, makes it an improved alternative to random record swapping in terms of the risk-utility trade-off.
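The abstract leaves the mechanics of local density swapping unspecified; purely to illustrate the idea of targeting swaps by the spatial distribution of risk rather than uniformly at random, here is a hypothetical Python sketch (all names, the risk scores, and the one-dimensional coordinate simplification are ours, not the published method):

```python
import numpy as np

def risk_targeted_swap(coords, area, risk, rng, swap_rate=0.05):
    """Hypothetical caricature of spatially targeted record swapping:
    records are selected with probability proportional to a local
    disclosure-risk score, sorted by location, and adjacent selections
    exchange area codes so that swaps stay geographically local."""
    n = len(area)
    k = max(2, int(swap_rate * n)) // 2 * 2        # even number of records
    chosen = rng.choice(n, size=k, replace=False, p=risk / risk.sum())
    order = chosen[np.argsort(coords[chosen])]     # sort selections spatially
    swapped = area.copy()
    for i, j in zip(order[0::2], order[1::2]):     # pair spatial neighbours
        swapped[i], swapped[j] = swapped[j], swapped[i]
    return swapped
```

Random record swapping corresponds to dropping the risk weighting and the spatial pairing; concentrating swaps where disclosure risk is high is what buys the improved risk-utility balance described above.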
Computing Multi-Relational Sufficient Statistics for Large Databases
Databases contain information about which relationships do and do not hold
among entities. Making this information accessible for statistical analysis
requires computing sufficient statistics that combine information from
different database tables. Such statistics may involve any number of
positive and negative relationships. With a naive enumeration approach,
computing sufficient statistics for negative relationships is feasible only for
small databases. We solve this problem with a new dynamic programming algorithm
that performs a virtual join, where the requisite counts are computed without
materializing join tables. Contingency table algebra is a new extension of
relational algebra that facilitates the efficient implementation of this
Möbius virtual join operation. The Möbius Join scales to large datasets
(over 1M tuples) with complex schemas. Empirical evaluation with seven
benchmark datasets showed that information about the presence and absence of
links can be exploited in feature selection, association rule mining, and
Bayesian network learning.
Comment: 11 pages, 8 figures, 8 tables, CIKM'14, November 3-7, 2014, Shanghai, China
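The identity underpinning the virtual join is an inclusion-exclusion (Möbius) step over relationship indicators: ct(..., R = false) = ct(...) - ct(..., R = true), so counts for negative relationships never require materializing the complement of a join. A minimal Python sketch of that step (the dictionary representation of contingency tables is our assumption, not the paper's contingency table algebra; it requires a count for every subset of positive conditions, with the empty tuple holding the grand total):

```python
def add_negative_counts(pos_ct, relations):
    """Extend counts conditioned only on positive relationship values to
    counts over positive AND negative values, one relationship at a time,
    using ct(cond, R=false) = ct(cond) - ct(cond, R=true)."""
    ct = dict(pos_ct)
    for r in relations:
        for cond in list(ct):                      # snapshot before extending
            if not any(rel == r for rel, _ in cond):
                with_r = tuple(sorted(cond + ((r, True),)))
                without_r = tuple(sorted(cond + ((r, False),)))
                ct[without_r] = ct[cond] - ct[with_r]
    return ct

# Toy counts from joins over positive relationships only
pos = {
    (): 100,                                       # grand total
    (("Advises", True),): 30,
    (("Teaches", True),): 45,
    (("Advises", True), ("Teaches", True)): 10,
}
full = add_negative_counts(pos, ["Advises", "Teaches"])
print(full[(("Advises", False), ("Teaches", False))])  # 100 - 30 - 45 + 10 = 35
```

Processing one relationship per pass keeps every subtraction between counts that are already known, which is the dynamic-programming flavour of the Möbius Join, although the paper's algorithm organizes the computation over database tables rather than an in-memory dictionary.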
Measuring Confidentiality Risks in Census Data
Two trends have been on a collision course over the recent past. The first is the increasing demand by researchers for greater detail and flexibility in outputs from the decennial Census of Population. The second is the need felt by the Census Offices to demonstrate more clearly that Census data have been explicitly protected from the risk of disclosure of information about individuals. To reconcile these competing trends the authors propose a statistical measure of the risks of disclosure implicit in the release of aggregate census data. The ideas of risk measurement are first developed for microdata, where there is prior experience, and then modified to measure risk in tables of counts. To make sure that the theoretical ideas are fully expounded, the authors develop a small worked example. The risk measure proposed here is currently being tested with synthetic and real Census microdata. It is hoped that this approach will both refocus the census confidentiality debate and contribute to the safe use of user-defined flexible census output geographies.
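The abstract stops short of defining the measure itself. Purely for intuition, a generic small-cell summary of the kind long used in statistical disclosure control (an illustration of the genre, not the authors' proposed measure) can be written in a few lines of Python:

```python
import numpy as np

def small_cell_share(table, threshold=2):
    """Share of non-empty cells at or below a small-count threshold;
    cells of 1s and 2s are the classic disclosure worry in count tables."""
    counts = np.asarray(table).ravel()
    nonzero = counts[counts > 0]
    return float(np.mean(nonzero <= threshold)) if nonzero.size else 0.0

table = [[1, 0, 4], [2, 7, 1], [0, 3, 12]]
print(small_cell_share(table))   # 3 of the 7 non-empty cells are 1s or 2s
```

The measure developed in the paper is carried over from microdata risk assessment rather than read off raw cell sizes, but the target is the same: quantifying how exposed the small counts in flexibly defined output tables are.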
Inconsistencies in Reported Employment Characteristics among Employed Stayers
The paper deals with measurement error, and its potentially distorting role, in information on industry and professional status collected by labour force surveys. The focus of our analyses is on inconsistent information on these employment characteristics resulting from yearly transition matrices for workers who were continuously employed over the year and who did not change job. As a case study we use yearly panel data for the period from April 1993 to April 2003 collected by the Italian Quarterly Labour Force Survey. The analysis proceeds in four steps: (i) descriptive indicators of (dis)agreement; (ii) testing whether the consistency of repeated information significantly increases when the number of categories is collapsed; (iii) examination of the pattern of inconsistencies among response categories by means of Goodman's quasi-independence model; (iv) comparisons of alternative classifications. Results document sizable measurement error, which is only moderately reduced by more aggregated classifications. They suggest that even cross-section estimates of employment by industry and/or professional status are affected by non-random measurement error.
Keywords: industry, professional status, measurement errors, survey data
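Step (iii) deserves a brief unpacking: Goodman's quasi-independence model fits independence to the off-diagonal cells of a square transition matrix only, so the large mass of consistent answers on the diagonal does not drive the fit. A minimal sketch via iterative proportional fitting with the diagonal treated as structural (Python with NumPy; the function name and the fixed iteration count are ours):

```python
import numpy as np

def quasi_independence_fit(table, n_iter=500):
    """Fit Goodman's quasi-independence model to a square agreement table:
    the diagonal is reproduced exactly, while off-diagonal cells are fitted
    to independence by iterative proportional scaling of rows and columns."""
    t = np.asarray(table, float)
    k = len(t)
    off = t.copy()
    np.fill_diagonal(off, 0.0)                 # diagonal handled separately
    fit = np.ones((k, k)) - np.eye(k)          # structural zeros on diagonal
    for _ in range(n_iter):
        fit *= (off.sum(1) / np.maximum(fit.sum(1), 1e-12))[:, None]  # rows
        fit *= off.sum(0) / np.maximum(fit.sum(0), 1e-12)             # columns
    np.fill_diagonal(fit, np.diag(t))          # put observed stayers back
    return fit
```

Comparing the observed off-diagonal counts with these fitted values shows which pairs of industry or professional-status categories are confused between interviews more often than quasi-independence predicts.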