Revisiting Guerry's data: Introducing spatial constraints in multivariate analysis
Standard multivariate analysis methods aim to identify and summarize the main
structures in large data sets containing the description of a number of
observations by several variables. In many cases, spatial information is also
available for each observation, so that a map can be associated to the
multivariate data set. Two main objectives are relevant in the analysis of
spatial multivariate data: summarizing covariation structures and identifying
spatial patterns. In practice, achieving both goals simultaneously is a
statistical challenge, and a range of methods have been developed that offer
trade-offs between these two objectives. In an applied context, this
methodological question has been and remains a major issue in community
ecology, where species assemblages (i.e., covariation between species
abundances) are often driven by spatial processes (and thus exhibit spatial
patterns). In this paper we review a variety of methods developed in community
ecology to investigate multivariate spatial patterns. We present different ways
of incorporating spatial constraints in multivariate analysis and illustrate
these different approaches using the famous data set on moral statistics in
France published by André-Michel Guerry in 1833. We discuss and compare the
properties of these different approaches from both a practical and a theoretical
viewpoint.
Comment: Published at http://dx.doi.org/10.1214/10-AOAS356 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org).
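To make the trade-off between "summarizing covariation" and "capturing spatial pattern" concrete, here is a minimal sketch of one spatially constrained ordination in the MULTISPATI spirit: diagonalize the symmetrized cross-product between the data and its spatial lag, then check the spatial autocorrelation (Moran's I) of the leading component. All data, the chain neighbourhood, and the variable construction are synthetic stand-ins, not Guerry's actual data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "departments": n sites on a 1-D chain, p moral-statistics variables.
n, p = 30, 4
coords = np.arange(n)
# Spatially smooth common signal plus noise, so variables covary along space.
base = np.sin(coords / 5.0)
X = base[:, None] * rng.normal(1.0, 0.2, size=p) + 0.3 * rng.normal(size=(n, p))
X = (X - X.mean(0)) / X.std(0)           # centre and scale each variable

# Row-standardised spatial weight matrix W (here: chain neighbours).
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
W /= W.sum(1, keepdims=True)

# MULTISPATI-style criterion: eigen-decompose the symmetric part of
# X^T W X / n, which balances variance and spatial autocorrelation.
H = (X.T @ W @ X + X.T @ W.T @ X) / (2 * n)
evals, evecs = np.linalg.eigh(H)
order = np.argsort(evals)[::-1]
scores = X @ evecs[:, order[0]]          # first spatial component

# Moran's I of the first component: positive for spatially smooth scores.
z = scores - scores.mean()
moran_I = (n / W.sum()) * (z @ W @ z) / (z @ z)
print(round(float(moran_I), 3))
```

A plain PCA would maximize variance alone; the lagged cross-product above trades some variance for components that are, by construction, spatially autocorrelated.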
A Taxonomy of Big Data for Optimal Predictive Machine Learning and Data Mining
Big data comes in various ways, types, shapes, forms and sizes. Indeed,
almost all areas of science, technology, medicine, public health, economics,
business, linguistics and social science are bombarded by ever increasing flows
of data begging to be analyzed efficiently and effectively. In this paper, we
propose a rough idea of a possible taxonomy of big data, along with some of the
most commonly used tools for handling each particular category of bigness. The
dimensionality p of the input space and the sample size n are usually the main
ingredients in the characterization of data bigness. The specific statistical
machine learning technique used to handle a particular big data set will depend
on which category it falls in within the bigness taxonomy. Large p small n data
sets for instance require a different set of tools from the large n small p
variety. Among other tools, we discuss Preprocessing, Standardization,
Imputation, Projection, Regularization, Penalization, Compression, Reduction,
Selection, Kernelization, Hybridization, Parallelization, Aggregation,
Randomization, Replication, Sequentialization. Indeed, it is important to
emphasize right away that the so-called no free lunch theorem applies here, in
the sense that there is no universally superior method that outperforms all
other methods on all categories of bigness. It is also important to stress
that simplicity, in the sense of Ockham's razor principle of parsimony, tends
to reign supreme when it comes to massive data. We conclude with a comparison
of the predictive performance of some of the most commonly used methods on a
few data sets.
Comment: 18 pages, 2 figures, 3 tables
Evaluating Nonexperimental Estimators for Multiple Treatments: Evidence from Experimental Data
This paper assesses the effectiveness of unconfoundedness-based estimators of mean effects for multiple or multivalued treatments in eliminating biases arising from nonrandom treatment assignment. We evaluate these multiple treatment estimators by simultaneously equalizing average outcomes among several control groups from a randomized experiment. We study linear regression estimators as well as partial mean and weighting estimators based on the generalized propensity score (GPS). We also study the use of the GPS in assessing the comparability of individuals among the different treatment groups, and propose a strategy to determine the overlap or common support region that is less stringent than those previously used in the literature. Our results show that in the multiple treatment setting there may be treatment groups for which it is extremely difficult to find valid comparison groups, and that the GPS plays a significant role in identifying those groups. In such situations, the estimators we consider perform poorly. However, their performance improves considerably once attention is restricted to those treatment groups with adequate overlap quality, with difference-in-difference estimators performing the best. Our results suggest that unconfoundedness-based estimators are a valuable econometric tool for evaluating multiple treatments, as long as the overlap quality is satisfactory.
multiple treatments, nonexperimental estimators, generalized propensity score
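A minimal sketch of the weighting idea behind the GPS estimators discussed above, on synthetic data with three treatment arms and one confounder: group means are biased by nonrandom assignment, while inverse-probability weighting with the (here known, in practice estimated) generalized propensity score recovers the mean potential outcomes. This is an illustration of the general technique, not the paper's exact estimator.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data: 3 treatment arms, one confounder x driving assignment.
n = 5000
x = rng.normal(size=n)

# True generalized propensity scores via a softmax in x (assumed known
# here; in practice they would be estimated, e.g. by multinomial logit).
logits = np.stack([0.0 * x, 0.8 * x, -0.8 * x], axis=1)
gps = np.exp(logits) / np.exp(logits).sum(1, keepdims=True)
u = rng.random((n, 1))
t = (u > gps.cumsum(axis=1)).sum(axis=1)   # sample arm from each row of gps

# Potential outcomes: arm k adds k, plus confounding through x.
y = t + 2.0 * x + rng.normal(size=n)

# Naive group means are biased by x; inverse-probability weighting with
# the GPS recovers the mean potential outcomes E[Y(k)] = k.
naive = np.array([y[t == k].mean() for k in range(3)])
w = 1.0 / gps[np.arange(n), t]
ipw = np.array([np.sum((t == k) * w * y) / np.sum((t == k) * w)
                for k in range(3)])

print(np.round(naive, 2), np.round(ipw, 2))
```

The overlap concern in the abstract maps directly onto the weights `w`: arms whose GPS gets close to zero for many units produce extreme weights, which is exactly when these estimators perform poorly.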
Peer Effects in the Workplace: Evidence from Random Groupings in Professional Golf Tournaments
This paper uses the random assignment of playing partners in professional golf tournaments to test for peer effects in the workplace. We find no evidence that the ability of playing partners affects the performance of professional golfers, contrary to recent evidence on peer effects in the workplace from laboratory experiments, grocery scanners, and soft-fruit pickers. In our preferred specification, we can rule out peer effects larger than 0.045 strokes for a one stroke increase in playing partners' ability, and the point estimates are small and actually negative. We offer several explanations for our contrasting findings: that workers seek to avoid responding to social incentives when financial incentives are strong; that there is heterogeneity in how susceptible individuals are to social effects and that those who are able to avoid them are more likely to advance to elite professional labor markets; and that workers learn with professional experience not to be affected by social forces. We view our results as complementary to the existing studies of peer effects in the workplace and as a first step towards explaining how these social effects vary across labor markets, across individuals and with changes in the form of incentives faced. In addition to the empirical results on peer effects in the workplace, we also point out that many typical peer effects regressions are biased because individuals cannot be their own peers, and suggest a simple correction.
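The mechanical bias mentioned at the end of the abstract, and its simple correction, can be sketched on synthetic data: regressing performance on the group mean that includes the player herself produces a spurious "peer effect" even when none exists, while the leave-one-out mean of the other group members does not. The data-generating process below is hypothetical, chosen only to isolate this effect.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic groups with NO true peer effect: ability is i.i.d. and
# performance depends only on own ability.
n_groups, size = 2000, 3
ability = rng.normal(size=(n_groups, size))
perf = ability + 0.5 * rng.normal(size=(n_groups, size))

# Biased regressor: group mean ability INCLUDING the player herself.
incl_mean = ability.mean(axis=1, keepdims=True) * np.ones_like(ability)
# Corrected regressor: leave-one-out mean of the other group members.
loo_mean = (ability.sum(axis=1, keepdims=True) - ability) / (size - 1)

def slope(x, y):
    """OLS slope of y on x (with intercept)."""
    x, y = x.ravel(), y.ravel()
    return float(np.cov(x, y)[0, 1] / np.var(x, ddof=1))

incl_slope = slope(incl_mean, perf)   # spuriously positive (own ability leaks in)
loo_slope = slope(loo_mean, perf)     # near zero: no true peer effect
print(round(incl_slope, 3), round(loo_slope, 3))
```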
Temporal-varying failures of nodes in networks
We consider networks in which random walkers are removed because of the
failure of specific nodes. We interpret the rate of loss as a measure of the
importance of nodes, a notion we denote as failure-centrality. We show that the
degree of the node is not sufficient to determine this measure and that, in a
first approximation, the shortest loops through the node have to be taken into
account. We propose approximations of the failure-centrality which are valid
for temporal-varying failures and we dwell on the possibility of externally
changing the relative importance of nodes in a given network, by exploiting the
interference between the loops of a node and the cycles of the temporal pattern
of failures. In the limit of long failure cycles we show analytically that the
escape in a node is larger than the one estimated from a stochastic failure
with the same failure probability. We test our general formalism in two
real-world networks (air-transportation and e-mail users) and show how
communities lead to deviations from predictions for failures in hubs.
Comment: 7 pages, 3 figures
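A minimal sketch of the quantity the abstract calls failure-centrality, for the simplest case of a single permanently failed node: walkers stepping onto the failed node are removed, giving substochastic dynamics, and the per-step loss rate at quasi-stationarity measures the node's importance. The 5-node graph is hypothetical, and this is the static limit of the temporal-failure setting, not the paper's full formalism.

```python
import numpy as np

# Small undirected graph given by an adjacency matrix (hypothetical).
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 1, 0],
              [1, 1, 0, 1, 1],
              [0, 1, 1, 0, 1],
              [0, 0, 1, 1, 0]], dtype=float)
P = A / A.sum(axis=1, keepdims=True)     # random-walk transition matrix

def failure_centrality(P, node, steps=200):
    """Asymptotic per-step loss rate of walkers when `node` has failed:
    walkers stepping onto it are removed (substochastic dynamics)."""
    Q = P.copy()
    Q[:, node] = 0.0                     # walkers entering `node` are lost
    rho = np.ones(len(P)) / len(P)       # uniform initial distribution
    for _ in range(steps):
        new = rho @ Q
        rho = new / new.sum()            # renormalize over survivors
    return 1.0 - float((rho @ Q).sum())  # loss rate at quasi-stationarity

fc = [failure_centrality(P, i) for i in range(len(P))]
print(np.round(fc, 3))
```

Note that nodes of equal degree need not have equal loss rates, since the rate depends on how much probability the surviving walk keeps near the failed node (its short loops), consistent with the abstract's point that degree alone is not sufficient.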
Evaluating the methodology of social experiments
Welfare; Econometric models
Indirect effects of an aid program: how do liquidity injections affect non-eligibles' consumption?
Aid programs in developing countries are likely to affect both the treated and the non-treated households living in the targeted areas. Studies that focus on the treatment effect on the treated may fail to capture important spillover effects. We exploit the unique design of an aid program's experimental trial to identify its indirect effect on consumption for non-eligible households living in treated areas. We find that this effect is positive, and that it occurs through changes in the insurance and credit markets: non-eligible households receive more transfers, and borrow more when hit by a negative idiosyncratic shock, because of the program liquidity injection; thus they can reduce their precautionary savings. We also test for general equilibrium effects in the local labor and goods markets; we find no significant changes in labor income and prices, while there is a reduction in earnings from sales of agricultural products, which are now consumed rather than sold. We show that this class of aid programs has important positive externalities; thus their overall effect is larger than the effect on the treated. Our results confirm that a key identifying assumption - that the treatment has no effect on the non-treated - is likely to be violated in similar policy designs.
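The violated identifying assumption can be illustrated with a small synthetic simulation (all numbers hypothetical): when non-eligible households in treated villages receive a spillover, using them as controls contaminates the comparison and understates the direct effect, while comparing against households in untreated villages does not.

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical villages: half treated. Within treated villages, eligible
# households get the transfer (direct effect) and non-eligibles receive
# a spillover through local transfers and credit.
n_villages, hh_per = 500, 20
treated_village = np.repeat(rng.random(n_villages) < 0.5, hh_per)
eligible = rng.random(n_villages * hh_per) < 0.5

direct, spillover = 2.0, 0.5
consumption = (rng.normal(10.0, 1.0, n_villages * hh_per)
               + direct * (treated_village & eligible)
               + spillover * (treated_village & ~eligible))

# Contaminated contrast: eligibles vs non-eligibles in the SAME treated
# villages -- the spillover to "controls" understates the direct effect.
naive = (consumption[treated_village & eligible].mean()
         - consumption[treated_village & ~eligible].mean())
# Clean contrast: against eligible households in untreated villages.
clean = (consumption[treated_village & eligible].mean()
         - consumption[~treated_village & eligible].mean())

print(round(naive, 2), round(clean, 2))   # naive ~ 1.5, clean ~ 2.0
```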
On the one-dimensional cubic nonlinear Schrödinger equation below L^2
In this paper, we review several recent results concerning well-posedness of
the one-dimensional, cubic nonlinear Schrödinger equation (NLS) on the real
line R and on the circle T for solutions below the L^2-threshold. We point out
common results for NLS on R and the so-called "Wick ordered NLS" (WNLS) on T,
suggesting that WNLS may be an appropriate model for the study of solutions
below L^2(T). In particular, in contrast with a recent result of Molinet who
proved that the solution map for the periodic cubic NLS equation is not weakly
continuous from L^2(T) to the space of distributions, we show that this is not
the case for WNLS.
Comment: 14 pages, additional references
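For reference, a sketch of the two equations involved, up to sign and normalization conventions that vary across papers:

```latex
% Cubic NLS on the line R or the circle T (focusing/defocusing sign):
\[
  i\,\partial_t u + \partial_x^2 u \pm |u|^2 u = 0, \qquad u(0,x) = u_0(x).
\]
% Wick ordered NLS (WNLS) on T subtracts twice the spatial mean of |u|^2
% from the nonlinearity, removing a problematic resonant contribution:
\[
  i\,\partial_t u + \partial_x^2 u
    \pm \Bigl( |u|^2 - \frac{1}{\pi} \int_{\mathbb{T}} |u|^2 \, dx \Bigr) u = 0.
\]
```

On L^2(T) the two flows agree up to an explicit gauge transformation, which is why WNLS is a natural substitute model below the L^2 threshold, where that gauge factor no longer makes sense.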