On crowdsourcing relevance magnitudes for information retrieval evaluation
Magnitude estimation is a psychophysical scaling technique for the measurement of sensation, where observers assign numbers to stimuli in response to their perceived intensity. We investigate the use of magnitude estimation for judging the relevance of documents for information retrieval evaluation, carrying out a large-scale user study across 18 TREC topics and collecting over 50,000 magnitude estimation judgments using crowdsourcing. Our analysis shows that magnitude estimation judgments can be reliably collected using crowdsourcing, are competitive in terms of assessor cost, and are, on average, rank-aligned with ordinal judgments made by expert relevance assessors. We explore the application of magnitude estimation for IR evaluation, calibrating two gain-based effectiveness metrics, nDCG and ERR, directly from user-reported perceptions of relevance. A comparison of TREC system effectiveness rankings based on binary, ordinal, and magnitude estimation relevance shows substantial variation; in particular, the top systems ranked using magnitude estimation and ordinal judgments differ substantially. Analysis of the magnitude estimation scores shows that this effect is due in part to varying perceptions of relevance: different users have different perceptions of the impact of relative differences in document relevance. These results have direct implications for IR evaluation, suggesting that current assumptions about a single view of relevance being sufficient to represent a population of users are unlikely to hold.
Authors: Maddalena, Eddy; Mizzaro, Stefano; Scholer, Falk; Turpin, Andrew
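As a rough illustration of what a gain-based metric driven by magnitude scores can look like, the sketch below computes nDCG with per-document gains taken directly from magnitude estimation scores. The score values, the function names, and the log2 rank discount are assumptions made for this example; the abstract does not describe the paper's actual normalisation or calibration procedure.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with a log2 rank discount (assumed here)."""
    return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))


def ndcg(ranked_gains, all_gains, k=None):
    """nDCG@k: DCG of the system ranking divided by DCG of the ideal ranking.

    ranked_gains: gains of the documents in the order the system returned them.
    all_gains:    gains of every judged document for the topic.
    """
    if k is None:
        k = len(ranked_gains)
    ideal = sorted(all_gains, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    if ideal_dcg == 0:
        return 0.0  # no document has positive gain, so the metric is undefined
    return dcg(ranked_gains[:k]) / ideal_dcg


# Hypothetical magnitude estimation scores for three judged documents.
magnitude_gain = {"d1": 90.0, "d2": 40.0, "d3": 5.0}

# A hypothetical system ranking that places the weakest document first.
system_ranking = ["d3", "d1", "d2"]
score = ndcg([magnitude_gain[d] for d in system_ranking],
             list(magnitude_gain.values()))
print(f"nDCG with magnitude gains: {score:.3f}")
```

Because the gains are unbounded ratio-scale numbers rather than a small set of ordinal levels, a single highly rated document can dominate the score, which is one way that rankings based on magnitude and ordinal judgments can diverge.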
Judgments of effort exerted by others are influenced by received rewards
Estimating invested effort is a core dimension for evaluating one's own and others’ actions, and views on the relationship between effort and rewards are deeply ingrained in various societal attitudes. Internal representations of effort, however, are inherently noisy, e.g. due to the variability of sensorimotor and visceral responses to physical exertion. The uncertainty in effort judgments is further aggravated when there is no direct access to the internal representations of exertion, such as when estimating the effort of another person. Bayesian cue integration suggests that this uncertainty can be resolved by incorporating additional cues that are predictive of effort, e.g. received rewards. We hypothesized that judgments about the effort spent on a task will be influenced by the magnitude of received rewards. Additionally, we surmised that such influence might further depend on individual beliefs regarding the relationship between hard work and prosperity, as exemplified by a conservative work ethic. To test these predictions, participants performed an effortful task interleaved with a partner and were informed about the obtained reward before rating either their own or the partner’s effort. We show that higher rewards led to higher estimations of exerted effort in self-judgments, and this effect was even more pronounced for other-judgments. In both types of judgment, computational modelling revealed that reward information and sensorimotor markers of exertion were combined in a Bayes-optimal manner in order to reduce uncertainty. Remarkably, the extent to which rewards influenced effort judgments was associated with conservative world-views, indicating links between this phenomenon and general beliefs about the relationship between effort and earnings in society.
Crowdsourcing Relevance: Two Studies on Assessment
Crowdsourcing has become an alternative approach to collect relevance judgments at large scale. In this thesis, we focus on some specific aspects related to time, scale, and agreement.
First, we address the time factor in gathering relevance labels: we study how much time judges need to assess documents. We conduct a series of four experiments which unexpectedly reveal that introducing time limits improves the quality of the results. Furthermore, we discuss strategies for determining the right amount of time to give workers for relevance assessment, in order both to guarantee the high quality of the gathered results and to save time and money.
Then we explore the application of magnitude estimation, a psychophysical scaling technique for the measurement of sensation, to relevance assessment. We conduct a large-scale user study across 18 TREC topics, collecting more than 50,000 magnitude estimation judgments, which turn out to be, overall, rank-aligned with ordinal judgments made by expert relevance assessors. We discuss the benefits, the reliability of the judgments collected, and the competitiveness in terms of assessor cost.
We also report some preliminary results on the agreement among judges. Often, the results of crowdsourcing experiments are affected by noise that can be ascribed to a lack of agreement among workers. This aspect should be considered, as it can affect the reliability of the gathered relevance labels as well as the overall repeatability of the experiments.
PhD programme in Computer Science and Mathematical and Physical Sciences
Author: Maddalena, Edd
Unbiased Comparative Evaluation of Ranking Functions
Eliciting relevance judgments for ranking evaluation is labor-intensive and
costly, motivating careful selection of which documents to judge. Unlike
traditional approaches that make this selection deterministically,
probabilistic sampling has shown intriguing promise since it enables the design
of estimators that are provably unbiased even when reusing data with missing
judgments. In this paper, we first unify and extend these sampling approaches
by viewing the evaluation problem as a Monte Carlo estimation task that applies
to a large number of common IR metrics. Drawing on the theoretical clarity that
this view offers, we tackle three practical evaluation scenarios: comparing two
systems, comparing systems against a baseline, and ranking systems. For
each scenario, we derive an estimator and a variance-optimizing sampling
distribution while retaining the strengths of sampling-based evaluation,
including unbiasedness, reusability despite missing data, and ease of use in
practice. In addition to the theoretical contribution, we empirically evaluate
our methods against previously used sampling heuristics and find that they
generally cut the number of required relevance judgments at least in half.
Comment: Under review; 10 page
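To make the Monte Carlo view concrete, the sketch below estimates an additive metric (a sum of rank weights times relevance) from a small random sample of judged documents, using a textbook importance-sampling estimator. It is not one of the estimators or variance-optimizing distributions derived in the paper; the weights, relevance labels, sampling distribution `q`, and function names are all assumptions made for illustration.

```python
import math
import random


def true_metric(weights, relevance):
    """Metric value when every document has been judged."""
    return sum(w * r for w, r in zip(weights, relevance))


def sampled_estimate(weights, relevance, q, n, rng):
    """Importance-sampling estimate from n documents drawn with replacement.

    Each sampled document d contributes weights[d] * relevance[d] / q[d].
    Dividing by the sampling probability q[d] is what makes the estimate
    unbiased, provided q[d] > 0 wherever weights[d] > 0.
    """
    docs = rng.choices(range(len(weights)), weights=q, k=n)
    return sum(weights[d] * relevance[d] / q[d] for d in docs) / n


# Hypothetical ranking of 10 documents with DCG-style rank weights.
weights = [1.0 / math.log2(rank + 1) for rank in range(1, 11)]
relevance = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]  # assumed binary judgments
q = [0.1] * 10  # uniform sampling distribution (an arbitrary choice)

rng = random.Random(0)
estimates = [sampled_estimate(weights, relevance, q, n=5, rng=rng)
             for _ in range(4000)]
print("true metric:      ", round(true_metric(weights, relevance), 3))
print("mean of estimates:", round(sum(estimates) / len(estimates), 3))
```

Each individual estimate uses only five judgments and is noisy, but its expectation equals the fully judged metric. Choosing `q` to concentrate on documents that matter most for the comparison at hand is the kind of variance reduction the abstract refers to.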
To aid or not to aid: Foreign aid and productivity in cross-country regressions
The paper empirically reexamines the robustness of competing theories of foreign aid effectiveness. By shifting the focus from the effects of aid on income to the effects of aid on productivity, it is possible to put three existing theories of foreign aid effectiveness to the test. The results provide support for the hypotheses that (i) aid has a positive effect in fostering growth of average productivity, (ii) aid does not operate with diminishing returns, and (iii) the magnitude of the total effect depends on climate-related circumstances. The results support the policy recommendation previously made in the literature to seriously reconsider the conditionality rule for foreign aid disbursements.
Keywords: Foreign Aid, cross-country, conditionality
Dim galaxies and outer halos of galaxies missed by 2MASS? The near-infrared luminosity function and density
By using high-resolution and deep Ks band observations of early-type galaxies
of the nearby Universe and of a cluster at z=0.3 we show that the two
luminosity functions (LFs) of the local universe derived from 2MASS data miss a
fair fraction of the flux of the galaxies (more than 20 to 30%) and a whole
population of galaxies of central brightness fainter than the isophote used for
detection, but bright enough to be included in the published LFs. In
particular, the fraction of lost flux increases as the galaxy surface
brightness becomes fainter. Therefore, the LF slopes and characteristic
luminosity derived so far, as well as the luminosity density, are underestimated.
Other published near-infrared LFs miss flux in general, including the LF of the
distant field computed in a 3 arcsec aperture.
Comment: A&A in pres
Do the Measurements of Financial Market Inflation Expectations Yield Relevant Macroeconomic Information?
Monthly data concerning the inflation expectations of financial analysts in the Czech Republic exhibit a tendency for bias and ineffectiveness. This paper analyses, from a macroeconomic perspective, whether the surveyed data include any relevant macroeconomic information, specifically, whether the surveyed expectations correspond to market expectations considered in macroeconomic analysis and models. Using a methodology based on a simple Fisher rule, it is found that the difference between the surveyed and market inflation expectations is not statistically significant. From this perspective, it is concluded that the surveyed inflation expectations bear economically relevant information.
Keywords: market inflation expectations, surveyed inflation expectations, Fisher rule
Assessing the Magnitude of the Concentration Parameter in a Simultaneous Equations Model
Poskitt and Skeels (2003) provide a new approximation to the sampling distribution of the IV estimator in a simultaneous equations model. This approximation is appropriate when the concentration parameter associated with the reduced form model is small. A basic purpose of this paper is to provide the practitioner with a method of ascertaining when the concentration parameter is small, and hence when the use of the Poskitt and Skeels (2003) approximation is appropriate. Existing procedures tend to focus on the notion of correlation and hypothesis testing. Approaching the problem from a different perspective leads us to advocate a different statistic for use in this problem. We provide exact and approximate distribution theory for the proposed statistic and show that it satisfies various optimality criteria not satisfied by some of its competitors. Rather than adopting a testing approach, we suggest the use of p-values as a calibration device.
Keywords: Concentration parameter, simultaneous equations model, alienation coefficient, Wilks-lambda distribution, admissible invariant test.