
    On crowdsourcing relevance magnitudes for information retrieval evaluation

    Magnitude estimation is a psychophysical scaling technique for the measurement of sensation, where observers assign numbers to stimuli in response to their perceived intensity. We investigate the use of magnitude estimation for judging the relevance of documents for information retrieval evaluation, carrying out a large-scale user study across 18 TREC topics and collecting over 50,000 magnitude estimation judgments using crowdsourcing. Our analysis shows that magnitude estimation judgments can be reliably collected using crowdsourcing, are competitive in terms of assessor cost, and are, on average, rank-aligned with ordinal judgments made by expert relevance assessors. We explore the application of magnitude estimation for IR evaluation, calibrating two gain-based effectiveness metrics, nDCG and ERR, directly from user-reported perceptions of relevance. A comparison of TREC system effectiveness rankings based on binary, ordinal, and magnitude estimation relevance shows substantial variation; in particular, the top systems ranked using magnitude estimation and ordinal judgments differ substantially. Analysis of the magnitude estimation scores shows that this effect is due in part to varying perceptions of relevance: different users have different perceptions of the impact of relative differences in document relevance. These results have direct implications for IR evaluation, suggesting that current assumptions about a single view of relevance being sufficient to represent a population of users are unlikely to hold.
    Maddalena, Eddy; Mizzaro, Stefano; Scholer, Falk; Turpin, Andrew
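    The gain-based metrics mentioned in the abstract can be computed from any non-negative gain values. As a minimal illustrative sketch (not the paper's own calibration), here is nDCG over a ranked list of gains, where the gains would stand in for normalized magnitude estimation scores:

    ```python
    import math

    def ndcg(gains, k=None):
        """nDCG over a ranked list of gain values: discounted cumulative
        gain, normalized by the DCG of the ideal (descending) ordering."""
        k = k or len(gains)
        def dcg(gs):
            # rank positions are 1-based, so position i uses discount log2(i + 1)
            return sum(g / math.log2(i + 2) for i, g in enumerate(gs[:k]))
        ideal = dcg(sorted(gains, reverse=True))
        return dcg(gains) / ideal if ideal > 0 else 0.0
    ```

    A perfectly ordered list scores 1.0; placing high-gain documents lower in the ranking reduces the score.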

    Judgments of effort exerted by others are influenced by received rewards

    Estimating invested effort is a core dimension for evaluating one's own and others’ actions, and views on the relationship between effort and rewards are deeply ingrained in various societal attitudes. Internal representations of effort, however, are inherently noisy, e.g. due to the variability of sensorimotor and visceral responses to physical exertion. The uncertainty in effort judgments is further aggravated when there is no direct access to the internal representations of exertion – such as when estimating the effort of another person. Bayesian cue integration suggests that this uncertainty can be resolved by incorporating additional cues that are predictive of effort, e.g. received rewards. We hypothesized that judgments about the effort spent on a task will be influenced by the magnitude of received rewards. Additionally, we surmised that such influence might further depend on individual beliefs regarding the relationship between hard work and prosperity, as exemplified by a conservative work ethic. To test these predictions, participants performed an effortful task interleaved with a partner and were informed about the obtained reward before rating either their own or the partner’s effort. We show that higher rewards led to higher estimations of exerted effort in self-judgments, and this effect was even more pronounced for other-judgments. In both types of judgment, computational modelling revealed that reward information and sensorimotor markers of exertion were combined in a Bayes-optimal manner in order to reduce uncertainty. Remarkably, the extent to which rewards influenced effort judgments was associated with conservative world-views, indicating links between this phenomenon and general beliefs about the relationship between effort and earnings in society.
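    Bayes-optimal cue integration has a standard closed form for Gaussian cues: each cue's mean is weighted by its precision (inverse variance). A minimal sketch of that textbook rule (the function name and inputs are illustrative assumptions, not the study's fitted model):

    ```python
    def combine_cues(mu_sensory, var_sensory, mu_reward, var_reward):
        """Precision-weighted (Bayes-optimal) fusion of two Gaussian cues.

        Each cue's mean is weighted by its precision (inverse variance);
        the fused variance is smaller than either input variance, which is
        how integrating a reward cue reduces uncertainty about effort."""
        w_s = 1.0 / var_sensory   # precision of the sensorimotor cue
        w_r = 1.0 / var_reward    # precision of the reward cue
        mu = (w_s * mu_sensory + w_r * mu_reward) / (w_s + w_r)
        var = 1.0 / (w_s + w_r)
        return mu, var
    ```

    With equal variances the fused estimate is the simple average; a more precise reward cue pulls the estimate toward the reward-predicted effort, matching the reported reward influence on judgments.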

    Crowdsourcing Relevance: Two Studies on Assessment

    Crowdsourcing has become an alternative approach to collecting relevance judgments at large scale. In this thesis, we focus on specific aspects related to time, scale, and agreement. First, we address the time factor in gathering relevance labels: we study how much time judges need to assess documents. We conduct a series of four experiments which unexpectedly reveal that introducing time limitations improves the quality of the results. Furthermore, we discuss strategies for determining the right amount of time to make available to workers for relevance assessment, in order to both guarantee high-quality results and save the valuable resources of time and money. We then explore the application of magnitude estimation, a psychophysical scaling technique for the measurement of sensation, to relevance assessment. We conduct a large-scale user study across 18 TREC topics, collecting more than 50,000 magnitude estimation judgments, which turn out to be, on the whole, rank-aligned with ordinal judgments made by expert relevance assessors. We discuss the benefits, the reliability of the collected judgments, and their competitiveness in terms of assessor cost. We also report some preliminary results on the agreement among judges. Often, the results of crowdsourcing experiments are affected by noise that can be ascribed to lack of agreement among workers. This aspect should be considered, as it can affect the reliability of the gathered relevance labels, as well as the overall repeatability of the experiments.
    Dottorato di ricerca in Informatica e scienze matematiche e fisiche
    Maddalena, Eddy

    Unbiased Comparative Evaluation of Ranking Functions

    Eliciting relevance judgments for ranking evaluation is labor-intensive and costly, motivating careful selection of which documents to judge. Unlike traditional approaches that make this selection deterministically, probabilistic sampling has shown intriguing promise since it enables the design of estimators that are provably unbiased even when reusing data with missing judgments. In this paper, we first unify and extend these sampling approaches by viewing the evaluation problem as a Monte Carlo estimation task that applies to a large number of common IR metrics. Drawing on the theoretical clarity that this view offers, we tackle three practical evaluation scenarios: comparing two systems, comparing k systems against a baseline, and ranking k systems. For each scenario, we derive an estimator and a variance-optimizing sampling distribution while retaining the strengths of sampling-based evaluation, including unbiasedness, reusability despite missing data, and ease of use in practice. In addition to the theoretical contribution, we empirically evaluate our methods against previously used sampling heuristics and find that they generally cut the number of required relevance judgments at least in half.Comment: Under review; 10 pages
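    Sampling-based evaluation of this kind typically rests on inverse-propensity (Horvitz-Thompson) weighting: judge each document with a known probability, then weight each judged relevance by the inverse of that probability, so the estimate is unbiased despite the missing judgments. A minimal sketch under simplified assumptions (independent Bernoulli sampling, and a metric that is a plain sum of relevances; not the paper's variance-optimized design):

    ```python
    import random

    def ht_estimate(relevances, probs, seed=0):
        """Horvitz-Thompson estimate of the total relevance of a ranking.

        Each document i is judged with known inclusion probability p_i;
        a judged relevance is weighted by 1/p_i, which makes the estimate
        unbiased even though most judgments are missing."""
        rng = random.Random(seed)
        total = 0.0
        for rel, p in zip(relevances, probs):
            if rng.random() < p:      # document i happens to be judged
                total += rel / p      # inverse-propensity weight
        return total
    ```

    Averaging the estimate over many independent samples converges to the true total, which is the unbiasedness property the abstract refers to; the paper's contribution is choosing the probabilities to minimize the variance of such estimates.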

    To aid or not to aid: Foreign aid and productivity in cross-country regressions

    The paper empirically reexamines the robustness of competing theories of foreign aid effectiveness. By shifting the focus from the effects of aid on income to the effects of aid on productivity, it is possible to put to the test three existing theories of foreign aid effectiveness. The results provide support for the hypotheses that (i) aid has a positive effect in fostering growth of average productivity, (ii) aid does not operate with diminishing returns, and (iii) the magnitude of the total effect depends on climate-related circumstances. The results support the policy recommendation previously made in the literature to seriously reconsider the conditionality rule for foreign aid disbursements.
    Foreign aid, cross-country, conditionality

    Dim galaxies and outer halos of galaxies missed by 2MASS? The near-infrared luminosity function and density

    Using high-resolution, deep Ks-band observations of early-type galaxies in the nearby Universe and of a cluster at z=0.3, we show that the two luminosity functions (LFs) of the local universe derived from 2MASS data miss a fair fraction of the flux of the galaxies (more than 20 to 30%) and a whole population of galaxies with central brightness fainter than the isophote used for detection, but bright enough to be included in the published LFs. In particular, the fraction of lost flux increases as the galaxy surface brightness becomes fainter. Therefore, the LF slopes, characteristic luminosity, and luminosity density derived so far are underestimated. Other published near-infrared LFs miss flux in general, including the LF of the distant field computed in a 3 arcsec aperture.Comment: A&A in press

    Do the Measurements of Financial Market Inflation Expectations Yield Relevant Macroeconomic Information?

    Monthly data concerning the inflation expectations of financial analysts in the Czech Republic exhibit a tendency toward bias and ineffectiveness. This paper analyses, from a macroeconomic perspective, whether the surveyed data include any relevant macroeconomic information; specifically, whether the surveyed expectations correspond to the market expectations considered in macroeconomic analysis and models. Using a methodology based on a simple Fisher rule, it is found that the difference between the surveyed and market inflation expectations is not statistically significant. From this perspective, it is concluded that the surveyed inflation expectations bear economically relevant information.
    Market inflation expectations, surveyed inflation expectations, Fisher rule
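    The Fisher rule mentioned above relates a nominal interest rate to a real rate and expected inflation, which is what lets market inflation expectations be backed out from observed rates. An illustrative sketch of the textbook relation only (the paper's exact specification is not reproduced here):

    ```python
    def implied_inflation_expectation(nominal_rate, real_rate):
        """Back out expected inflation from the exact Fisher relation
        (1 + i) = (1 + r) * (1 + pi), i.e. pi = (1 + i) / (1 + r) - 1.
        For small rates this is close to the familiar approximation i - r."""
        return (1.0 + nominal_rate) / (1.0 + real_rate) - 1.0
    ```

    For example, a 5% nominal rate against a 2% real rate implies expected inflation of roughly 2.9%, close to the i − r approximation of 3%.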

    Assessing the Magnitude of the Concentration Parameter in a Simultaneous Equations Model

    Poskitt and Skeels (2003) provide a new approximation to the sampling distribution of the IV estimator in a simultaneous equations model. This approximation is appropriate when the concentration parameter associated with the reduced-form model is small, and a basic purpose of this paper is to provide the practitioner with a method of ascertaining when the concentration parameter is small, and hence when use of the Poskitt and Skeels (2003) approximation is appropriate. Existing procedures tend to focus on the notion of correlation and hypothesis testing. Approaching the problem from a different perspective leads us to advocate a different statistic for this problem. We provide exact and approximate distribution theory for the proposed statistic and show that it satisfies various optimality criteria not satisfied by some of its competitors. Rather than adopting a testing approach, we suggest the use of p-values as a calibration device.
    Concentration parameter, simultaneous equations model, alienation coefficient, Wilks-lambda distribution, admissible invariant test