
    Preliminary results from a crowdsourcing experiment in immunohistochemistry

    Background: Crowdsourcing, i.e., the outsourcing of tasks typically performed by a few experts to a large crowd as an open call, has proved reasonably effective in many cases, such as Wikipedia and the 1999 chess match of Kasparov against the world. The aim of the present paper is to describe the setup of an experiment applying crowdsourcing techniques to the quantification of immunohistochemistry. Methods: Fourteen images from MIB1-stained breast specimens were first manually counted by a pathologist, then submitted to a crowdsourcing platform through a specifically developed application. Ten positivity evaluations for each image were collected and summarized using their median. The positivity values were then compared to the gold standard provided by the pathologist by means of Spearman correlation. Results: There were 28 contributors in total, who evaluated 4.64 images each on average. Spearman correlation between gold-standard and crowdsourced positivity percentages is 0.946 (p < 0.001). Conclusions: The aim of the experiment was to understand how to use crowdsourcing for an image analysis task that is currently time-consuming when done by human experts. Crowdsourced work can be used in various ways, in particular by statistically aggregating data to reduce identification errors. However, in this preliminary experiment we considered only the most basic indicator, the median positivity percentage, which provided good overall results. This method may be better suited to research than to routine practice: when a large number of images need ad-hoc evaluation, crowdsourcing may offer a quick answer. © Della Mea et al.
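    The aggregation and evaluation step described above can be sketched as follows. The per-image counts and gold values here are invented for illustration, and Spearman's rho is implemented from scratch (rank transform followed by Pearson correlation of the ranks) rather than taken from a statistics library:

```python
from statistics import median

def rankdata(xs):
    # Assign 1-based ranks, averaging ranks over ties.
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    # Spearman's rho = Pearson correlation computed on the ranks.
    rx, ry = rankdata(xs), rankdata(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

# Hypothetical data: 10 crowd positivity estimates per image, reduced
# to their median, then correlated with the pathologist's gold standard.
crowd_counts = [[12, 15, 14, 13, 15, 16, 12, 14, 15, 13],
                [40, 42, 38, 41, 45, 39, 40, 43, 42, 41],
                [70, 68, 72, 71, 69, 75, 70, 73, 68, 71]]
gold = [14.0, 41.5, 70.5]
medians = [median(c) for c in crowd_counts]
rho = spearman(medians, gold)
```

    The median makes the aggregate robust to individual contributors who over- or under-count a given image.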

    Video test collection with graded relevance assessments

    Relevance is a complex but core concept within the field of Information Retrieval. In order to allow system comparisons, the many factors that influence relevance are often discarded in favour of abstraction to a single relevance score, which means that a great wealth of information is lost. In this paper we outline the creation of a video test collection with graded relevance assessments, to the best of our knowledge the first such test collection for video retrieval. To directly address the shortcoming above, we also gathered behavioural and perceptual data from assessors during the assessment process. All of this information, along with the judgements, is available for download. Our intention is to allow other researchers to supplement the judgements, creating an adaptive test collection that contains supplementary information rather than a completely static collection with binary judgements.

    PACRR: A Position-Aware Neural IR Model for Relevance Matching

    In order to adopt deep learning for information retrieval, models are needed that can capture all relevant information required to assess the relevance of a document to a given user query. While previous work has successfully captured unigram term matches, how to fully exploit position-dependent information such as proximity and term dependencies has been insufficiently explored. In this work, we propose a novel neural IR model named PACRR, aimed at better modeling position-dependent interactions between a query and a document. Extensive experiments on six years' TREC Web Track data confirm that the proposed model yields better results under multiple benchmarks.
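    The core idea, scoring position-dependent interactions on a query-document matrix, can be illustrated with a minimal sketch. Exact-match similarity and max-pooling stand in for PACRR's embedding similarities and convolutional layers, and the vocabulary is invented:

```python
def match_matrix(query, doc):
    # Binary |query| x |doc| interaction matrix: 1.0 on an exact term match.
    return [[1.0 if q == d else 0.0 for d in doc] for q in query]

def unigram_score(mat):
    # Max-pool each query-term row: did the term appear anywhere at all?
    return sum(max(row) for row in mat)

def bigram_score(mat):
    # Max-pool 2x2 diagonal windows: consecutive query terms matching
    # consecutive document terms, i.e. proximity evidence that pure
    # unigram matching cannot see.
    score = 0.0
    for i in range(len(mat) - 1):
        vals = [(mat[i][j] + mat[i + 1][j + 1]) / 2
                for j in range(len(mat[0]) - 1)]
        score += max(vals, default=0.0)
    return score

query = ["neural", "ranking"]
doc_a = ["neural", "ranking", "models", "for", "search"]  # adjacent match
doc_b = ["ranking", "of", "neural", "networks"]           # terms present, not adjacent
ma, mb = match_matrix(query, doc_a), match_matrix(query, doc_b)
```

    Both documents contain both query terms, so their unigram scores tie; only the position-aware bigram score separates the document that preserves the query's order and proximity.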

    Image Based Model for Document Search and Re-ranking

    Traditional Web search engines do not use the images in web pages to find relevant documents for a given query. Instead, they typically operate by computing a measure of agreement between the keywords provided by the user and only the text portion of each web page. This project investigates whether the image content appearing in a Web page can be used to enhance the semantic description of that page and thereby improve the performance of a keyword-based search engine. A Web-scalable system is presented that first uses a pure text-based search engine to retrieve an initial set of candidate documents for a given query, and then re-ranks the candidate set using visual information extracted from the images contained in those pages. The resulting system maintains the computational efficiency of traditional text-based search engines, with only the small additional storage cost needed to precompute the visual information.
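    The two-stage pipeline above amounts to a simple score fusion over the text engine's candidate list. The document IDs, scores, and the mixing weight `alpha` below are invented for illustration:

```python
def rerank(candidates, visual_score, alpha=0.8):
    """Re-rank a text-retrieved candidate list by mixing in a precomputed
    visual score for each page. alpha weights the original text score;
    all names and weights here are illustrative, not the paper's."""
    rescored = [(alpha * text + (1 - alpha) * visual_score.get(doc, 0.0), doc)
                for doc, text in candidates]
    return [doc for _, doc in sorted(rescored, reverse=True)]

# Hypothetical candidate set from the text engine: (doc_id, text_score).
candidates = [("d1", 0.90), ("d2", 0.88), ("d3", 0.50)]
# Hypothetical precomputed visual relevance of each page's images.
visual = {"d1": 0.1, "d2": 0.9, "d3": 0.7}
reranked = rerank(candidates, visual)
```

    Because the visual scores are precomputed offline, the only query-time overhead is the lookup and the linear combination, which is how the sketch preserves the text engine's efficiency.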

    Transitivity, Time Consumption, and Quality of Preference Judgments in Crowdsourcing

    Preference judgments have been demonstrated to be a better alternative to graded judgments for assessing the relevance of documents relative to queries. Existing work has verified transitivity among preference judgments collected from trained judges, which reduces the number of judgments needed dramatically. Moreover, both strict preference judgments and weak preference judgments, where the latter additionally allow judges to state that two documents are equally relevant for a given query, are widely used in the literature. However, it remains unclear whether transitivity still holds when judgments are collected via crowdsourcing, and whether the two kinds of preference judgments behave similarly. In this work, we collect judgments from multiple judges using a crowdsourcing platform and aggregate them to compare the two kinds of preference judgments in terms of transitivity, time consumption, and quality. That is, we look into whether aggregated judgments are transitive, how long it takes judges to make them, and whether judges agree with each other and with judgments from TREC. Our key finding is that only strict preference judgments are transitive; weak preference judgments behave differently in terms of transitivity, time consumption, and quality of judgment.
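    A minimal sketch of aggregating crowd preference votes and checking transitivity over document triples. The votes and document IDs are invented, and simple majority voting stands in for whatever aggregation scheme the study actually uses:

```python
from collections import Counter
from itertools import permutations

def aggregate(judgments):
    # judgments maps an ordered document pair to a list of crowd votes,
    # each "A" (first preferred), "B" (second preferred), or "tie"
    # (ties only occur under weak preference judgments).
    return {pair: Counter(votes).most_common(1)[0][0]
            for pair, votes in judgments.items()}

def prefers(agg, a, b):
    # True if the aggregated judgment strictly prefers a over b.
    if (a, b) in agg:
        return agg[(a, b)] == "A"
    if (b, a) in agg:
        return agg[(b, a)] == "B"
    return False

def is_transitive(agg, docs):
    # Transitivity: a > b and b > c must imply a > c for every triple.
    for a, b, c in permutations(docs, 3):
        if prefers(agg, a, b) and prefers(agg, b, c) and not prefers(agg, a, c):
            return False
    return True

# Hypothetical strict judgments over three documents, five votes per pair.
judgments = {("d1", "d2"): ["A", "A", "B", "A", "A"],
             ("d2", "d3"): ["A", "B", "A", "A", "A"],
             ("d1", "d3"): ["A", "A", "A", "B", "A"]}
agg = aggregate(judgments)
```

    The same check applied to a preference cycle (d1 > d2 > d3 > d1) returns False, which is exactly the failure mode the paper probes for in crowdsourced judgments.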

    Relevance Judgment Convergence Degree – A Measure of Inconsistency among Assessors for Information Retrieval

    Relevance judgment by human assessors is inherently subjective and dynamic when evaluation datasets are created for Information Retrieval (IR) systems. However, a small group of experts’ relevance judgments is usually taken as ground truth to “objectively” evaluate the performance of IR systems. Recent trends employ a larger group of judges, for example through outsourcing, to alleviate the potentially biased judgments stemming from using only a single expert’s judgment. Nevertheless, different judges may hold different opinions and may not agree with each other, and this inconsistency in human relevance judgment may affect IR system evaluation results. In this research, we introduce a Relevance Judgment Convergence Degree (RJCD) to measure the quality of queries in evaluation datasets. Experimental results reveal a strong correlation between the proposed RJCD score and the performance differences between two IR systems.
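    The abstract does not give the RJCD formula, so the sketch below uses a hypothetical stand-in convergence measure, the average fraction of assessors agreeing with each document's majority label, purely to illustrate the kind of per-query quantity being measured (this is NOT the paper's definition):

```python
from collections import Counter

def convergence_degree(labels_per_doc):
    """Illustrative stand-in for a judgment-convergence score: for each
    document judged for a query, the fraction of assessors agreeing with
    the majority label, averaged over all documents. 1.0 means the
    assessors converged perfectly; values near 1/|labels| mean they
    effectively disagreed at random."""
    per_doc = []
    for labels in labels_per_doc:
        majority_count = Counter(labels).most_common(1)[0][1]
        per_doc.append(majority_count / len(labels))
    return sum(per_doc) / len(per_doc)

# Three assessors giving binary relevance labels to two documents
# for a single (hypothetical) query.
score = convergence_degree([[1, 1, 1], [0, 0, 1]])
```

    A query whose documents all receive unanimous labels scores 1.0 under this stand-in; low-scoring queries are the ones whose inconsistent judgments could flip a system comparison.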

    Assessing Relevance of Tweets for Risk Communication

    Although Twitter is used for emergency management activities, the relevance of tweets during a hazard event is still open to debate. In this study, six different computational (i.e., Natural Language Processing) and spatiotemporal analytical approaches were implemented to assess the relevance of risk information extracted from tweets obtained during the 2013 Colorado flood event. Primarily, tweets containing information about the flooding events and their impacts were analysed. Examination of the relationships between tweet volume and content on the one hand, and precipitation amount, damage extent, and official reports on the other, revealed that relevant tweets provided information about the event and its impacts rather than the other risk information that the public expects to receive via alert messages. However, only 14% of the geo-tagged tweets and only 0.06% of the total fire-hose tweets were found to be relevant to the event. By providing insight into the quality of social media data and its usefulness for emergency management activities, this study contributes to the literature on the quality of big data. Future research in this area will focus on assessing the reliability of relevant tweets for disaster-related situational awareness.

    A comparison of primary and secondary relevance judgements for real-life topics

    The notion of relevance is fundamental to the field of Information Retrieval. Within the field, a generally accepted conception of relevance as inherently subjective has emerged, with an individual’s assessment of relevance influenced by numerous contextual factors. In this paper we present a user study that examines in detail the differences between primary and secondary assessors on a set of “real-world” topics gathered specifically for this work. By gathering topics that are representative of the staff and students at a major university at a particular point in time, we aim to explore differences between primary and secondary relevance judgements for real-life search tasks. Findings suggest that while secondary assessors may find the assessment task challenging in various ways (they generally possess less interest in and knowledge of the topics, and take longer to assess documents), agreement between primary and secondary assessors is high.
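    Assessor agreement of the kind reported above is commonly quantified with a chance-corrected statistic such as Cohen's kappa; a stdlib sketch on invented binary labels (the abstract does not state which agreement measure the study uses):

```python
def cohens_kappa(a, b):
    # Chance-corrected agreement between two assessors' label lists:
    # (observed agreement - expected-by-chance agreement) / (1 - expected).
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    return (observed - expected) / (1 - expected)

# Hypothetical binary relevance judgements from a primary and a
# secondary assessor over eight documents.
primary   = [1, 1, 0, 1, 0, 0, 1, 1]
secondary = [1, 1, 0, 1, 0, 1, 1, 1]
kappa = cohens_kappa(primary, secondary)
```

    Raw percentage agreement overstates consistency when one label dominates; kappa discounts the agreement two assessors would reach by labelling at random with their observed label frequencies.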