245 research outputs found
Cheap IR Evaluation: Fewer Topics, No Relevance Judgements, and Crowdsourced Assessments
To evaluate Information Retrieval (IR) effectiveness, a possible approach is
to use test collections, which are composed of a collection of documents, a set
of description of information needs (called topics), and a set of relevant
documents to each topic. Test collections are modelled in a competition
scenario: for example, in the well known TREC initiative, participants run
their own retrieval systems over a set of topics and they provide a ranked list
of retrieved documents; some of the retrieved documents (usually the first
ranked) constitute the so called pool, and their relevance is evaluated by
human assessors; the document list is then used to compute effectiveness
metrics and rank the participant systems. Private Web Search companies also run
their in-house evaluation exercises; although the details are mostly unknown,
and the aims are somehow different, the overall approach shares several issues
with the test collection approach.
The aim of this work is to: (i) develop and improve some state-of-the-art
work on the evaluation of IR effectiveness while saving resources, and (ii)
propose a novel, more principled and engineered, overall approach to test
collection based effectiveness evaluation.
[...
Metrology: Measurement Assurance Program Guidelines
The 5300.4 series of NASA Handbooks for Reliability and Quality Assurance Programs have provisions for the establishment and utilization of a documented metrology system to control measurement processes and to provide objective evidence of quality conformance. The intent of these provisions is to assure consistency and conformance to specifications and tolerances of equipment, systems, materials, and processes procured and/or used by NASA, its international partners, contractors, subcontractors, and suppliers. This Measurement Assurance Program (MAP) guideline has the specific objectives to: (1) ensure the quality of measurements made within NASA programs; (2) establish realistic measurement process uncertainties; (3) maintain continuous control over the measurement processes; and (4) ensure measurement compatibility among NASA facilities. The publication addresses MAP methods as applied within and among NASA installations and serves as a guide to: control measurement processes at the local level (one facility); conduct measurement assurance programs in which a number of field installations are joint participants; and conduct measurement integrity (round robin) experiments in which a number of field installations participate to assess the overall quality of particular measurement processes at a point in time
Compare statistical significance tests for information retrieval evaluation
Preprint of our Journal of the Association for Information Science and Technology (JASIST) paper[Abstract] Statistical significance tests can provide evidence that the observed difference in performance between two methods is not due to chance. In Information Retrieval, some studies have examined the validity and suitability of such tests for comparing search systems.We argue here that current methods for assessing the reliability of statistical tests suffer from some methodological weaknesses, and we propose a novel way to study significance tests for retrieval evaluation. Using Score Distributions, we model the output of multiple search systems, produce simulated search results from such models, and compare them using various significance tests. A key strength of this approach is that we assess statistical tests under perfect knowledge about the truth or falseness of the null hypothesis. This new method for studying the power of significance tests in Information Retrieval evaluation is formal and innovative. Following this type of analysis, we found that both the sign test and Wilcoxon signed test have more power than the permutation test and the t-test. The sign test and Wilcoxon signed test also have a good behavior in terms of type I errors. The bootstrap test shows few type I errors, but it has less power than the other methods tested.Ministerio de Econom´ıa y Competitividad; TIN2015-64282-RXunta de Galicia; GPC 2016/035Xunta de Galicia; ED431G/01Xunta de Galicia; ED431G/0
Use Case Oriented Medical Visual Information Retrieval & System Evaluation
Large amounts of medical visual data are produced daily in hospitals, while new imaging techniques continue to emerge. In addition, many images are made available continuously via publications in the scientific literature and can also be valuable for clinical routine, research and education. Information retrieval systems are useful tools to provide access to the biomedical literature and fulfil the information needs of medical professionals. The tools developed in this thesis can potentially help clinicians make decisions about difficult diagnoses via a case-based retrieval system based on a use case associated with a specific evaluation task. This system retrieves articles from the biomedical literature when querying with a case description and attached images. This thesis proposes a multimodal approach for medical case-based retrieval with focus on the integration of visual information connected to text. Furthermore, the ImageCLEFmed evaluation campaign was organised during this thesis promoting medical retrieval system evaluation
Crowdsourcing Relevance: Two Studies on Assessment
Crowdsourcing has become an alternative approach to collect relevance judgments at large scale. In this thesis, we focus on some specific aspects related to time, scale, and agreement.
First, we address the issue of the time factor in gathering relevance label: we study how much time the judges need to assess documents. We conduct a series of four experiments which unexpectedly reveal us how introducing time limitations leads to benefits in terms of the quality of the results. Furthermore, we discuss strategies aimed to determine the right amount of time to make available to the workers for the relevance assessment, in order to both guarantee the high quality of the gathered results and the saving of the valuable resources of time and money.
Then we explore the application of magnitude estimation, a psychophysical scaling technique for the measurement of sensation, for relevance assessment. We conduct a large-scale user study across 18 TREC topics, collecting more than 50,000 magnitude estimation judgments, which result to be overall rank-aligned with ordinal judgments made by expert relevance assessors. We discuss the benefits, the reliability of the judgements collected, and the competitiveness in terms of assessor cost.
We also report some preliminary results on the agreement among judges. Often, the results of crowdsourcing experiments are affected by noise, that can be ascribed to lack of agreement among workers. This aspect should be considered as it can affect the reliability of the gathered relevance labels, as well as the overall repeatability of the experiments.openDottorato di ricerca in Informatica e scienze matematiche e fisicheopenMaddalena, Edd
An efficient, practical, portable mapping technique on computational grids
Grid computing provides a powerful, virtual parallel system known as a computational Grid on which users can run parallel applications to solve problems quickly. However, users must be careful to allocate tasks to nodes properly because improper allocation of only one task could result in lengthy executions of applications, or even worse, applications could crash. This allocation problem is called the mapping problem, and an entity that tackles this problem is called a mapper. In this thesis, we aim to develop an efficient, practical, portable mapper. To study the mapping problem, researchers often make unrealistic assumptions such as that nodes of Grids are always reliable, that execution times of tasks assigned to nodes are known a priori, or that detailed information of parallel applications is always known. As a result, the practicality and portability of mappers developed in such conditions are uncertain. Our review of related work suggested that a more efficient tool is required to study this problem; therefore, we developed GMap, a simulator researchers/developers can use to develop practical, portable mappers. The fact that nodes are not always reliable leads to the development of an algorithm for predicting the reliability of nodes and a predictor for identifying reliable nodes of Grids. Experimental results showed that the predictor reduced the chance of failures in executions of applications by half. The facts that execution times of tasks assigned to nodes are not known a priori and that detailed information of parallel applications is not alw ays known, lead to the evaluation of five nearest-neighbour (nn) execution time estimators: k-nn smoothing, k-nn, adaptive k-nn, one-nn, and adaptive one-nn. Experimental results showed that adaptive k-nn was the most efficient one. We also implemented the predictor and the estimator in GMap. Using GMap, we could reliably compare the efficiency of six mapping algorithms: Min-min, Max-min, Genetic Algorithms, Simulated Annealing, Tabu Search, and Quick-quality Map, with none of the preceding unrealistic assumptions. Experimental results showed that Quick-quality Map was the most efficient one. As a result of these findings, we achieved our goal in developing an efficient, practical, portable mapper
Statistical Significance Testing in Information Retrieval: An Empirical Analysis of Type I, Type II and Type III Errors
Statistical significance testing is widely accepted as a means to assess how
well a difference in effectiveness reflects an actual difference between
systems, as opposed to random noise because of the selection of topics.
According to recent surveys on SIGIR, CIKM, ECIR and TOIS papers, the t-test is
the most popular choice among IR researchers. However, previous work has
suggested computer intensive tests like the bootstrap or the permutation test,
based mainly on theoretical arguments. On empirical grounds, others have
suggested non-parametric alternatives such as the Wilcoxon test. Indeed, the
question of which tests we should use has accompanied IR and related fields for
decades now. Previous theoretical studies on this matter were limited in that
we know that test assumptions are not met in IR experiments, and empirical
studies were limited in that we do not have the necessary control over the null
hypotheses to compute actual Type I and Type II error rates under realistic
conditions. Therefore, not only is it unclear which test to use, but also how
much trust we should put in them. In contrast to past studies, in this paper we
employ a recent simulation methodology from TREC data to go around these
limitations. Our study comprises over 500 million p-values computed for a range
of tests, systems, effectiveness measures, topic set sizes and effect sizes,
and for both the 2-tail and 1-tail cases. Having such a large supply of IR
evaluation data with full knowledge of the null hypotheses, we are finally in a
position to evaluate how well statistical significance tests really behave with
IR data, and make sound recommendations for practitioners.Comment: 10 pages, 6 figures, SIGIR 201
Dynamic re-optimization techniques for stream processing engines and object stores
Large scale data storage and processing systems are strongly motivated by the need to store and analyze massive datasets. The complexity of a large class of these systems is rooted in their distributed nature, extreme scale, need for real-time response, and streaming nature. The use of these systems on multi-tenant, cloud environments with potential resource interference necessitates fine-grained monitoring and control. In this dissertation, we present efficient, dynamic techniques for re-optimizing stream-processing systems and transactional object-storage systems.^ In the context of stream-processing systems, we present VAYU, a per-topology controller. VAYU uses novel methods and protocols for dynamic, network-aware tuple-routing in the dataflow. We show that the feedback-driven controller in VAYU helps achieve high pipeline throughput over long execution periods, as it dynamically detects and diagnoses any pipeline-bottlenecks. We present novel heuristics to optimize overlays for group communication operations in the streaming model.^ In the context of object-storage systems, we present M-Lock, a novel lock-localization service for distributed transaction protocols on scale-out object stores to increase transaction throughput. Lock localization refers to dynamic migration and partitioning of locks across nodes in the scale-out store to reduce cross-partition acquisition of locks. The service leverages the observed object-access patterns to achieve lock-clustering and deliver high performance. We also present TransMR, a framework that uses distributed, transactional object stores to orchestrate and execute asynchronous components in amorphous data-parallel applications on scale-out architectures
Towards Meaningful Statements in IR Evaluation. Mapping Evaluation Measures to Interval Scales
Recently, it was shown that most popular IR measures are not interval-scaled,
implying that decades of experimental IR research used potentially improper
methods, which may have produced questionable results. However, it was unclear
if and to what extent these findings apply to actual evaluations and this
opened a debate in the community with researchers standing on opposite
positions about whether this should be considered an issue (or not) and to what
extent.
In this paper, we first give an introduction to the representational
measurement theory explaining why certain operations and significance tests are
permissible only with scales of a certain level. For that, we introduce the
notion of meaningfulness specifying the conditions under which the truth (or
falsity) of a statement is invariant under permissible transformations of a
scale. Furthermore, we show how the recall base and the length of the run may
make comparison and aggregation across topics problematic.
Then we propose a straightforward and powerful approach for turning an
evaluation measure into an interval scale, and describe an experimental
evaluation of the differences between using the original measures and the
interval-scaled ones.
For all the regarded measures - namely Precision, Recall, Average Precision,
(Normalized) Discounted Cumulative Gain, Rank-Biased Precision and Reciprocal
Rank - we observe substantial effects, both on the order of average values and
on the outcome of significance tests. For the latter, previously significant
differences turn out to be insignificant, while insignificant ones become
significant. The effect varies remarkably between the tests considered but
overall, on average, we observed a 25% change in the decision about which
systems are significantly different and which are not
- …