31,231 research outputs found
Statistical Significance Testing in Information Retrieval: An Empirical Analysis of Type I, Type II and Type III Errors
Statistical significance testing is widely accepted as a means to assess how
well a difference in effectiveness reflects an actual difference between
systems, as opposed to random noise because of the selection of topics.
According to recent surveys on SIGIR, CIKM, ECIR and TOIS papers, the t-test is
the most popular choice among IR researchers. However, previous work has
suggested computer intensive tests like the bootstrap or the permutation test,
based mainly on theoretical arguments. On empirical grounds, others have
suggested non-parametric alternatives such as the Wilcoxon test. Indeed, the
question of which tests we should use has accompanied IR and related fields for
decades now. Previous theoretical studies on this matter were limited in that
we know that test assumptions are not met in IR experiments, and empirical
studies were limited in that we do not have the necessary control over the null
hypotheses to compute actual Type I and Type II error rates under realistic
conditions. Therefore, not only is it unclear which test to use, but also how
much trust we should put in them. In contrast to past studies, in this paper we
employ a recent simulation methodology from TREC data to go around these
limitations. Our study comprises over 500 million p-values computed for a range
of tests, systems, effectiveness measures, topic set sizes and effect sizes,
and for both the 2-tail and 1-tail cases. Having such a large supply of IR
evaluation data with full knowledge of the null hypotheses, we are finally in a
position to evaluate how well statistical significance tests really behave with
IR data, and make sound recommendations for practitioners.Comment: 10 pages, 6 figures, SIGIR 201
Rhetorical relations for information retrieval
Typically, every part in most coherent text has some plausible reason for its
presence, some function that it performs to the overall semantics of the text.
Rhetorical relations, e.g. contrast, cause, explanation, describe how the parts
of a text are linked to each other. Knowledge about this socalled discourse
structure has been applied successfully to several natural language processing
tasks. This work studies the use of rhetorical relations for Information
Retrieval (IR): Is there a correlation between certain rhetorical relations and
retrieval performance? Can knowledge about a document's rhetorical relations be
useful to IR? We present a language model modification that considers
rhetorical relations when estimating the relevance of a document to a query.
Empirical evaluation of different versions of our model on TREC settings shows
that certain rhetorical relations can benefit retrieval effectiveness notably
(> 10% in mean average precision over a state-of-the-art baseline)
Rapid automatized naming and reading performance: a meta-analysis
Evidence that rapid naming skill is associated with reading ability has become increasingly prevalent in recent years. However, there is considerable variation in the literature concerning the magnitude of this relationship. The objective of the present study was to provide a comprehensive analysis of the evidence on the relationship between rapid automatized naming (RAN) and reading performance. To this end, we conducted a meta-analysis of the correlational relationship between these 2 constructs to (a) determine the overall strength of the RAN-reading association and (b) identify variables that systematically moderate this relationship. A random-effects model analysis of data from 137 studies (857 effect sizes; 28,826 participants) indicated a moderate-to-strong relationship between RAN and reading performance (r = .43, I-2 = 68.40). Further analyses revealed that RAN contributes to the 4 measures of reading (word reading, text reading, non-word reading, and reading comprehension), but higher coefficients emerged in favor of real word reading and text reading. RAN stimulus type and type of reading score were the factors with the greatest moderator effect on the magnitude of the RAN-reading relationship. The consistency of orthography and the subjects' grade level were also found to impact this relationship, although the effect was contingent on reading outcome. It was less evident whether the subjects' reading proficiency played a role in the relationship. Implications for future studies are discussed
Analysis of Network Clustering Algorithms and Cluster Quality Metrics at Scale
Notions of community quality underlie network clustering. While studies
surrounding network clustering are increasingly common, a precise understanding
of the realtionship between different cluster quality metrics is unknown. In
this paper, we examine the relationship between stand-alone cluster quality
metrics and information recovery metrics through a rigorous analysis of four
widely-used network clustering algorithms -- Louvain, Infomap, label
propagation, and smart local moving. We consider the stand-alone quality
metrics of modularity, conductance, and coverage, and we consider the
information recovery metrics of adjusted Rand score, normalized mutual
information, and a variant of normalized mutual information used in previous
work. Our study includes both synthetic graphs and empirical data sets of sizes
varying from 1,000 to 1,000,000 nodes.
We find significant differences among the results of the different cluster
quality metrics. For example, clustering algorithms can return a value of 0.4
out of 1 on modularity but score 0 out of 1 on information recovery. We find
conductance, though imperfect, to be the stand-alone quality metric that best
indicates performance on information recovery metrics. Our study shows that the
variant of normalized mutual information used in previous work cannot be
assumed to differ only slightly from traditional normalized mutual information.
Smart local moving is the best performing algorithm in our study, but
discrepancies between cluster evaluation metrics prevent us from declaring it
absolutely superior. Louvain performed better than Infomap in nearly all the
tests in our study, contradicting the results of previous work in which Infomap
was superior to Louvain. We find that although label propagation performs
poorly when clusters are less clearly defined, it scales efficiently and
accurately to large graphs with well-defined clusters
- …