Implications of Inter-Rater Agreement on a Student Information Retrieval Evaluation
This paper is about an information retrieval evaluation of three different retrieval-supporting services. All three services were designed to compensate for typical problems that arise in metadata-driven Digital Libraries and are not adequately handled by simple tf-idf based retrieval. The services are: (1) a co-word-analysis-based query expansion mechanism, and re-ranking via (2) Bradfordizing and (3) author centrality. The services are evaluated with relevance assessments conducted by 73 information science students. Since the students are neither information professionals nor domain experts, the question of inter-rater agreement is taken into consideration. Two important implications emerge: (1) the inter-rater agreement rates were mainly fair to moderate, and (2) after a data-cleaning step that removed the assessments with poor agreement rates, the evaluation data show that the three retrieval services returned disjoint but still relevant result sets.
Comment: 7 pages, 3 figures, LWA 2010, Workshop I
Human assessments of document similarity
Two studies are reported that examined the reliability of human assessments of document similarity and the association between human ratings and the results of n-gram automatic text analysis (ATA). Human interassessor reliability (IAR) was moderate to poor. However, correlations between average human ratings and n-gram solutions were strong. The average correlation between ATA and individual human solutions was greater than IAR. N-gram length influenced the strength of association, but optimum string length depended on the nature of the text (technical vs. nontechnical). We conclude that the methodology applied in previous studies may have led to overoptimistic views on human reliability, but that an optimal n-gram solution can provide a good approximation of the average human assessment of document similarity, a result that has important implications for the future development of document visualization systems.
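The n-gram text analysis described above can be approximated by comparing character n-gram frequency profiles with a cosine measure. A minimal sketch, with invented document snippets (a real study would tune n per the technical/nontechnical distinction noted above):

```python
from collections import Counter
from math import sqrt

def ngram_profile(text, n=3):
    """Character n-gram frequency profile of a document."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(p, q):
    """Cosine similarity between two n-gram frequency profiles."""
    dot = sum(p[g] * q[g] for g in p if g in q)
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

# Invented snippets: the first two share vocabulary, the third does not.
doc_a = "automatic text analysis of documents"
doc_b = "automatic analysis of text documents"
doc_c = "residential treatment of emotionally disturbed boys"
sim_ab = cosine(ngram_profile(doc_a), ngram_profile(doc_b))
sim_ac = cosine(ngram_profile(doc_a), ngram_profile(doc_c))
```

As expected, `sim_ab` is far higher than `sim_ac`; varying `n` in `ngram_profile` reproduces the string-length sensitivity the abstract reports.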
Applying Science Models for Search
The paper proposes three different kinds of science models as value-added services that are integrated into the retrieval process to enhance retrieval quality. The paper discusses the approaches Search Term Recommendation, Bradfordizing, and Author Centrality on a general level and addresses implementation issues of the models within a real-life retrieval environment.
Comment: 14 pages, 3 figures, ISI 201
Applying latent semantic analysis to computer assisted assessment in the Computer Science domain: a framework, a tool, and an evaluation
This dissertation argues that automated assessment systems can be useful for both students and educators provided that the results correspond well with human markers. Thus, evaluating such a system is crucial. I present an evaluation framework and show how and why it can be useful for both producers and consumers of automated assessment systems. The framework is a refinement of a research taxonomy that came out of the effort to analyse the literature on systems based on Latent Semantic Analysis (LSA), a statistical natural language processing technique that has been used for automated assessment of essays. The evaluation framework can help developers publish their results in a format that is comprehensive, relatively compact, and useful to other researchers.
The thesis claims that, in order to see a complete picture of an automated assessment system, certain pieces must be emphasised. It presents the framework as a jigsaw puzzle whose pieces join together to form the whole picture.
The dissertation uses the framework to compare the accuracy of human markers and EMMA, the LSA-based assessment system I wrote as part of this dissertation. EMMA marks short, free text answers in the domain of computer science. I conducted a study of five human markers and then used the results as a benchmark against which to evaluate EMMA. An integral part of the evaluation was the success metric. The standard inter-rater reliability statistic was not useful; I located a new statistic and applied it to the domain of computer assisted assessment for the first time, as far as I know.
Although EMMA exceeds human markers on a few questions, overall it does not achieve the same level of agreement with humans as humans do with each other. The last chapter maps out a plan for further research to improve EMMA.
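The core of an LSA-based marker like the one described above is a truncated SVD of a term-document matrix: student answers are folded into the latent space and compared to reference answers by cosine similarity. A toy sketch, with an invented six-term, two-answer matrix (a real system trains on a large corpus and keeps far fewer dimensions than the matrix rank):

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = two reference answers
# (a "stack" answer and a "queue" answer). All counts are invented.
terms = ["stack", "queue", "lifo", "fifo", "push", "pop"]
docs = np.array([
    [2, 0],  # stack
    [0, 2],  # queue
    [1, 0],  # lifo
    [0, 1],  # fifo
    [1, 1],  # push
    [1, 1],  # pop
], dtype=float)

# Truncated SVD; a real system keeps k much smaller than the corpus rank.
U, s, Vt = np.linalg.svd(docs, full_matrices=False)
k = 2
Uk, sk = U[:, :k], s[:k]

def fold_in(term_vector):
    """Project a (pseudo-)document's term vector into the latent space."""
    return term_vector @ Uk / sk

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# A hypothetical student answer mentioning "stack", "lifo", "push".
answer = np.array([1, 0, 1, 0, 1, 0], dtype=float)
ref_stack, ref_queue = fold_in(docs[:, 0]), fold_in(docs[:, 1])
ans = fold_in(answer)
# The answer lands closer to the "stack" reference than to the "queue" one.
```

The latent comparison is what lets LSA credit answers that use related vocabulary rather than exact keyword matches; the inter-rater statistics discussed above then judge how well such scores track human markers.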
The Impact of Residential Treatment on Emotionally Disturbed Boys
Within the past four decades, social work has witnessed the development of increasingly specialized services to children, among these a sort of “total impact therapy” generally defined as residential treatment. In conjunction with the basic social work values of the bio-psycho-social nature of human maladjustment, residential centres have attempted to help the child effect a happier adjustment to his life situation by meeting some ungratified basic need. Institutions for dependent children complemented those for custodial care or even isolation; contemporary residential treatment centres are designed to meet a broader range of needs of the child than those of forty years ago through a variety of approaches, often referred to as milieu therapy. Consideration of the common needs of children is basic to questions concerning the place of institutional treatment and the particular type of child for which this social work service is the most appropriate one.
The residential treatment centre addresses the whole gamut of a child’s needs from physical care to rehabilitation. Exposure to, and participation in, a group life experience simulating as closely as possible the family or community life experience is the element differentiating residential care from other treatment modes. By involvement in the realities of his daily situation and the working through or resolution of these, the child is helped to cope with his own growth and development—physical, emotional, and social.
Problems and questions examined in this paper revolve around the residential treatment centre, defined broadly by the Child Welfare League of America as “A building....maintained and operated by a chartered agency, organization or institution, whose main purpose is to provide shelter and care to a group of unrelated children and youths up to eighteen years of age.” More specifically, the concern for research, the proposal, and the plans for implementation are focused on Mount St. Joseph, an autonomous, non-profit institution providing care for boys with moderate to severe emotional disturbances.
A comparison of homonym meaning frequency estimates derived from movie and television subtitles, free association, and explicit ratings
First Online: 10 September 2018
Most words are ambiguous, with interpretation dependent on context. Advancing theories of ambiguity resolution is important for any general theory of language processing, and for resolving inconsistencies in observed ambiguity effects across experimental tasks. Focusing on homonyms (words such as bank with unrelated meanings EDGE OF A RIVER vs. FINANCIAL INSTITUTION), the present work advances theories and methods for estimating the relative frequency of their meanings, a factor that shapes observed ambiguity effects. We develop a new method for estimating meaning frequency based on the meaning of a homonym evoked in lines of movie and television subtitles according to human raters. We also replicate and extend a measure of meaning frequency derived from the classification of free associates. We evaluate the internal consistency of these measures, compare them to published estimates based on explicit ratings of each meaning’s frequency, and compare each set of norms in predicting performance in lexical and semantic decision mega-studies. All measures have high internal consistency and show agreement, but each is also associated with unique variance, which may be explained by integrating cognitive theories of memory with the demands of different experimental methodologies. To derive frequency estimates, we collected manual classifications of 533 homonyms over 50,000 lines of subtitles, and of 357 homonyms across over 5000 homonym–associate pairs. This database—publicly available at: www.blairarmstrong.net/homonymnorms/—constitutes a novel resource for computational cognitive modeling and computational linguistics, and we offer suggestions around good practices for its use in training and testing models on labeled data.
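The classification-based estimate described above reduces, at its core, to counting which meaning raters assign to each context and normalizing. A minimal sketch with invented labels for the homonym "bank":

```python
from collections import Counter

def meaning_frequencies(labels):
    """Relative frequency of each meaning across labeled contexts."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {meaning: n / total for meaning, n in counts.items()}

# Hypothetical rater classifications of ten subtitle lines containing "bank";
# each line is labeled with the meaning it evokes.
labels = ["FINANCIAL", "FINANCIAL", "RIVER", "FINANCIAL", "FINANCIAL",
          "RIVER", "FINANCIAL", "FINANCIAL", "OTHER", "FINANCIAL"]
freqs = meaning_frequencies(labels)  # e.g. FINANCIAL 0.7, RIVER 0.2, OTHER 0.1
```

The published norms aggregate many raters per line and many lines per homonym, but the resulting relative-frequency estimate has exactly this shape.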
Science Models as Value-Added Services for Scholarly Information Systems
The paper introduces scholarly Information Retrieval (IR) as a further dimension that should be considered in the science modeling debate. The IR use case is seen as a validation model of the adequacy of science models in representing and predicting structure and dynamics in science. Particular conceptualizations of scholarly activity and structures in science are used as value-added search services to improve retrieval quality: a co-word model depicting the cognitive structure of a field (used for query expansion), the Bradford law of information concentration, and a model of co-authorship networks (both used for re-ranking search results). An evaluation of retrieval quality when the science-model-driven services are used showed that the proposed models indeed benefit retrieval quality. From an IR perspective, the models studied are therefore verified as expressive conceptualizations of central phenomena in science. Thus, it could be shown that the IR perspective can significantly contribute to a better understanding of scholarly structures and activities.
Comment: 26 pages, to appear in Scientometric
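Of the re-ranking services named above, Bradfordizing is the most direct to sketch: results are reordered so that documents from the most productive journals (Bradford's core zone) surface first. A minimal illustration, with an invented result list (real implementations compute journal productivity over the whole result set or corpus):

```python
from collections import Counter

def bradfordize(results):
    """Re-rank (doc_id, journal) pairs so documents from the most
    productive journals in this result set come first. Python's sort is
    stable, so ties keep the original (e.g. tf-idf) order."""
    journal_counts = Counter(journal for _, journal in results)
    return sorted(results, key=lambda r: -journal_counts[r[1]])

# Hypothetical tf-idf-ranked hits as (doc_id, journal) pairs.
hits = [("d1", "J-Rare"), ("d2", "J-Core"), ("d3", "J-Core"),
        ("d4", "J-Mid"), ("d5", "J-Core"), ("d6", "J-Mid")]
reranked = bradfordize(hits)
# Core-journal documents d2, d3, d5 move ahead of d4, d6, and finally d1.
```

The author-centrality service works analogously, substituting a centrality score from the co-authorship network for the journal productivity count in the sort key.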
Heuristic Principles and Differential Judgments in the Assessment of Information Quality
Information quality (IQ) is a multidimensional construct and includes dimensions such as accuracy, completeness, objectivity, and representation that are difficult to measure. Recently, research has shown that independent assessors who rated IQ yielded high inter-rater agreement for some information quality dimensions as opposed to others. In this paper, we explore the reasons that underlie the differences in the “measurability” of IQ. Employing Gigerenzer’s “building blocks” framework, we conjecture that the feasibility of using a set of heuristic principles consistently when assessing different dimensions of IQ is a key factor driving inter-rater agreement in IQ judgments. We report on two studies. In the first study, we qualitatively explored the manner in which participants applied the heuristic principles of search rules, stopping rules, and decision rules in assessing the IQ dimensions of accuracy, completeness, objectivity, and representation. In the second study, we investigated the extent to which participants could reach an agreement in rating the quality of Wikipedia articles along these dimensions. Our findings show an alignment between the consistent application of heuristic principles and the inter-rater agreement levels found on particular dimensions of IQ judgments. Specifically, on the dimensions of completeness and representation, assessors applied the heuristic principles consistently and tended to agree in their ratings, whereas, on the dimensions of accuracy and objectivity, they did not apply the heuristic principles in a uniform manner and inter-rater agreement was relatively low. We discuss the implications of our findings for research and practice.