Search CORE

154 research outputs found

Measuring inter-indexer consistency using a thesaurus

Author: Medelyan Olena
Witten Ian H.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2006
Field of study

When professional indexers independently assign terms to a given document, the term sets generally differ between indexers. Studies of inter-indexer consistency measure the percentage of matching index terms, but none of them consider the semantic relationships that exist amongst these terms. We propose to represent multiple-indexers data in a vector space and use the cosine metric as a new consistency measure that can be extended by semantic relations between index terms. We believe that this new measure is more accurate and realistic than existing ones and therefore more suitable for evaluation of automatically extracted index terms

Crossref

Research Commons@Waikato

Evaluation of controlled vocabularies by inter-indexer consistency

Author: Gil-Leiva Isidoro
Soler-Monreal Concha
Publication venue
Publication date: 01/01/2011
Field of study

Introduction. Several controlled vocabularies are used for indexing three journal articles to check if with a list of descriptors are achieved better or equals of consistency rates that with a standard thesaurus and augmented thesaurus. Method. A set of terminology of Library and Information Science was used to build a list of descriptors with equivalence relations (USE and UF), a standard thesaurus and a augmented thesaurus (all the descriptors have scope notes). Subsequently, three articles were indexed by selected indexers who had varying degrees of experience – on the one hand Library and Information Science students and on the other, professionals from various documentation centres. Hooper’s measure to find the consistency between pairs of novice indexers and experts has been applied. Analysis. Data were tabulated and analysed systematically according pairs of novice indexers and experts has been applied. Results. The tool with the best results is the list of descriptors (39.5% consistency), followed by the augmented thesaurus (29.8%) and, with an almost identical value, the standard thesaurus (27.5%). Conclusion. It is concluded that the list of descriptors in both groups returns better indexing consistency but we need more research

E-LIS

Ensuring the discoverability of digital images for social work education : an online tagging survey to test controlled vocabularies

Author: Ballantyne Neil
Daly Ellen Katherine
Publication venue
Publication date: 01/01/2009
Field of study

The digital age has transformed access to all kinds of educational content not only in text-based format but also digital images and other media. As learning technologists and librarians begin to organise these new media into digital collections for educational purposes, older problems associated with cataloguing and classifying non-text media have re-emerged. At the heart of this issue is the problem of describing complex and highly subjective images in a reliable and consistent manner. This paper reports on the findings of research designed to test the suitability of two controlled vocabularies to index and thereby improve the discoverability of images stored in the Learning Exchange, a repository for social work education and research. An online survey asked respondents to "tag", a series of images and responses were mapped against the two controlled vocabularies. Findings showed that a large proportion of user generated tags could be mapped to the controlled vocabulary terms (or their equivalents). The implications of these findings for indexing and discovering content are discussed in the context of a wider review of the literature on "folksonomies" (or user tagging) versus taxonomies and controlled vocabularies

University of Strathclyde Institutional Repository

Thesaurus-based index term extraction for agricultural documents

Author: Medelyan Olena
Witten Ian H.
Publication venue: EFITA/WICCA
Publication date: 01/01/2005
Field of study

This paper describes a new algorithm for automatically extracting index terms from documents relating to the domain of agriculture. The domain-specific Agrovoc thesaurus developed by the FAO is used both as a controlled vocabulary and as a knowledge base for semantic matching. The automatically assigned terms are evaluated against a manually indexed 200-item sample of the FAO’s document repository, and the performance of the new algorithm is compared with a state-of-the-art system for keyphrase extraction

CiteSeerX

Research Commons@Waikato

Usefulness of social tagging in organizing and providing access to the web: An analysis of indexing consistency and quality

Author: Choi Yunseon
Publication venue
Publication date: 01/08/2011
Field of study

This dissertation research points out major challenging problems with current Knowledge Organization (KO) systems, such as subject gateways or web directories: (1) the current systems use traditional knowledge organization systems based on controlled vocabulary which is not very well suited to web resources, and (2) information is organized by professionals not by users, which means it does not reflect intuitively and instantaneously expressed users’ current needs. In order to explore users’ needs, I examined social tags which are user-generated uncontrolled vocabulary. As investment in professionally-developed subject gateways and web directories diminishes (support for both BUBL and Intute, examined in this study, is being discontinued), understanding characteristics of social tagging becomes even more critical. Several researchers have discussed social tagging behavior and its usefulness for classification or retrieval; however, further research is needed to qualitatively and quantitatively investigate social tagging in order to verify its quality and benefit. This research particularly examined the indexing consistency of social tagging in comparison to professional indexing to examine the quality and efficacy of tagging. The data analysis was divided into three phases: analysis of indexing consistency, analysis of tagging effectiveness, and analysis of tag attributes. Most indexing consistency studies have been conducted with a small number of professional indexers, and they tended to exclude users. Furthermore, the studies mainly have focused on physical library collections. This dissertation research bridged these gaps by (1) extending the scope of resources to various web documents indexed by users and (2) employing the Information Retrieval (IR) Vector Space Model (VSM) - based indexing consistency method since it is suitable for dealing with a large number of indexers. As a second phase, an analysis of tagging effectiveness with tagging exhaustivity and tag specificity was conducted to ameliorate the drawbacks of consistency analysis based on only the quantitative measures of vocabulary matching. Finally, to investigate tagging pattern and behaviors, a content analysis on tag attributes was conducted based on the FRBR model. The findings revealed that there was greater consistency over all subjects among taggers compared to that for two groups of professionals. The analysis of tagging exhaustivity and tag specificity in relation to tagging effectiveness was conducted to ameliorate difficulties associated with limitations in the analysis of indexing consistency based on only the quantitative measures of vocabulary matching. Examination of exhaustivity and specificity of social tags provided insights into particular characteristics of tagging behavior and its variation across subjects. To further investigate the quality of tags, a Latent Semantic Analysis (LSA) was conducted to determine to what extent tags are conceptually related to professionals’ keywords and it was found that tags of higher specificity tended to have a higher semantic relatedness to professionals’ keywords. This leads to the conclusion that the term’s power as a differentiator is related to its semantic relatedness to documents. The findings on tag attributes identified the important bibliographic attributes of tags beyond describing subjects or topics of a document. The findings also showed that tags have essential attributes matching those defined in FRBR. Furthermore, in terms of specific subject areas, the findings originally identified that taggers exhibited different tagging behaviors representing distinctive features and tendencies on web documents characterizing digital heterogeneous media resources. These results have led to the conclusion that there should be an increased awareness of diverse user needs by subject in order to improve metadata in practical applications. This dissertation research is the first necessary step to utilize social tagging in digital information organization by verifying the quality and efficacy of social tagging. This dissertation research combined both quantitative (statistics) and qualitative (content analysis using FRBR) approaches to vocabulary analysis of tags which provided a more complete examination of the quality of tags. Through the detailed analysis of tag properties undertaken in this dissertation, we have a clearer understanding of the extent to which social tagging can be used to replace (and in some cases to improve upon) professional indexing

Illinois Digital Environment for Access to Learning and Scholarship Repository

Human-competitive automatic topic indexing

Author: Medelyan Olena
Publication venue: The University of Waikato
Publication date: 01/01/2009
Field of study

Topic indexing is the task of identifying the main topics covered by a document. These are useful for many purposes: as subject headings in libraries, as keywords in academic publications and as tags on the web. Knowing a document's topics helps people judge its relevance quickly. However, assigning topics manually is labor intensive. This thesis shows how to generate them automatically in a way that competes with human performance. Three kinds of indexing are investigated: term assignment, a task commonly performed by librarians, who select topics from a controlled vocabulary; tagging, a popular activity of web users, who choose topics freely; and a new method of keyphrase extraction, where topics are equated to Wikipedia article names. A general two-stage algorithm is introduced that first selects candidate topics and then ranks them by significance based on their properties. These properties draw on statistical, semantic, domain-specific and encyclopedic knowledge. They are combined using a machine learning algorithm that models human indexing behavior from examples. This approach is evaluated by comparing automatically generated topics to those assigned by professional indexers, and by amateurs. We claim that the algorithm is human-competitive because it chooses topics that are as consistent with those assigned by humans as their topics are with each other. The approach is generalizable, requires little training data and applies across different domains and languages

Research Commons@Waikato

CERN Document Server

12 years on – Is the NLM medical text indexer still useful and relevant?

Author: Alan Aronson
Dina Demner-Fushman
James Mork
Publication venue: Springer Nature
Publication date: 01/01/2017
Field of study

Springer - Publisher Connector

Recommended from our members

Selecting and Categorizing Textual Descriptions of Images in the Context of an Image Indexer's Toolkit

Author: Abels Eileen
Bradley Rachael
Jenemann Laura
Klavans Judith
Passonneau Rebecca J.
Sheffield Carolyn
Yano Tae
Publication venue: 'Columbia University Libraries/Information Services'
Publication date: 01/01/2007
Field of study

We describe a series of studies aimed at identifying specifications for a text extraction module of an image indexer's toolkit. The materials used in the studies consist of images paired with paragraph sequences that describe the images. We administered a pilot survey to visual resource center professionals at three universities to determine what types of paragraphs would be preferred for metadata selection. Respondents generally showed a strong preference for one of two paragraphs they were presented with, indicating that not all paragraphs that describe images are seen as good sources of metadata. We developed a set of semantic category labels to assign to spans of text in order to distinguish between different types of information about the images, thus to classify metadata contexts. Human agreement on metadata is notoriously variable. In order to maximize agreement, we conducted four human labeling experiments using the seven semantic category labels we developed. A subset of our labelers had much higher inter-annotator reliability, and highest reliability occurs when labelers can pick two labels per text unit

Columbia University Academic Commons

Thesaurus based automatic keyphrase indexing

Author: Medelyan Olena
Witten Ian H.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/01/2006
Field of study

We propose a new method that enhances automatic keyphrase extraction by using semantic information on terms and phrases gleaned from a domain-specific thesaurus. We evaluate the results against keyphrase sets assigned by a state-of-the-art keyphrase extraction system and those assigned by six professional indexers

Crossref

Research Commons@Waikato