109 research outputs found

Measuring inter-indexer consistency using a thesaurus

Author: Medelyan Olena
Witten Ian H.
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2006

When professional indexers independently assign terms to a given document, the term sets generally differ between indexers. Studies of inter-indexer consistency measure the percentage of matching index terms, but none of them consider the semantic relationships that exist amongst these terms. We propose to represent multiple-indexers data in a vector space and use the cosine metric as a new consistency measure that can be extended by semantic relations between index terms. We believe that this new measure is more accurate and realistic than existing ones and therefore more suitable for evaluation of automatically extracted index terms.

Crossref

Research Commons@Waikato

Thesaurus-based index term extraction for agricultural documents

Author: Medelyan Olena
Witten Ian H.
Publication venue: EFITA/WICCA
Publication date: 01/01/2005

This paper describes a new algorithm for automatically extracting index terms from documents relating to the domain of agriculture. The domain-specific Agrovoc thesaurus developed by the FAO is used both as a controlled vocabulary and as a knowledge base for semantic matching. The automatically assigned terms are evaluated against a manually indexed 200-item sample of the FAO’s document repository, and the performance of the new algorithm is compared with a state-of-the-art system for keyphrase extraction.

CiteSeerX

Research Commons@Waikato

Thesaurus based automatic keyphrase indexing

Author: Medelyan Olena
Witten Ian H.
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2006

We propose a new method that enhances automatic keyphrase extraction by using semantic information on terms and phrases gleaned from a domain-specific thesaurus. We evaluate the results against keyphrase sets assigned by a state-of-the-art keyphrase extraction system and those assigned by six professional indexers.

Crossref

Research Commons@Waikato

The Art of a Loan: “When the Loan Sharks Meet Damien Hirst’s ‘$12-Million Stuffed Shark’”

Author: Medelyan Valerie
Publication venue: DigitalCommons@Pace
Publication date: 05/06/2015

Part I of this Article introduces the reader to the typical types of loans that banks make, includes an in-depth description of a secured loan, and finishes with a discussion of the due diligence requirements of banks. Part II identifies the unique complexities posed by art when it is used as collateral, comparing and contrasting the banks’ process when approving a loan secured by commonly-used assets versus a loan secured by art. Part III discusses the banks’ growing willingness to approve art-backed loans, and identifies the safeguards built into such deals. Part IV introduces the sub-prime lenders of the art market, discussing pawn shop regulations and loans made by “luxury pawn shops” and “art dealers.” Part V compares and contrasts bank loans and “art lender” loans with an emphasis on defaulting borrowers. Part VI discusses the effects of art-backed loans in general, predicting that such practices may lead to a significant drop in the price of art in the market, placing more works in private collections, and thereby decreasing the amount of art available for viewing to the general public. Finally, Part VII briefly concludes.

DigitalCommons@Pace

Mining Domain-Specific Thesauri from Wikipedia: A case study

Author: Medelyan Olena
Milne David N.
Witten Ian H.
Publication venue: Institute of Electrical and Electronics Engineers (IEEE)
Publication date: 01/01/2006

Domain-specific thesauri are high-cost, high-maintenance, high-value knowledge structures. We show how the classic thesaurus structure of terms and links can be mined automatically from Wikipedia. In a comparison with a professional thesaurus for agriculture we find that Wikipedia contains a substantial proportion of its concepts and semantic relations; furthermore it has impressive coverage of contemporary documents in the domain. Thesauri derived using our techniques capitalize on existing public efforts and tend to reflect contemporary language usage better than their costly, painstakingly-constructed manual counterparts.

CiteSeerX

Crossref

Research Commons@Waikato

Subject metadata support powered by Maui

Author: Medelyan Olena
Perrone Vye
Witten Ian H.
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/01/2010

Selecting subject headings and keywords is a chore for all metadata editors, who often leave these fields blank or incomplete, even when there are no guidelines and any word or phrase can be chosen. For example, tags are absent from the vast majority of citations in the social scholarly reference repository CiteULike. Libraries employ professional cataloguers and indexers to ensure consistent subject metadata in their records. Because this task is time-consuming, professionals and volunteers alike would welcome high-quality automatically generated suggestions for the main topics of a document.

Research Commons@Waikato

Human-competitive automatic topic indexing

Author: Medelyan Olena
Publication venue: The University of Waikato
Publication date: 01/01/2009

Topic indexing is the task of identifying the main topics covered by a document. These are useful for many purposes: as subject headings in libraries, as keywords in academic publications and as tags on the web. Knowing a document's topics helps people judge its relevance quickly. However, assigning topics manually is labor intensive. This thesis shows how to generate them automatically in a way that competes with human performance. Three kinds of indexing are investigated: term assignment, a task commonly performed by librarians, who select topics from a controlled vocabulary; tagging, a popular activity of web users, who choose topics freely; and a new method of keyphrase extraction, where topics are equated to Wikipedia article names. A general two-stage algorithm is introduced that first selects candidate topics and then ranks them by significance based on their properties. These properties draw on statistical, semantic, domain-specific and encyclopedic knowledge. They are combined using a machine learning algorithm that models human indexing behavior from examples. This approach is evaluated by comparing automatically generated topics to those assigned by professional indexers, and by amateurs. We claim that the algorithm is human-competitive because it chooses topics that are as consistent with those assigned by humans as their topics are with each other. The approach is generalizable, requires little training data and applies across different domains and languages.

Research Commons@Waikato

CERN Document Server

Mining Meaning from Wikipedia

Author: Legg Catherine
Medelyan Olena
Milne David
Witten Ian H.
Publication date: 01/01/2008

Wikipedia is a goldmine of information; not just for its many readers, but also for the growing community of researchers who recognize it as a resource of exceptional scale and utility. It represents a vast investment of manual effort and judgment: a huge, constantly evolving tapestry of concepts and relations that is being applied to a host of tasks. This article provides a comprehensive description of this work. It focuses on research that extracts and makes use of the concepts, relations, facts and descriptions found in Wikipedia, and organizes the work into four broad categories: applying Wikipedia to natural language processing; using it to facilitate information retrieval and information extraction; and as a resource for ontology building. The article addresses how Wikipedia is being used as is, how it is being improved and adapted, and how it is being combined with other structures to create entirely new resources. We identify the research groups and individuals involved, and how their work has developed in the last few years. We provide a comprehensive list of the open-source software they have produced. Comment: An extensive survey of re-using information in Wikipedia in natural language processing, information retrieval and extraction and ontology building. Accepted for publication in International Journal of Human-Computer Studies.

arXiv.org e-Print Archive

CiteSeerX

Deakin Research Online

Research Commons@Waikato

Consensus-based Approach for Keyword Extraction from Urban Events Collections

Author: Brill
Escovedo
Hulth
Kim
Laerty
Li
Medelyan
Mihalcea
Sutton
Timonen
Wan
Wang
Zhang
Publication venue: Ediciones Universidad de Salamanca (España)
Publication date: 10/05/2015

Automatic keyword extraction (AKE) from textual sources took a valuable step towards harnessing the problem of efficient scanning of large document collections. Particularly in the context of urban mobility, where the most relevant events in the city are advertised on-line, it becomes difficult to know exactly what is happening in a place. In this paper we tackle this problem by extracting a set of keywords from different kinds of textual sources, focusing on the urban events context. We propose an ensemble of automatic keyword extraction systems: KEA (Key-phrase Extraction Algorithm), KUSCO (Knowledge Unsupervised Search for instantiating Concepts on lightweight Ontologies), and Conditional Random Fields (CRF). Unlike KEA and KUSCO, which are well-known tools for automatic keyword extraction, CRF needs further pre-processing. Therefore, a tool for handling AKE from the documents using CRF is developed. The architecture for the AKE ensemble system is designed, and an efficient integration of the component applications is presented in which a consensus between such classifiers is achieved. Finally, we empirically show that our AKE ensemble system significantly succeeds on baseline sources and urban events collections.

Crossref

Directory of Open Access Journals

Gestion del Repositorio Documental de la Universidad de Salamanca