Search CORE

173 research outputs found

Human-competitive automatic topic indexing

Author: Medelyan Olena
Publication venue: The University of Waikato
Publication date: 01/01/2009

Topic indexing is the task of identifying the main topics covered by a document. These topics are useful for many purposes: as subject headings in libraries, as keywords in academic publications and as tags on the web. Knowing a document's topics helps people judge its relevance quickly. However, assigning topics manually is labor-intensive. This thesis shows how to generate them automatically in a way that competes with human performance. Three kinds of indexing are investigated: term assignment, a task commonly performed by librarians, who select topics from a controlled vocabulary; tagging, a popular activity of web users, who choose topics freely; and a new method of keyphrase extraction, where topics are equated to Wikipedia article names. A general two-stage algorithm is introduced that first selects candidate topics and then ranks them by significance based on their properties. These properties draw on statistical, semantic, domain-specific and encyclopedic knowledge. They are combined using a machine learning algorithm that models human indexing behavior from examples. This approach is evaluated by comparing automatically generated topics to those assigned by professional indexers, and by amateurs. We claim that the algorithm is human-competitive because it chooses topics that are as consistent with those assigned by humans as their topics are with each other. The approach is generalizable, requires little training data and applies across different domains and languages.
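
The two-stage idea in the abstract (select candidates, then rank them by their properties) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the thesis: the function names `candidate_phrases`, `rank_topics` and `consistency`, the tiny stopword list, the three features (frequency, first-occurrence position, phrase length), and the hand-set weights, which stand in for the learned model the abstract describes. The consistency function uses one common overlap measure, 2C / (A + B); the abstract does not say which measure the thesis uses.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list for the sketch.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "is", "to", "by",
             "as", "for", "on", "that", "are", "with"}


def candidate_phrases(text, max_len=3):
    """Stage 1: collect word n-grams that do not start or end with a stopword.

    Punctuation is discarded, so n-grams may span sentence boundaries.
    Returns (phrase counts, relative position of each phrase's first occurrence).
    """
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    first_pos = {}
    for i in range(len(words)):
        for n in range(1, max_len + 1):
            gram = words[i:i + n]
            if len(gram) < n:  # ran off the end of the text
                break
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            phrase = " ".join(gram)
            counts[phrase] += 1
            first_pos.setdefault(phrase, i / len(words))
    return counts, first_pos


def rank_topics(text, top_k=5, weights=(1.0, 0.5, 0.5)):
    """Stage 2: score each candidate with a weighted sum of simple features.

    The fixed weights are an assumption; a trained model would replace them.
    """
    counts, first_pos = candidate_phrases(text)
    w_freq, w_pos, w_len = weights
    scored = []
    for phrase, freq in counts.items():
        score = (w_freq * freq
                 + w_pos * (1.0 - first_pos[phrase])  # earlier is better
                 + w_len * len(phrase.split()))       # longer is more specific
        scored.append((score, phrase))
    # Highest score first; ties broken alphabetically for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [phrase for _, phrase in scored[:top_k]]


def consistency(topics_a, topics_b):
    """Overlap between two topic sets: 2 * |A & B| / (|A| + |B|)."""
    a, b = set(topics_a), set(topics_b)
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))
```

Under these assumptions, `rank_topics(document_text)` returns up to `top_k` phrases, and `consistency(auto_topics, human_topics)` gives a value between 0 and 1 that can be compared with the same measure computed between two human indexers.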

Research Commons@Waikato

CERN Document Server

lexiDB: a scalable corpus database management system

Authors: Coole Matt, Mariani John Amedeo, Rayson Paul Edward
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 05/12/2016

lexiDB is a scalable corpus database management system designed to fulfill corpus linguistics retrieval queries on multi-billion-word multiply-annotated corpora. It is based on a distributed architecture that allows the system to scale out to support ever larger text collections. This paper presents an overview of the architecture behind lexiDB as well as a demonstration of its functionality. We present lexiDB's performance metrics based on the AWS (Amazon Web Services) infrastructure with two part-of-speech and semantically tagged billion-word corpora: Historical Hansard and EEBO (Early English Books Online).

Lancaster E-Prints

Special Libraries, April 1969

Author: Special Libraries Association
Publication venue: SJSU ScholarWorks
Publication date: 01/04/1969

Volume 60, Issue 4

SJSU ScholarWorks