Building a document genre corpus: a profile of the KRYS I corpus

Author: Berninger Ms Vera
Kim Dr Yunhyong
Ross Seamus
Publication date: 01/01/2008

This paper describes the KRYS I corpus, consisting of documents classified into 70 genre classes. It has been constructed as part of an effort to automate document genre classification as distinct from topic detection. Previously there has been very little work on building corpora of texts classified using a non-topical genre palette. This is partly because genre, as a concept, is rooted in philosophy, rhetoric and literature, and is highly complex and domain dependent in its interpretation ([11]). The usefulness of genre in everyday information search is only now starting to be recognised, and no genre classification schema has yet been consolidated to have applicable value in this direction. By presenting our experiences in constructing the KRYS I corpus, we hope to shed light on information gathering and seeking behaviour and the role of genre in these activities, as well as on a way forward for creating a better corpus for testing automated genre classification tasks and applying these tasks to other domains.

Building a Document Genre Corpus: a Profile of the KRYS I Corpus

Author: Berninger Vera
Kim Yunhyong
Ross Seamus
Publication date: 18/10/2008

This paper describes the KRYS I corpus (http://www.krys-corpus.eu/Info.html), consisting of documents classified into 70 genre classes. It has been constructed as part of an effort to automate document genre classification as distinct from topic detection. Previously there has been very little work on building corpora of texts classified using a non-topical genre palette. This is partly because genre, as a concept, is rooted in philosophy, rhetoric and literature, and is highly complex and domain dependent in its interpretation ([11]). The usefulness of genre in everyday information search is only now starting to be recognised, and no genre classification schema has yet been consolidated to have applicable value in this direction. By presenting our experiences in constructing the KRYS I corpus, we hope to shed light on information gathering and seeking behaviour and the role of genre in these activities, as well as on a way forward for creating a better corpus for testing automated genre classification tasks and applying these tasks to other domains.

Crossref

Enlighten

Formulating representative features with respect to document genre classification

Author: Kim Dr Yunhyong
Ross Seamus
Publication date: 01/01/2008

Genre classification (e.g. whether a document is a scientific article or a magazine article) is closely bound to the physical and conceptual structure of a document as well as to the level of depth involved in the text. Hence, it provides a means of ranking documents retrieved by search tools according to metrics other than topical similarity. Moreover, the structural information derived from genre classification can be used to locate target information within the text. In previous studies, the detection of genre classes has been attempted using some normalised frequency of terms or combinations of terms in the document (here, "term" refers to words, phrases, syntactic units, sentences and paragraphs, as well as other patterns derived from deeper linguistic or semantic analysis). These approaches largely neglect how the term is distributed throughout the document. Here, we report the results of automated experiments based on distributive statistics of words in order to present evidence that the term distribution pattern is a better indicator of genre class than term frequency.
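The contrast between term frequency and term distribution can be illustrated with a minimal sketch. The abstract does not say which distributive statistics the authors used, so the spread measure below (standard deviation of a term's relative positions) is only an assumed stand-in, and the function names `normalised_frequency` and `positional_spread` are hypothetical.

```python
from statistics import pstdev


def normalised_frequency(tokens, term):
    """Occurrences of `term` divided by the document length."""
    return tokens.count(term) / len(tokens) if tokens else 0.0


def positional_spread(tokens, term):
    """Population standard deviation of the term's relative positions (0..1).

    This is an assumed, illustrative distribution statistic, not the
    measure used in the paper.
    """
    n = len(tokens)
    if n < 2:
        return 0.0
    positions = [i / (n - 1) for i, tok in enumerate(tokens) if tok == term]
    return pstdev(positions) if len(positions) > 1 else 0.0


# Two toy documents with the same frequency of "data" but different placement.
doc_clustered = ["data", "data", "x", "x", "x", "x"]
doc_spread = ["data", "x", "x", "x", "x", "data"]

freq_clustered = normalised_frequency(doc_clustered, "data")
freq_spread = normalised_frequency(doc_spread, "data")
spread_clustered = positional_spread(doc_clustered, "data")
spread_spread = positional_spread(doc_spread, "data")
```

Both toy documents have the same normalised frequency for "data", so a frequency-only feature cannot separate them, while the positional spread differs between the two.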

Detecting Family Resemblance: Automated Genre Classification.

Author: Kim Dr Yunhyong
Ross Seamus
Publication date: 01/01/2006

This paper presents results in automated genre classification of digital documents in PDF format. It describes genre classification as an important ingredient in contextualising scientific data and in retrieving targeted material for improving research. The paper compares the roles of visual layout, stylistic features and language model features in clustering documents, and presents results in retrieving five selected genres (Scientific Article, Thesis, Periodicals, Business Report, and Form) from a pool of materials populated with documents of the nineteen most popular genres found in our experimental data set.

Crossref

Directory of Open Access Journals

Enlighten

Designing an automated prototype tool for preservation quality metadata extraction for ingest into digital repository

Author: Dobreva M.
Kim Y.
Ross S.
Publication venue: 'IOS Press'
Publication date: 01/01/2008

We present a viable framework for the automated extraction of preservation quality metadata, adjusted to meet the needs of ingest into digital repositories. It has three distinctive features: wide coverage, specialisation and emphasis on quality. Wide coverage is achieved through the use of a distributed system of tool repositories, which helps to implement it over a broad range of document object types. Specialisation is maintained by selecting the most appropriate metadata extraction tool for each case, based on the identification of the digital object genre. Quality is sustained by introducing control points at selected stages of the system's workflow. The integration of these three features as components in the ingest of material into digital repositories is a defining step forward in the current quest for improved management of digital resources.

Portsmouth University Research Portal (Pure)

Enlighten

Functional genre in Illinois State Government digital documents

Author: Jackson Larry S.
Publication date: 11/05/2010

Provisions for collecting or archiving digital documents can be informed by knowledge of the genres of documents likely to be encountered. Although different aspects of collecting and curation may classify documents into genres based on differing criteria (e.g., size, file format, subject), this document addresses classification based on the functional role the document plays in state government, akin to (Toms, 2001), but here specifically Illinois State Government (ISG). The classifications listed herein are based on an overview of ISG digital documents, encountered in over nine years of gathering and archiving work with and for the Illinois State Library (ISL), and on discussions with practitioners in cataloging and in government documents librarianship. This report states definitions and includes examples of each such genre. State government documents are interesting in this regard in that they are presumably somewhat comparable to both federal government documents and business documents. Perhaps surprisingly, there are also portions of the State Web that are somewhat less than businesslike, either in tone or in technological proficiency of implementation. In this respect state government digital documents may also be useful approximations to documents produced either personally or by small activities. Having a list of government document genres can inform work in information promulgation (e.g., through website design, or the design of a series of printed materials), and the grouping of documents for digital library or archival purposes. Library of Congress / NDIIPP-2 A6075; unpublished; not peer reviewed.

Illinois Digital Environment for Access to Learning and Scholarship Repository

Variation of word frequencies across genre classification tasks

Author: Kim Y.
Ross S.
Publication venue: GEIE-ERCIM
Publication date: 01/01/2007

This paper examines automated genre classification of text documents and its role in enabling the effective management of digital documents by digital libraries and other repositories. Genre classification, which narrows down the possible structure of a document, is a valuable step in realising the general automatic extraction of semantic metadata essential to the efficient management and use of digital objects. In the present report, we present an analysis of word frequencies in different genre classes in an effort to understand the distinction between independent classification tasks. In particular, we examine automated experiments on thirty-one genre classes to determine the relationship between the word frequency metrics and the degree of its significance in carrying out classification in varying environments.

Enlighten

Searching for Ground Truth: a stepping stone in automating genre classification

Author: A. Finn
D. Biber
G. Giuffrida
H.I. Witten
J. Karlgren
L. Breiman
M.P. Marcus
S.W. Ke
Y. Kim
Y. Kim
Publication date: 01/01/2007

This paper examines genre classification of documents and its role in enabling the effective automated management of digital documents by digital libraries and other repositories. We have previously presented genre classification as a valuable step toward achieving automated extraction of descriptive metadata for digital material. Here, we present results from experiments using human labellers, conducted to assist in genre characterisation and the prediction of obstacles which need to be overcome by an automated system, and to contribute to the process of creating a solid testbed corpus for extending automated genre classification and testing metadata extraction tools across genres. We also describe the performance of two classifiers based on image and stylistic modeling features in labelling the data resulting from the agreement of three human labellers across fifteen genre classes.
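As a rough illustration of deriving evaluation data from the agreement of three human labellers, the sketch below keeps only documents on which enough labellers chose the same genre. The abstract does not state whether agreement means unanimity or a majority, so the `min_votes` threshold, the function name `agreed_labels` and the sample data are all assumptions made for illustration.

```python
from collections import Counter


def agreed_labels(labels_by_doc, min_votes=3):
    """Return {doc_id: genre} for documents where at least `min_votes`
    labellers assigned the same genre.

    `min_votes=3` means unanimous agreement among three labellers; this
    threshold is an assumption, not a detail taken from the paper.
    """
    agreed = {}
    for doc_id, labels in labels_by_doc.items():
        if not labels:
            continue  # no labels recorded for this document
        genre, votes = Counter(labels).most_common(1)[0]
        if votes >= min_votes:
            agreed[doc_id] = genre
    return agreed


# Hypothetical labels from three labellers for two documents.
sample = {
    "doc1": ["Thesis", "Thesis", "Thesis"],
    "doc2": ["Form", "Thesis", "Form"],
}

unanimous = agreed_labels(sample)              # all three labellers agree
majority = agreed_labels(sample, min_votes=2)  # at least two labellers agree
```

Lowering `min_votes` enlarges the labelled set at the cost of including documents whose genre the labellers disputed.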

Crossref

Enlighten

Feature Type Analysis in Automated Genre Classification

Author: Kim Dr Yunhyong
Ross Seamus
Publication date: 01/01/2007

In this paper, we compare classifiers based on language model, image, and stylistic features for automated genre classification. The majority of previous studies in genre classification have created models based on an amalgamated representation of a document using a multitude of features. In these models, the inseparable roles of different features make it difficult to determine a means of improving the classifier when it exhibits poor performance in detecting selected genres. By independently modeling and comparing classifiers based on features belonging to three types, describing visual, stylistic, and topical properties, we demonstrate that different genres have distinctive feature strengths.

Document Style Recognition Using Shallow Statistical Analysis

Author: Braslavski P.
Браславский П. И.
Publication date: 01/01/2004

Institutional repository of Ural Federal University named after the first President of Russia B.N.Yeltsin