Automating Metadata Extraction: Genre Classification
A problem that frequently arises in the management and integration of scientific data is the lack of context and semantics that would link data encoded in disparate ways. To bridge the discrepancy, it often helps to mine scientific texts to aid the understanding of the database. Mining relevant text can be significantly aided by the availability of descriptive and semantic metadata. The Digital Curation Centre (DCC) has undertaken research to automate the extraction of metadata from documents in PDF ([22]). Documents may include scientific journal papers, lab notes or even emails. We suggest genre classification as a first step toward automating metadata extraction. The classification method will be built on looking at the documents from five directions: as an object of specific visual format, a layout of strings with characteristic grammar, an object with stylometric signatures, an object with meaning and purpose, and an object linked to previously classified objects and external sources. Some results of experiments in relation to the first two directions are described here; they are meant to be indicative of the promise underlying this multi-faceted approach.
Detecting Family Resemblance: Automated Genre Classification.
This paper presents results in automated genre classification of digital documents in PDF format. It describes genre classification as an important ingredient in contextualising scientific data and in retrieving targeted material for improving research. The current paper compares the role of visual layout, stylistic features and language model features in clustering documents and presents results in retrieving five selected genres (Scientific Article, Thesis, Periodicals, Business Report, and Form) from a pool of materials populated with documents of the nineteen most popular genres found in our experimental data set.
Searching for Ground Truth: a stepping stone in automating genre classification
This paper examines genre classification of documents and its role in enabling the effective automated management of digital documents by digital libraries and other repositories. We have previously presented genre classification as a valuable step toward achieving automated extraction of descriptive metadata for digital material. Here, we present results from experiments using human labellers, conducted to assist in genre characterisation and the prediction of obstacles which need to be overcome by an automated system, and to contribute to the process of creating a solid testbed corpus for extending automated genre classification and testing metadata extraction tools across genres. We also describe the performance of two classifiers based on image and stylistic modeling features in labelling the data resulting from the agreement of three human labellers across fifteen genre classes.
Thumbs up? Sentiment Classification using Machine Learning Techniques
We consider the problem of classifying documents not by topic, but by overall
sentiment, e.g., determining whether a review is positive or negative. Using
movie reviews as data, we find that standard machine learning techniques
definitively outperform human-produced baselines. However, the three machine
learning methods we employed (Naive Bayes, maximum entropy classification, and
support vector machines) do not perform as well on sentiment classification as
on traditional topic-based categorization. We conclude by examining factors
that make the sentiment classification problem more challenging.
Comment: To appear in EMNLP-200
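As a rough illustration of one of the three methods named in this abstract, the sketch below implements a multinomial Naive Bayes classifier with add-one smoothing over bag-of-words tokens. It is a minimal sketch, not the authors' implementation: the function names `train_nb` and `classify_nb` and the four toy reviews are assumptions made up for this example, and the abstract does not specify the features or smoothing actually used.

```python
import math
from collections import Counter

def train_nb(docs):
    """Collect counts from a list of (tokens, label) pairs (hypothetical helper)."""
    label_counts = Counter(label for _, label in docs)
    word_counts = {label: Counter() for label in label_counts}
    for tokens, label in docs:
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return label_counts, word_counts, vocab

def classify_nb(tokens, model):
    """Return the label with the highest smoothed log-probability."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(label_counts):
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_docs)  # log prior
        for w in tokens:
            if w in vocab:  # words never seen in training are ignored
                # add-one (Laplace) smoothing over the vocabulary
                score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy data invented for illustration only; not the movie-review data from the paper.
TOY_REVIEWS = [
    ("a great and moving film".split(), "pos"),
    ("great acting and a wonderful script".split(), "pos"),
    ("a dull and boring film".split(), "neg"),
    ("boring script and terrible acting".split(), "neg"),
]

model = train_nb(TOY_REVIEWS)
prediction = classify_nb("a wonderful film".split(), model)
print(prediction)
```

On a corpus this small the classifier simply follows the few sentiment-bearing words it has seen; the abstract's point is that on real reviews such word-count evidence is less reliable for sentiment than for topic.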
Examining Variations of Prominent Features in Genre Classification.
This paper investigates the correlation between features of three types (visual, stylistic and topical) and genre classes. The majority of previous studies in automated genre classification have created models based on an amalgamated representation of a document using a combination of features. In these models, the inseparable roles of different features make it difficult to determine a means of improving the classifier when it exhibits poor performance in detecting selected genres. In this paper we use classifiers independently modeled on three groups of features to examine six genre classes and show that the strongest features for making one classification are not necessarily the best features for carrying out another.
Variation of word frequencies across genre classification tasks
This paper examines automated genre classification of text documents and its role in enabling the effective management of digital documents by digital libraries and other repositories. Genre classification, which narrows down the possible structure of a document, is a valuable step in realising the general automatic extraction of semantic metadata essential to the efficient management and use of digital objects. In the present report, we present an analysis of word frequencies in different genre classes in an effort to understand the distinction between independent classification tasks. In particular, we examine automated experiments on thirty-one genre classes to determine the relationship between the word frequency metrics and the degree of their significance in carrying out classification in varying environments.
Feature Type Analysis in Automated Genre Classification
In this paper, we compare classifiers based on language model, image, and stylistic features for automated genre classification. The majority of previous studies in genre classification have created models based on an amalgamated representation of a document using a multitude of features. In these models, the inseparable roles of different features make it difficult to determine a means of improving the classifier when it exhibits poor performance in detecting selected genres. By independently modeling and comparing classifiers based on features belonging to three types, describing visual, stylistic, and topical properties, we demonstrate that different genres have distinctive feature strengths.
Formulating representative features with respect to document genre classification
Genre classification (e.g. whether a document is a scientific article or magazine article) is closely bound to the physical and conceptual structure of a document as well as the level of depth involved in the text. Hence, it provides a means of ranking documents retrieved by search tools according to metrics other than topical similarity. Moreover, the structural information derived from genre classification can be used to locate target information within the text. In previous studies, the detection of genre classes has been attempted by using some normalised frequency of terms or combinations of terms in the document (here, we use "term" to refer to words, phrases, syntactic units, sentences and paragraphs, as well as other patterns derived from deeper linguistic or semantic analysis). These approaches largely neglect how the term is distributed throughout the document. Here, we report the results of automated experiments based on distributive statistics of words in order to present evidence that the term distribution pattern is a better indicator of genre class than term frequency.
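The distinction drawn in this abstract between how often a term occurs and how it is distributed through a document can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not say which distributive statistics were used, so the mean and standard deviation of relative positions stand in for them here, and `term_frequency`, `term_spread` and the two toy documents are hypothetical.

```python
from statistics import mean, pstdev

def term_frequency(tokens, term):
    """Normalised frequency: occurrences of the term divided by document length."""
    return tokens.count(term) / len(tokens)

def term_spread(tokens, term):
    """Mean and standard deviation of the term's relative positions in [0, 1].

    Returns None when the term is absent or the document has fewer than two tokens.
    """
    n = len(tokens)
    if n < 2:
        return None
    positions = [i / (n - 1) for i, t in enumerate(tokens) if t == term]
    if not positions:
        return None
    return mean(positions), pstdev(positions)

# Two invented documents with the same length and the same count of "method".
clustered = "method method method we then report the final results".split()
scattered = "method we then method report the final results method".split()

tf_clustered = term_frequency(clustered, "method")
tf_scattered = term_frequency(scattered, "method")
spread_clustered = term_spread(clustered, "method")
spread_scattered = term_spread(scattered, "method")

print(tf_clustered, tf_scattered)          # identical frequencies
print(spread_clustered, spread_scattered)  # different positional statistics
```

Both documents have the same normalised frequency for "method", so a frequency-only feature cannot separate them, while the positional statistics differ. That difference is the kind of signal a distribution-based feature could expose.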
Refining the use of the web (and web search) as a language teaching and learning resource
The web is a potentially useful corpus for language study because it provides examples of language that are contextualized and authentic, and is large and easily searchable. However, web contents are heterogeneous in the extreme, uncontrolled and hence 'dirty,' and exhibit features different from the written and spoken texts in other linguistic corpora. This article explores the use of the web and web search as a resource for language teaching and learning. We describe how a particular derived corpus containing a trillion word tokens in the form of n-grams has been filtered by word lists and syntactic constraints and used to create three digital library collections, linked with other corpora and the live web, that exploit the affordances of web text and mitigate some of its constraints.