Recognizing Text Genres with Simple Metrics Using Discriminant Analysis
A simple method for categorizing texts into predetermined text genre
categories using the statistical standard technique of discriminant analysis is
demonstrated with application to the Brown corpus. Discriminant analysis makes
it possible to use a large number of parameters that may be specific to a certain
corpus or information stream, and combine them into a small number of
functions, with the parameters weighted on the basis of how useful they are for
discriminating text genres. An application to information retrieval is
discussed. Comment: 6 pages, LaTeX, in proceedings of COLING 9
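To make the idea of weighting parameters by their discriminating power concrete, here is a minimal sketch of a two-class Fisher linear discriminant over two stylistic metrics. It is an illustration only, not the paper's procedure: the two metrics (mean word length, mean sentence length), the genre labels, and every data value are invented assumptions, and the paper works with many more parameters and more than two genres.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def fisher_discriminant(class_a, class_b):
    """Two-class Fisher discriminant for 2-D feature vectors.

    Returns (weights, threshold). A vector x is assigned to class_a
    when dot(weights, x) > threshold.
    """
    def centroid(rows):
        return [sum(r[i] for r in rows) / len(rows) for i in range(2)]

    def scatter(rows, mu):
        s = [[0.0, 0.0], [0.0, 0.0]]
        for r in rows:
            d = [r[0] - mu[0], r[1] - mu[1]]
            for i in range(2):
                for j in range(2):
                    s[i][j] += d[i] * d[j]
        return s

    mu_a, mu_b = centroid(class_a), centroid(class_b)
    s_a, s_b = scatter(class_a, mu_a), scatter(class_b, mu_b)
    # Pooled within-class scatter matrix.
    sw = [[s_a[i][j] + s_b[i][j] for j in range(2)] for i in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [mu_a[0] - mu_b[0], mu_a[1] - mu_b[1]]
    # weights = Sw^-1 (mu_a - mu_b): each metric is weighted by how well
    # it separates the classes relative to its within-class variation.
    weights = [dot(inv[0], diff), dot(inv[1], diff)]
    threshold = 0.5 * (dot(weights, mu_a) + dot(weights, mu_b))
    return weights, threshold


def assign_genre(x, weights, threshold):
    return "a" if dot(weights, x) > threshold else "b"


# Invented toy data: (mean word length, mean sentence length in words).
press = [(4.2, 15), (4.5, 17), (4.3, 14), (4.6, 18)]        # class "a"
academic = [(5.4, 24), (5.6, 27), (5.2, 23), (5.7, 26)]     # class "b"

weights, threshold = fisher_discriminant(press, academic)
```

A new text would be scored with `assign_genre(metrics, weights, threshold)`. With more than two genres, several such discriminant functions are needed, which is what the abstract refers to as combining parameters into a small number of functions.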
Stylistic Variation in an Information Retrieval Experiment
Texts exhibit considerable stylistic variation. This paper reports an
experiment where a corpus of documents (N = 75 000) is analyzed using various
simple stylistic metrics. A subset (n = 1000) of the corpus has been previously
assessed to be relevant for answering given information retrieval queries. The
experiment shows that this subset differs significantly from the rest of the
corpus in terms of the stylistic metrics studied. Comment: Proceedings of NEMLAP-
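One way to check whether a subset differs from the rest of a corpus on a single stylistic metric is Welch's t statistic. The abstract does not say which test was used, so the choice of test here is an assumption, and the sample values below are invented for illustration.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Welch's t statistic for two samples with possibly unequal variances.

    Each sample needs at least two values, and the samples must not
    both have zero variance.
    """
    se_a = variance(sample_a) / len(sample_a)
    se_b = variance(sample_b) / len(sample_b)
    return (mean(sample_a) - mean(sample_b)) / sqrt(se_a + se_b)


# Invented toy values of one metric (e.g. mean sentence length per document).
relevant_subset = [21.0, 24.5, 23.0, 26.0, 22.5]
rest_of_corpus = [17.0, 19.5, 18.0, 20.0, 16.5, 18.5]

t_value = welch_t(relevant_subset, rest_of_corpus)
```

A large absolute value of `t_value` suggests that the two groups differ on that metric. Turning it into a p-value also requires the Welch-Satterthwaite degrees of freedom, which are omitted here.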
Assessed Relevance and Stylistic Variation
Texts exhibit considerable stylistic variation. This paper reports an
experiment where a large corpus of documents is analyzed using various
simple stylistic metrics. A subset of the corpus has been previously
assessed to be relevant for answering given information retrieval
queries. The experiment shows that this subset differs significantly from
the rest of the corpus in terms of the stylistic metrics studied.
Variation of word frequencies across genre classification tasks
This paper examines automated genre classification of text documents and its role in enabling the effective management of digital documents by digital libraries and other repositories. Genre classification, which narrows down the possible structure of a document, is a valuable step in
realising the general automatic extraction of semantic metadata essential to the efficient management and use of digital objects. In the present report, we present an analysis of word frequencies in different genre classes in an effort to understand the distinction between independent classification tasks. In particular, we examine automated experiments on thirty-one genre classes to determine the relationship between the word frequency metrics and the degree of its significance in carrying out classification in varying environments.
Examining Variations of Prominent Features in Genre Classification.
This paper investigates the correlation between features of three types (visual, stylistic and topical) and genre classes. The majority of previous studies in automated genre classification have created models based on an amalgamated representation of a document using a combination of features. In these models, the inseparable roles of different features make it difficult to determine a means of improving the classifier when it exhibits poor performance in detecting selected genres. In this paper we use classifiers independently modeled on three groups of features to examine six genre classes and show that the strongest features for making one classification are not necessarily the best features for carrying out another classification.
Thumbs up? Sentiment Classification using Machine Learning Techniques
We consider the problem of classifying documents not by topic, but by overall
sentiment, e.g., determining whether a review is positive or negative. Using
movie reviews as data, we find that standard machine learning techniques
definitively outperform human-produced baselines. However, the three machine
learning methods we employed (Naive Bayes, maximum entropy classification, and
support vector machines) do not perform as well on sentiment classification as
on traditional topic-based categorization. We conclude by examining factors
that make the sentiment classification problem more challenging. Comment: To appear in EMNLP-200
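As a rough illustration of one of the three methods named above, the following is a minimal multinomial Naive Bayes sketch with add-one smoothing. It is not the authors' implementation: the whitespace tokenisation, the function names, and the four toy reviews are assumptions made for this example.

```python
import math
from collections import Counter


def train_naive_bayes(labeled_docs):
    """labeled_docs: iterable of (tokens, label) pairs."""
    class_counts = Counter()
    word_counts = {}
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_sentiment(model, tokens):
    """Return the label with the highest log posterior score."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        counts = word_counts[label]
        denom = sum(counts.values()) + len(vocab)  # add-one smoothing
        score = math.log(n_docs / total_docs)      # log prior
        for tok in tokens:
            if tok in vocab:                       # skip unseen words
                score += math.log((counts[tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented toy reviews, tokenised by whitespace.
training = [
    ("great fun film".split(), "pos"),
    ("wonderful great acting".split(), "pos"),
    ("dull boring film".split(), "neg"),
    ("terrible boring plot".split(), "neg"),
]
model = train_naive_bayes(training)
```

Calling `predict_sentiment(model, "great acting".split())` scores each label by its log prior plus the smoothed log likelihood of every known token and returns the higher-scoring label.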
Formulating representative features with respect to document genre classification
Genre classification (e.g. whether a document
is a scientific article or magazine article) is closely
bound to the physical and conceptual structure of a document
as well as the level of depth involved in the text.
Hence, it provides a means of ranking documents retrieved
by search tools according to metrics other than
topical similarity. Moreover, the structural information
derived from genre classification can be used to locate
target information within the text. In previous studies,
the detection of genre classes has been attempted
by using some normalised frequency of terms or combinations
of terms in the document (here, we use
"term" to refer to words, phrases, syntactic
units, sentences and paragraphs, as well as other patterns
derived from deeper linguistic or semantic analysis).
These approaches largely neglect how the term is
distributed throughout the document. Here, we report
the results of automated experiments based on distributive
statistics of words in order to present evidence that
the term distribution pattern is a better indicator of genre
class than term frequency.
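The contrast between term frequency and term distribution can be sketched as follows. The abstract does not specify which distributive statistics were used, so the two chosen here (mean and standard deviation of normalised occurrence positions) are assumptions, as are the function name and the toy documents.

```python
from statistics import mean, pstdev


def term_profile(tokens, term):
    """Frequency and simple positional statistics for one term.

    Positions are normalised to [0, 1] over the length of the document.
    """
    if len(tokens) < 2 or term not in tokens:
        raise ValueError("need at least two tokens and one occurrence of term")
    last = len(tokens) - 1
    positions = [i / last for i, tok in enumerate(tokens) if tok == term]
    return {
        "frequency": len(positions) / len(tokens),
        "mean_position": mean(positions),
        "spread": pstdev(positions),
    }


# Two invented documents with the same term frequency for "method":
# in the first it is clustered at the start, in the second it appears
# at both ends.
clustered = ["method", "method", "a", "b", "c", "d", "e", "f", "g"]
dispersed = ["method", "a", "b", "c", "d", "e", "f", "g", "method"]

profile_clustered = term_profile(clustered, "method")
profile_dispersed = term_profile(dispersed, "method")
```

Both documents give the same `frequency`, but `spread` differs (0.0625 versus 0.5), so a frequency-only representation cannot tell them apart while a distribution-based one can.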
Feature Type Analysis in Automated Genre Classification
In this paper, we compare classifiers based on language model, image, and stylistic features for automated genre classification. The majority of previous studies in genre classification have created models based on an amalgamated representation of a document using a multitude of features. In these models, the inseparable roles of different features make it difficult to determine a means of improving the classifier when it exhibits poor performance in detecting selected genres. By independently modeling and comparing classifiers based on features belonging to three types, describing visual, stylistic, and topical properties, we demonstrate that different genres have distinctive feature strengths.