294,043 research outputs found
Mistake-Driven Learning in Text Categorization
Learning problems in the text processing domain often map the text to a space
whose dimensions are the measured features of the text, e.g., its words. Three
characteristic properties of this domain are (a) very high dimensionality, (b)
both the learned concepts and the instances reside very sparsely in the feature
space, and (c) a high variation in the number of active features in an
instance. In this work we study three mistake-driven learning algorithms for a
typical task of this nature -- text categorization. We argue that these
algorithms -- which categorize documents by learning a linear separator in the
feature space -- have a few properties that make them ideal for this domain. We
then show that a quantum leap in performance is achieved when we further modify
the algorithms to better address some of the specific characteristics of the
domain. In particular, we demonstrate (1) how variation in document length can
be tolerated by either normalizing feature weights or by using negative
weights, (2) the positive effect of applying a threshold range in training, (3)
alternatives in considering feature frequency, and (4) the benefits of
discarding features while training. Overall, we present an algorithm, a
variation of Littlestone's Winnow, which performs significantly better than any
other algorithm tested on this task using a similar feature set.Comment: 9 pages, uses aclap.st
Neural Discourse Structure for Text Categorization
We show that discourse structure, as defined by Rhetorical Structure Theory
and provided by an existing discourse parser, benefits text categorization. Our
approach uses a recursive neural network and a newly proposed attention
mechanism to compute a representation of the text that focuses on salient
content, from the perspective of both RST and the task. Experiments consider
variants of the approach and illustrate its strengths and weaknesses.Comment: ACL 2017 camera ready versio
Using bag-of-concepts to improve the performance of support vector machines in text categorization
This paper investigates the use of concept-based representations for text categorization. We introduce a new approach to create concept-based text representations, and apply it to a standard text categorization collection. The representations are used as input to a Support Vector Machine classifier, and the results show that there are certain categories for which concept-based representations constitute a viable supplement to word-based ones. We also demonstrate how the performance of the Support Vector Machine can be improved by combining representations
Non-Standard Words as Features for Text Categorization
This paper presents categorization of Croatian texts using Non-Standard Words
(NSW) as features. Non-Standard Words are: numbers, dates, acronyms,
abbreviations, currency, etc. NSWs in Croatian language are determined
according to Croatian NSW taxonomy. For the purpose of this research, 390 text
documents were collected and formed the SKIPEZ collection with 6 classes:
official, literary, informative, popular, educational and scientific. Text
categorization experiment was conducted on three different representations of
the SKIPEZ collection: in the first representation, the frequencies of NSWs are
used as features; in the second representation, the statistic measures of NSWs
(variance, coefficient of variation, standard deviation, etc.) are used as
features; while the third representation combines the first two feature sets.
Naive Bayes, CN2, C4.5, kNN, Classification Trees and Random Forest algorithms
were used in text categorization experiments. The best categorization results
are achieved using the first feature set (NSW frequencies) with the
categorization accuracy of 87%. This suggests that the NSWs should be
considered as features in highly inflectional languages, such as Croatian. NSW
based features reduce the dimensionality of the feature space without standard
lemmatization procedures, and therefore the bag-of-NSWs should be considered
for further Croatian texts categorization experiments.Comment: IEEE 37th International Convention on Information and Communication
Technology, Electronics and Microelectronics (MIPRO 2014), pp. 1415-1419,
201
- …
