Text Data Mining: Theory and Methods
This paper provides the reader with a very brief introduction to some of the
theory and methods of text data mining. The intent of this article is to
introduce the reader to some of the current methodologies that are employed
within this discipline area while at the same time making the reader aware of
some of the interesting challenges that remain to be solved within the area.
Finally, the article serves as a very rudimentary tutorial on some of the
techniques while also providing the reader with a list of references for
additional study.
Comment: Published at http://dx.doi.org/10.1214/07-SS016 in the Statistics
Surveys (http://www.i-journals.org/ss/) by the Institute of Mathematical
Statistics (http://www.imstat.org)
Effectiveness of document representation for classification
Conventionally, document classification research focuses on improving the learning capabilities of classifiers. Nevertheless, according to our observation, the effectiveness of classification is limited by the suitability of the document representation. Intuitively, the more features that are used in a representation, the more comprehensively documents are represented. However, if a representation contains too many irrelevant features, the classifier suffers not only from the curse of dimensionality but also from overfitting. To address this problem of the suitability of document representations, we present a classifier-independent approach to measuring the effectiveness of document representations. Our approach utilises a labelled document corpus to estimate the distribution of documents in the feature space. By examining documents in this way, we can clearly identify the contributions made by different features toward document classification. Experiments have been performed to show how the effectiveness is evaluated. Our approach can be used as a tool to assist feature selection, dimensionality reduction and document classification.
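The abstract does not say which measure the authors use, so the following is only an illustrative sketch of what a classifier-independent score could look like, not the paper's method. It assumes documents are already represented as equal-length numeric feature vectors (for example, term counts) and swaps in a Fisher-style ratio of between-class to within-class variance per feature, estimated from a labelled corpus. The function name `fisher_scores` and the toy corpus are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pvariance


def fisher_scores(docs, labels):
    """Score each feature by between-class / within-class variance.

    Assumption: this Fisher-style ratio stands in for the unspecified
    measure in the abstract. No classifier is trained at any point.
    docs   -- list of equal-length numeric feature vectors
    labels -- one class label per document
    """
    # Group the labelled documents by class.
    by_class = defaultdict(list)
    for vec, label in zip(docs, labels):
        by_class[label].append(vec)

    scores = []
    for j in range(len(docs[0])):
        overall = mean(vec[j] for vec in docs)
        between = 0.0
        within = 0.0
        for vecs in by_class.values():
            column = [vec[j] for vec in vecs]
            # Spread of the class mean around the corpus mean.
            between += len(column) * (mean(column) - overall) ** 2
            # Spread of the documents around their own class mean.
            within += len(column) * pvariance(column)
        if within > 0:
            scores.append(between / within)
        else:
            # Zero within-class spread: perfectly separating or constant.
            scores.append(float("inf") if between > 0 else 0.0)
    return scores


# Hypothetical toy corpus: three term-count features, two classes.
docs = [[3, 1, 0], [4, 1, 1], [0, 1, 4], [1, 1, 3]]
labels = ["sport", "sport", "finance", "finance"]
scores = fisher_scores(docs, labels)
```

In this toy corpus the second feature takes the same value in every document, so it receives a score of zero, while the first and third features separate the two classes and score higher. A ranking of this kind is one way such a measure could assist feature selection before any classifier is chosen.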