Machine Learning in Automated Text Categorization
The automated categorization (or classification) of texts into predefined
categories has witnessed a booming interest in the last ten years, due to the
increased availability of documents in digital form and the ensuing need to
organize them. In the research community the dominant approach to this problem
is based on machine learning techniques: a general inductive process
automatically builds a classifier by learning, from a set of preclassified
documents, the characteristics of the categories. The advantages of this
approach over the knowledge engineering approach (consisting of the manual
definition of a classifier by domain experts) are very good effectiveness,
considerable savings in terms of expert manpower, and straightforward
portability to different domains. This survey discusses the main approaches to
text categorization that fall within the machine learning paradigm. We will
discuss in detail issues pertaining to three different problems, namely
document representation, classifier construction, and classifier evaluation.
Comment: Accepted for publication in ACM Computing Surveys
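The three problems named in this abstract (document representation, classifier construction, and classifier evaluation) can be illustrated with a minimal sketch. It assumes a bag-of-words representation, a multinomial Naive Bayes classifier with add-one smoothing, and F1 as the evaluation measure; these choices, the function names, and the toy documents are illustrative assumptions and are not taken from the survey.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    # Document representation (assumed): lower-cased bag of words.
    return text.lower().split()

def train_nb(docs, labels):
    # Classifier construction: collect per-class term counts
    # from the preclassified documents.
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = tokenize(text)
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, term_counts, vocab

def predict_nb(model, text):
    # Pick the class with the highest log posterior (add-one smoothing).
    class_counts, term_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        denom = sum(term_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # ignore terms never seen in training
                score += math.log((term_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def f1_score(gold, predicted, positive):
    # Classifier evaluation: harmonic mean of precision and recall
    # for one category.
    tp = sum(g == positive and p == positive for g, p in zip(gold, predicted))
    fp = sum(g != positive and p == positive for g, p in zip(gold, predicted))
    fn = sum(g == positive and p != positive for g, p in zip(gold, predicted))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy, made-up training set (hypothetical data).
train_docs = ["cheap pills buy now", "buy cheap watches",
              "meeting agenda for monday", "project meeting notes"]
train_labels = ["spam", "spam", "ham", "ham"]
model = train_nb(train_docs, train_labels)

test_docs = ["buy cheap pills", "monday project meeting"]
gold = ["spam", "ham"]
predicted = [predict_nb(model, d) for d in test_docs]
```

Calling `f1_score(gold, predicted, "spam")` then scores the classifier on the held-out documents for the `spam` category.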
Introspective knowledge acquisition for case retrieval networks in textual case base reasoning.
Textual Case Based Reasoning (TCBR) aims at effective reuse of information contained in unstructured documents. The key advantage of TCBR over traditional Information Retrieval systems is its ability to incorporate domain-specific knowledge to facilitate case comparison beyond simple keyword matching. However, substantial human intervention is needed to acquire and transform this knowledge into a form suitable for a TCBR system. In this research, we present automated approaches that exploit statistical properties of document collections to alleviate this knowledge acquisition bottleneck. We focus on two important knowledge containers: relevance knowledge, which shows relatedness of features to cases, and similarity knowledge, which captures the relatedness of features to each other. The terminology is derived from the Case Retrieval Network (CRN) retrieval architecture in TCBR, which is used as the underlying formalism in this thesis applied to text classification. Latent Semantic Indexing (LSI) generated concepts are a useful resource for relevance knowledge acquisition for CRNs. This thesis introduces a supervised LSI technique called sprinkling that exploits class knowledge to bias LSI's concept generation. An extension of this idea, called Adaptive Sprinkling (AS), has been proposed to handle inter-class relationships in complex domains such as hierarchical (e.g. Yahoo directory) and ordinal (e.g. product ranking) classification tasks. Experimental results show the superiority of CRNs created with sprinkling and AS, not only over LSI on its own, but also over state-of-the-art classifiers like Support Vector Machines (SVM). Current statistical approaches based on feature co-occurrences can be utilized to mine similarity knowledge for CRNs. However, related words often do not co-occur in the same document, though they co-occur with similar words. We introduce an algorithm to efficiently mine such indirect associations, called higher order associations.
Empirical results show that CRNs created with the acquired similarity knowledge outperform both LSI and SVM. Incorporating acquired knowledge into the CRN transforms it into a densely connected network. While improving retrieval effectiveness, this has the unintended effect of slowing down retrieval. We propose a novel retrieval formalism called the Fast Case Retrieval Network (FCRN) which eliminates redundant run-time computations to improve retrieval speed. Experimental results show FCRN's ability to scale up over high dimensional textual casebases. Finally, we investigate novel ways of visualizing and estimating the complexity of textual casebases that can help explain performance differences across casebases. Visualization provides a qualitative insight into the casebase, while complexity is a quantitative measure that characterizes classification or retrieval hardness intrinsic to a dataset. We study correlations of experimental results from the proposed approaches against complexity measures over diverse casebases.
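The idea of higher order associations described in this abstract (words that never share a document but co-occur with the same other words) can be illustrated with a small sketch. This is not the thesis's mining algorithm, which the abstract does not specify; it is a naive second-order version, and the function names and toy documents are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

def first_order(docs):
    # Map each word to the set of words it shares at least one document with.
    neighbours = defaultdict(set)
    for doc in docs:
        words = sorted(set(doc.lower().split()))
        for a, b in combinations(words, 2):
            neighbours[a].add(b)
            neighbours[b].add(a)
    return dict(neighbours)

def second_order(neighbours):
    # Find word pairs that never co-occur directly but are linked through
    # a shared co-occurring word; map each pair to its set of linking words.
    links = defaultdict(set)
    for bridge, linked in neighbours.items():
        for a, b in combinations(sorted(linked), 2):
            if b not in neighbours.get(a, set()):
                links[(a, b)].add(bridge)
    return dict(links)

# Toy, made-up documents: "car" and "automobile" never appear together,
# but both appear with "engine" and "wheel".
docs = ["car engine", "automobile engine", "car wheel", "automobile wheel"]
links = second_order(first_order(docs))
```

Here `links[("automobile", "car")]` holds the words that connect the two terms indirectly. A practical miner would also need to weight such links and avoid enumerating all pairs, which this sketch does not attempt.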
- …