
    Machine Learning in Automated Text Categorization

    The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (consisting of the manual definition of a classifier by domain experts) are very good effectiveness, considerable savings in terms of expert manpower, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We discuss in detail issues pertaining to three different problems, namely document representation, classifier construction, and classifier evaluation. Comment: Accepted for publication in ACM Computing Surveys.
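The inductive process the abstract describes, learning category characteristics from preclassified documents, can be sketched with a minimal naive Bayes classifier. This is a toy illustration, not the survey's own method; the document format (text, category) pairs and the word-level tokenization are assumptions for the example.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """Learn per-category word statistics from preclassified documents.
    docs: list of (text, category) pairs (toy format assumed here)."""
    word_counts = defaultdict(Counter)
    cat_counts = Counter()
    for text, cat in docs:
        cat_counts[cat] += 1
        word_counts[cat].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, cat_counts, vocab

def classify(text, word_counts, cat_counts, vocab):
    """Assign text to the category with the highest smoothed log-probability."""
    total_docs = sum(cat_counts.values())
    best_cat, best_lp = None, float("-inf")
    for cat in cat_counts:
        lp = math.log(cat_counts[cat] / total_docs)  # category prior
        n = sum(word_counts[cat].values())
        for w in text.lower().split():
            # Laplace smoothing so unseen words do not zero out the score
            lp += math.log((word_counts[cat][w] + 1) / (n + len(vocab)))
        if lp > best_lp:
            best_cat, best_lp = cat, lp
    return best_cat
```

The key point the abstract makes is visible here: nothing category-specific is hand-coded; everything the classifier knows comes from the preclassified training sample.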

    A probabilistic threshold model: Analyzing semantic categorization data with the Rasch model

    According to the Threshold Theory (Hampton, 1995, 2007), semantic categorization decisions come about through the placement of a threshold criterion along a dimension that represents items' similarity to the category representation. The adequacy of this theory is assessed by applying a formalization of the theory, known as the Rasch model (Rasch, 1960; Thissen & Steinberg, 1986), to categorization data for eight natural language categories and subjecting it to a formal test. In validating the model, special care is given to its ability to account for inter- and intra-individual differences in categorization and their relationship with item typicality. Extensions of the Rasch model that can be used to uncover the nature of category representations and the sources of categorization differences are discussed.
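The Rasch model formalization the abstract refers to has a standard logistic form: the probability of a "yes" categorization grows with the distance between the respondent's threshold parameter and the item's position on the latent similarity dimension. A minimal sketch (parameter names theta and beta are the conventional Rasch symbols, not taken from this paper):

```python
import math

def rasch_prob(theta, beta):
    """Standard Rasch form: probability of endorsing category membership
    for a respondent with parameter theta and an item at position beta
    on the latent similarity dimension."""
    return 1.0 / (1.0 + math.exp(-(theta - beta)))
```

When an item sits exactly at the threshold (theta == beta) the model predicts a 50% endorsement rate; items well above the threshold approach certainty, which is how the model accommodates graded typicality effects.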

    Optimal Feature Subset Selection Based on Combining Document Frequency and Term Frequency for Text Classification

    Feature selection plays a vital role in reducing the high dimension of the feature space in the text document classification problem. The dimension reduction of the feature space reduces the computation cost and improves the text classification system's accuracy. Hence, the identification of a proper subset of the significant features of the text corpus is needed to classify the data in less computational time with higher accuracy. In this research, a novel feature selection method that combines document frequency and term frequency (FS-DFTF) is used to measure the significance of a term. The optimal feature subset selected by the proposed method is evaluated using Naive Bayes and Support Vector Machine classifiers on various popular benchmark text corpora. The experimental outcome confirms that the proposed method has better classification accuracy when compared with other feature selection techniques.
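The abstract does not reproduce the FS-DFTF formula, so the scoring function below is only an illustrative stand-in for the general idea of combining the two signals: document frequency (in how many documents a term appears) and total term frequency. The product used here is an assumption, not the paper's actual criterion.

```python
from collections import Counter

def score_terms(docs):
    """Illustrative term scoring combining document frequency (df) and
    term frequency (tf). The tf*df product is a placeholder, not the
    actual FS-DFTF formula from the paper."""
    tf, df = Counter(), Counter()
    for doc in docs:
        words = doc.lower().split()
        tf.update(words)          # every occurrence counts
        df.update(set(words))     # at most once per document
    return {w: tf[w] * df[w] for w in tf}

def select_top_k(docs, k):
    """Keep the k highest-scoring terms as the reduced feature subset."""
    scores = score_terms(docs)
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [w for w, _ in ranked[:k]]
```

Keeping only the top-k terms is what yields the computational savings the abstract mentions: downstream classifiers see a much smaller feature space.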

    Semantic Sort: A Supervised Approach to Personalized Semantic Relatedness

    We propose and study a novel supervised approach to learning statistical semantic relatedness models from subjectively annotated training examples. The proposed semantic model consists of parameterized co-occurrence statistics associated with textual units of a large background knowledge corpus. We present an efficient algorithm for learning such semantic models from a training sample of relatedness preferences. Our method is corpus independent and can essentially rely on any sufficiently large (unstructured) collection of coherent texts. Moreover, the approach facilitates the fitting of semantic models for specific users or groups of users. We present the results of an extensive range of experiments, from small to large scale, indicating that the proposed method is effective and competitive with the state-of-the-art. Comment: 37 pages, 8 figures. A short version of this paper was already published at ECML/PKDD 201

    A pattern mining approach for information filtering systems

    It is a big challenge to clearly identify the boundary between positive and negative streams for information filtering systems. Several attempts have used negative feedback to solve this challenge; however, there are two issues in using negative relevance feedback to improve the effectiveness of information filtering. The first is how to select constructive negative samples in order to reduce the space of negative documents. The second is how to decide which noisy extracted features should be updated based on the selected negative samples. This paper proposes a pattern mining based approach to select some offenders from the negative documents, where an offender can be used to reduce the side effects of noisy features. It also classifies extracted features (i.e., terms) into three categories: positive specific terms, general terms, and negative specific terms. In this way, multiple revising strategies can be used to update extracted features. An iterative learning algorithm is also proposed to implement this approach on the RCV1 data collection, and substantial experiments show that the proposed approach achieves encouraging performance, which is also consistent for adaptive filtering.
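The three-way split of terms the abstract describes can be sketched as follows. The decision rule here (membership in positive-only, negative-only, or both document sets) is a simplified illustration; the paper's actual criteria and any frequency thresholds are not reproduced.

```python
def categorize_terms(pos_docs, neg_docs):
    """Illustrative three-way term split: terms seen only in relevant
    (positive) documents are 'positive specific', terms seen only in
    irrelevant (negative) documents are 'negative specific', and terms
    seen in both are 'general'. Set membership is a stand-in for the
    paper's actual pattern-mining criteria."""
    pos_terms = {w for d in pos_docs for w in d.lower().split()}
    neg_terms = {w for d in neg_docs for w in d.lower().split()}
    categories = {}
    for term in pos_terms | neg_terms:
        if term in pos_terms and term not in neg_terms:
            categories[term] = "positive specific"
        elif term in neg_terms and term not in pos_terms:
            categories[term] = "negative specific"
        else:
            categories[term] = "general"
    return categories
```

Once terms are labeled this way, each class can receive a different revising strategy, e.g., boosting positive specific terms while discounting negative specific ones.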

    The role of the frontal cortex in memory: an investigation of the Von Restorff effect

    Evidence from neuropsychology and neuroimaging indicates that the pre-frontal cortex (PFC) plays an important role in human memory. Although frontal patients are able to form new memories, these memories appear qualitatively different from those of controls in lacking distinctiveness. Neuroimaging studies of memory indicate activation in the PFC under deep encoding conditions and under conditions of semantic elaboration. Based on these results, we hypothesize that the PFC enhances memory by extracting differences and commonalities in the studied material. To test this hypothesis, we experimentally investigated the relationship between PFC-dependent factors and semantic factors associated with common and specific features of words. These experiments were performed using free recall of word lists with healthy adults, exploiting the correlation between PFC function and fluid intelligence. As predicted, a correlation was found between fluid intelligence and the Von Restorff effect (better memory for semantic isolates, e.g., the isolate "cat" within a list of "fruit" category members). Moreover, memory for the semantic isolate was found to depend on the isolate's serial position. The isolate item tends to be recalled first, in comparison to non-isolates, suggesting that the process interacts with short term memory. These results are captured within a computational model of free recall, which includes a PFC mechanism that is sensitive to both commonality and distinctiveness, sustaining a trade-off between the two.