42,202 research outputs found

    Toward Optimal Feature Selection in Naive Bayes for Text Categorization

    Full text link
    Automated feature selection is important for text categorization to reduce the feature size and to speed up the learning process of classifiers. In this paper, we present a novel and efficient feature selection framework based on the Information Theory, which aims to rank the features with their discriminative capacity for classification. We first revisit two information measures: Kullback-Leibler divergence and Jeffreys divergence for binary hypothesis testing, and analyze their asymptotic properties relating to type I and type II errors of a Bayesian classifier. We then introduce a new divergence measure, called Jeffreys-Multi-Hypothesis (JMH) divergence, to measure multi-distribution divergence for multi-class classification. Based on the JMH-divergence, we develop two efficient feature selection methods, termed maximum discrimination (MDMD) and MD−χ2MD-\chi^2 methods, for text categorization. The promising results of extensive experiments demonstrate the effectiveness of the proposed approaches.Comment: This paper has been submitted to the IEEE Trans. Knowledge and Data Engineering. 14 pages, 5 figure

    Embedding Feature Selection for Large-scale Hierarchical Classification

    Full text link
    Large-scale Hierarchical Classification (HC) involves datasets consisting of thousands of classes and millions of training instances with high-dimensional features posing several big data challenges. Feature selection that aims to select the subset of discriminant features is an effective strategy to deal with large-scale HC problem. It speeds up the training process, reduces the prediction time and minimizes the memory requirements by compressing the total size of learned model weight vectors. Majority of the studies have also shown feature selection to be competent and successful in improving the classification accuracy by removing irrelevant features. In this work, we investigate various filter-based feature selection methods for dimensionality reduction to solve the large-scale HC problem. Our experimental evaluation on text and image datasets with varying distribution of features, classes and instances shows upto 3x order of speed-up on massive datasets and upto 45% less memory requirements for storing the weight vectors of learned model without any significant loss (improvement for some datasets) in the classification accuracy. Source Code: https://cs.gmu.edu/~mlbio/featureselection.Comment: IEEE International Conference on Big Data (IEEE BigData 2016

    Radar-based Road User Classification and Novelty Detection with Recurrent Neural Network Ensembles

    Full text link
    Radar-based road user classification is an important yet still challenging task towards autonomous driving applications. The resolution of conventional automotive radar sensors results in a sparse data representation which is tough to recover by subsequent signal processing. In this article, classifier ensembles originating from a one-vs-one binarization paradigm are enriched by one-vs-all correction classifiers. They are utilized to efficiently classify individual traffic participants and also identify hidden object classes which have not been presented to the classifiers during training. For each classifier of the ensemble an individual feature set is determined from a total set of 98 features. Thereby, the overall classification performance can be improved when compared to previous methods and, additionally, novel classes can be identified much more accurately. Furthermore, the proposed structure allows to give new insights in the importance of features for the recognition of individual classes which is crucial for the development of new algorithms and sensor requirements.Comment: 8 pages, 9 figures, accepted paper for 2019 IEEE Intelligent Vehicles Symposium (IV), Paris, France, June 201

    KACST Arabic Text Classification Project: Overview and Preliminary Results

    No full text
    Electronically formatted Arabic free-texts can be found in abundance these days on the World Wide Web, often linked to commercial enterprises and/or government organizations. Vast tracts of knowledge and relations lie hidden within these texts, knowledge that can be exploited once the correct intelligent tools have been identified and applied. For example, text mining may help with text classification and categorization. Text classification aims to automatically assign text to a predefined category based on identifiable linguistic features. Such a process has different useful applications including, but not restricted to, E-Mail spam detection, web pages content filtering, and automatic message routing. In this paper an overview of King Abdulaziz City for Science and Technology (KACST) Arabic Text Classification Project will be illustrated along with some preliminary results. This project will contribute to the better understanding and elaboration of Arabic text classification techniques
    • …
    corecore