28 research outputs found

    Evolving rules for document classification

    Get PDF
    We describe a novel method for using Genetic Programming to create compact classification rules based on combinations of N-Grams (character strings). Genetic programs acquire fitness by producing rules that are effective classifiers in terms of precision and recall when evaluated against a set of training documents. We describe a set of functions and terminals and provide results from a classification task using the Reuters 21578 dataset. We also suggest that because the induced rules are meaningful to a human analyst they may have a number of other uses beyond classification and provide a basis for text mining applications

    Using IR techniques to improve Automated Text Classification

    Get PDF
    This paper performs a study on the pre-processing phase of the automated text classification problem. We use the linear Support Vector Machine paradigm applied to datasets written in the English and the European Portuguese languages – the Reuters and the Portuguese Attorney General’s Office datasets, respectively. The study can be seen as a search, for the best document representa- tion, in three different axes: the feature reduction (using linguistic in- formation), the feature selection (using word frequencies) and the term weighting (using information retrieval measures)

    XML Documents Within a Legal Domain: Standards and Tools for the Italian Legislative Environment

    No full text

    Uncertainty-based Noise Reduction and Term Selection in Text Categorization

    No full text
    This paper introduces a new criterium for term selection, which is based on the notion of Uncertainty. Term selection according to this criterium is performed by the elimination of noisy terms on a class-by-class basis, rather than by selecting the most significant ones. Uncertainty-base

    Using Negation and Phrases in Inducing Rules for Text Classification

    No full text

    Parallel-Sequential Texture Analysis

    No full text
    Color induced texture analysis is explored, using two texture analysis techniques: the co-occurrence matrix and the color correlogram as well as color histograms. Several quantization schemes for six color spaces and the human-based 11 color quantization scheme have been applied. The VisTex texture database was used as test bed. A new color induced texture analysis approach is introduced: the parallel-sequential approach; i.e., the color correlogram combined with the color histogram. This new approach was found to be highly successful (up to 96% correct classification). Moreover, the 11 color quantization scheme performed excellent (94% correct classification) and should, therefore, be incorporated for real-time image analysis. In general, the results emphasize the importance of the use of color for texture analysis and of color as global image feature. Moreover, it illustrates the complementary character of both features

    Towards Language Independent Automated Learning of Text Categorization Models

    No full text
    corecore