    A New Feature Selection Method based on Intuitionistic Fuzzy Entropy to Categorize Text Documents

    Selecting highly discriminative features from text documents is a major challenge in categorization. Feature selection is an important task that reduces the dimensionality of the feature matrix, which in turn improves categorization performance. This article presents a new feature selection method based on Intuitionistic Fuzzy Entropy (IFE) for text categorization. First, the Intuitionistic Fuzzy C-Means (IFCM) clustering method is employed to compute the intuitionistic membership values. The computed intuitionistic membership values are then used to estimate intuitionistic fuzzy entropy via the match degree. Features with lower entropy values are selected to categorize the text documents. To evaluate the proposed method, experiments are conducted on three standard benchmark datasets using three classifiers, with the F-measure used to assess classifier performance. The proposed method shows impressive results compared with other well-known feature selection methods. Moreover, the Intuitionistic Fuzzy Set (IFS) property addresses the uncertainty limitations of traditional fuzzy sets.
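
The selection step described in this abstract (score each feature by an entropy computed from its intuitionistic memberships, then keep the lowest-entropy features) can be illustrated with a minimal Python sketch. The abstract does not give the match-degree entropy formula, so the sketch swaps in a simple stand-in: the mean hesitation degree pi = 1 - mu - nu, with the non-membership nu derived from mu by a Sugeno-type complement. The function names, the parameter `lam`, and the toy membership values are all assumptions made for illustration and are not taken from the paper.

```python
def sugeno_nonmembership(mu, lam=2.0):
    """Sugeno-type complement: maps a membership mu to a non-membership nu.

    lam > 0 is a hypothetical tuning parameter; it guarantees nu <= 1 - mu.
    """
    return (1.0 - mu) / (1.0 + lam * mu)


def hesitation_entropy(memberships, lam=2.0):
    """Stand-in entropy: mean hesitation degree pi = 1 - mu - nu.

    This is NOT the match-degree entropy of the paper, only a placeholder
    that is near zero for crisp memberships and larger for ambiguous ones.
    """
    total = 0.0
    for mu in memberships:
        nu = sugeno_nonmembership(mu, lam)
        total += 1.0 - mu - nu
    return total / len(memberships)


def select_low_entropy_features(membership_by_feature, k, lam=2.0):
    """Return the k features with the lowest entropy scores."""
    scores = {
        feature: hesitation_entropy(mus, lam)
        for feature, mus in membership_by_feature.items()
    }
    return sorted(scores, key=scores.get)[:k]


# Toy cluster memberships per feature (illustrative values only).
memberships = {
    "goal": [0.95, 0.03, 0.02],   # concentrated in one cluster
    "the": [0.34, 0.33, 0.33],    # spread evenly, hence ambiguous
    "stock": [0.05, 0.90, 0.05],  # concentrated in one cluster
}
selected = select_low_entropy_features(memberships, k=2)
```

In this toy input the evenly spread feature "the" receives the highest score and is dropped, which mirrors the idea that low-entropy features are the more discriminative ones.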

    Maximum Entropy Modeling with Feature Selection for Text Categorization

    Abstract. Maximum entropy provides a reasonable way of estimating probability distributions and has been widely used for a number of language processing tasks. In this paper, we explore the use of different feature selection methods for text categorization using maximum entropy modeling. We also propose a new feature selection method based on the difference between the relative document frequencies of a feature in the relevant and irrelevant classes. Our experiments on the Reuters RCV1 data set show that our feature selection method performs better than the other feature selection methods and that maximum entropy modeling is a competitive method for text categorization.
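
The proposed criterion, the difference between a feature's relative document frequencies in the relevant and irrelevant classes, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the abstract does not say whether the signed or the absolute difference is used for ranking, so the sketch keeps the signed score and ranks by absolute value. The function names and the toy documents are hypothetical.

```python
from collections import Counter


def relative_df_difference(docs, labels):
    """Score each term by df_rel/N_rel - df_irr/N_irr.

    docs   : list of token lists, one per document
    labels : list of bools, True if the document is relevant to the class
    """
    n_rel = sum(labels)
    n_irr = len(labels) - n_rel
    if n_rel == 0 or n_irr == 0:
        raise ValueError("need at least one relevant and one irrelevant document")

    df_rel, df_irr = Counter(), Counter()
    for tokens, is_relevant in zip(docs, labels):
        # set() so each document counts a term at most once
        (df_rel if is_relevant else df_irr).update(set(tokens))

    vocabulary = set(df_rel) | set(df_irr)
    return {t: df_rel[t] / n_rel - df_irr[t] / n_irr for t in vocabulary}


def top_features(scores, k):
    """Rank by absolute score (an assumption), breaking ties alphabetically."""
    return sorted(scores, key=lambda t: (-abs(scores[t]), t))[:k]


# Toy corpus (illustrative only).
docs = [
    ["oil", "price", "rise"],
    ["oil", "export"],
    ["match", "price"],
    ["match", "goal"],
]
labels = [True, True, False, False]

scores = relative_df_difference(docs, labels)
best = top_features(scores, k=2)
```

Here "oil" appears in every relevant document and no irrelevant one, "match" is the reverse, and "price" is equally frequent in both classes, so it scores zero and would be discarded first.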

    Statistical Methods of Text Classification

    In recent years, with the rapid development of computer and Internet technologies, computational methods of text analysis have become increasingly important, in particular methods for determining the sentiment or tone of a text. These methods can subsequently be used in tasks such as text summarization, information retrieval from text, text correctness checking, machine translation, and many others. This monograph contains a review of sentiment analysis methods for mainly English-language documents, a study of the effectiveness of selected sentiment analysis methods applied to Polish-language documents, and proposals of new methods that may improve classification quality. In the new proposals, emphasis is placed on binary classification problems, on not using external sources, and on relying on the training set as little as possible. We propose shifting the burden of text classification from a large training set to finding and analyzing relationships between the words that make up a document, and even between groups of words. The proposed method has a simple interpretation, can compete with standard methods, and can be used for other problems related to determining the sentiment of texts.

    Characterisation of business documents: an approach to the automation of quality assessment

    This thesis explores a new approach to automatic characterisation of business documents of different levels of document effectiveness. Supervised text categorisation techniques are used to derive text features that characterise a specific type of business document in accordance with pre-assigned levels of document utility. The documents in question are the executive summary sections of a representative sample of sales proposal documents. The executive summaries are first rated by domain experts against a quality framework comprising pre-selected dimensions of document quality. An automatic analysis of the texts shows that certain words, word sequences, and patterns of words have the capacity to discriminate between executive summaries of varying levels of document effectiveness. Function words, which are frequently ignored in many text classification tasks, are retained and are shown to provide an important element of the word patterns. Automatic text classifiers that utilise these features are shown to categorise previously unseen executive summaries at an acceptable level of classification performance. The outcomes of the research are applied to the development of a new computer application. The application identifies, in the text of a new executive summary, word patterns that discriminate between sets of summaries previously categorised into different levels of document utility. The action of highlighting the respective categories of discriminating word patterns directs authors to areas of text that may need further attention. A trial of a prototype of the application suggests that it provides an effective way to help sales professionals improve the content and quality of the text of this type of business document. Moreover, as the approach is suitably generic, it could be applied to different types of document in different domains