4 research outputs found

    Machine Learning in Automated Text Categorization

    The automated categorization (or classification) of texts into predefined categories has witnessed a booming interest in the last ten years, due to the increased availability of documents in digital form and the ensuing need to organize them. In the research community the dominant approach to this problem is based on machine learning techniques: a general inductive process automatically builds a classifier by learning, from a set of preclassified documents, the characteristics of the categories. The advantages of this approach over the knowledge engineering approach (in which domain experts define a classifier manually) are very good effectiveness, considerable savings in terms of expert manpower, and straightforward portability to different domains. This survey discusses the main approaches to text categorization that fall within the machine learning paradigm. We will discuss in detail issues pertaining to three different problems, namely document representation, classifier construction, and classifier evaluation.
    Comment: Accepted for publication in ACM Computing Surveys
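
    The three problems named in this abstract (document representation, classifier construction, classifier evaluation) can be illustrated with a minimal sketch. The choice of a bag-of-words representation, a multinomial naive Bayes classifier with add-one smoothing, and plain accuracy as the measure are assumptions made for illustration, as are the toy documents and all function names; the survey itself covers many alternatives.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Document representation: a lowercase bag of words."""
    return text.lower().split()

def train_naive_bayes(labeled_docs):
    """Classifier construction: learn term counts per category
    from a set of preclassified documents."""
    doc_counts = Counter()
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for text, category in labeled_docs:
        doc_counts[category] += 1
        for term in tokenize(text):
            term_counts[category][term] += 1
            vocabulary.add(term)
    return doc_counts, term_counts, vocabulary

def classify(model, text):
    """Return the category with the highest log-posterior score."""
    doc_counts, term_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in doc_counts:
        score = math.log(doc_counts[category] / total_docs)
        total_terms = sum(term_counts[category].values())
        for term in tokenize(text):
            if term not in vocabulary:
                continue  # terms never seen in training carry no evidence
            count = term_counts[category][term]
            # Add-one (Laplace) smoothing avoids log(0).
            score += math.log((count + 1) / (total_terms + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category

def accuracy(model, labeled_docs):
    """Classifier evaluation: fraction of documents labeled correctly."""
    correct = sum(classify(model, text) == category
                  for text, category in labeled_docs)
    return correct / len(labeled_docs)

# Hypothetical toy data, invented for this sketch.
TRAIN = [
    ("stocks fell as markets closed lower", "finance"),
    ("the bank raised interest rates", "finance"),
    ("the team won the final match", "sport"),
    ("the striker scored two goals", "sport"),
]
TEST = [
    ("interest rates and markets", "finance"),
    ("the match ended with late goals", "sport"),
]

model = train_naive_bayes(TRAIN)
test_accuracy = accuracy(model, TEST)
```

    On realistic collections, evaluation would normally use held-out data and per-category measures such as precision and recall rather than accuracy on a handful of documents.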

    Analysis and implementation of methods for the text categorization

    Text Categorization (TC) is the automatic classification of text documents under predefined categories, or classes. Popular TC approaches map categories into symbolic labels and use a training set of documents, previously labeled by human experts, to build a classifier that enables the automatic TC of unlabeled documents. Suitable TC methods come from the fields of data mining and information retrieval; however, the following issues remain unsolved. First, classifier performance depends heavily on hand-labeled documents, which are the only source of knowledge for learning the classifier. Being a labor-intensive and time-consuming activity, the manual attribution of documents to categories is extremely costly. This creates a serious limitation when a set of manually labeled data is not available, as happens in most cases. Second, even a moderately sized text collection often has tens of thousands of terms, making the classification cost prohibitive for learning algorithms that do not scale well to large problem sizes. Most importantly, TC should be based on the text content rather than on a set of hand-labeled documents whose categorization depends on the subjective judgment of a human classifier. This thesis aims to address these issues by proposing innovative approaches that leverage techniques from data mining and information retrieval. To address both the high dimensionality of the text collection and the large number of terms in a single text, the thesis proposes a hybrid model for term selection that combines and takes advantage of both filter and wrapper approaches. In detail, the proposed model uses a filter to rank the list of terms present in documents, so that useful terms are unlikely to be screened out. Next, to limit classification problems due to the correlation among terms, this ranked list is refined by a wrapper that uses a Genetic Algorithm (GA) to retain the most informative and discriminative terms. Experimental results compare well with some of the top-performing learning algorithms for TC and seem to confirm the effectiveness of the proposed model. To address the lack and the subjectivity of manually labeled datasets, the basic idea is to use an ontology-based approach that does not depend on the existence of a training set and relies solely on a set of concepts within a given domain and the relationships between those concepts. In this regard, the thesis proposes a text categorization approach that applies WordNet to select the correct sense of words in a document and uses the domain names in WordNet Domains for classification purposes. Experiments show that the proposed approach performs well in classifying a large corpus of documents. This thesis contributes to the area of data mining and information retrieval. Specifically, it introduces and evaluates novel techniques in the field of text categorization. The primary objective of this thesis is to test the following hypotheses: (i) text categorization requires and benefits from techniques designed to exploit document content; (ii) hybrid methods from data mining and information retrieval can better handle the high dimensionality that characterizes large document collections; (iii) in the absence of manually annotated documents, the WordNet domain abstraction can be used, as it is both useful and general enough to categorize any document collection. As a final remark, it is important to acknowledge that much of the inspiration and motivation for this work derived from a vision of the future of text categorization processes related to specific application domains, such as business and industrial sectors, to cite just a few. In the end, it is this vision that provided the guiding framework. However, it is equally important to understand that many of the results and techniques developed in this thesis are not limited to text categorization. For example, the evaluation of disambiguation methods is interesting in its own right and is likely to be relevant to other application fields.
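
    The hybrid term selection described in this abstract (a filter that ranks terms, followed by a GA wrapper that refines the ranked list) might be sketched as below. The abstract does not state the filter score, the classifier used inside the wrapper, or the GA settings, so everything here is an assumption for illustration: the document-frequency spread used as filter score, the term-voting classifier, the size penalty in the fitness, the GA parameters, the toy documents, and all function names.

```python
import random
from collections import Counter

# Hypothetical toy corpus, invented for this sketch.
DOCS = [
    ("cheap loans and low rates from the bank", "finance"),
    ("the bank cut rates on loans", "finance"),
    ("markets and the bank react to rates", "finance"),
    ("the team won the match", "sport"),
    ("the coach praised the team after the match", "sport"),
    ("goals decided the match for the team", "sport"),
]

def filter_rank(docs, keep):
    """Filter step: rank terms by how unevenly their document
    frequency spreads across categories, and keep the top ones."""
    categories = sorted({c for _, c in docs})
    df = {}  # term -> Counter(category -> number of documents)
    for text, category in docs:
        for term in set(text.split()):
            df.setdefault(term, Counter())[category] += 1

    def score(term):
        per_category = [df[term][c] for c in categories]
        return max(per_category) - min(per_category)

    ranked = sorted(df, key=lambda t: (-score(t), t))
    return ranked[:keep]

def build_fitness(docs, terms):
    """Wrapper objective: accuracy of a term-voting classifier that
    only sees the selected terms, minus a small size penalty.
    A real wrapper would measure accuracy on held-out documents."""
    owner = {}
    for term in terms:
        counts = Counter(c for text, c in docs if term in text.split())
        owner[term] = counts.most_common(1)[0][0]

    def fitness(mask):
        selected = {t for t, bit in zip(terms, mask) if bit}
        correct = 0
        for text, category in docs:
            votes = Counter(owner[t] for t in text.split() if t in selected)
            if votes and votes.most_common(1)[0][0] == category:
                correct += 1
        return correct / len(docs) - 0.01 * len(selected)

    return fitness

def genetic_wrapper(terms, fitness, generations=30, pop_size=12, seed=0):
    """GA over bit masks: keep the better half, then refill the
    population with one-point crossover and single-bit mutation."""
    rng = random.Random(seed)
    n = len(terms)
    # Seed the population with the full filtered list plus random masks.
    population = [[1] * n] + [
        [rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n)
            child = a[:cut] + b[cut:]
            flip = rng.randrange(n)
            child[flip] = 1 - child[flip]
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)

top_terms = filter_rank(DOCS, keep=8)
fitness = build_fitness(DOCS, top_terms)
best_mask = genetic_wrapper(top_terms, fitness)
selected_terms = [t for t, bit in zip(top_terms, best_mask) if bit]
```

    Because the best half of each generation is carried over unchanged, the best fitness never decreases, so the returned subset scores at least as well as the full filtered list under this fitness function.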

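    Ampliación y perfeccionamiento de los métodos cuantitativos y leyes clásicas en recuperación de la información: desarrollo de un sistema de indización y segmentación automática para textos en español [Extension and refinement of quantitative methods and classical laws in information retrieval: development of an automatic indexing and segmentation system for Spanish texts]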

    An Automatic Indexing and Segmentation System for long texts in Spanish is developed and implemented, contributing to their text categorization and automatic indexing. To develop it, the quantitative methods and classical laws of Information Retrieval are studied and refined, such as the models of the word repetition process (Zipf, 1949), (Mandelbrot, 1953) and of the vocabulary creation process (Heaps, 1978). The conditions under which the models apply are critically examined, and the stability of their parameters is studied experimentally through counts over texts and their fragments. A priori recommendations are established for the parameter values, depending on the circumstances of application and on the type of text analyzed. The behavior of the parameters of the formulas is observed in order to discern a direct relationship with the type of text analyzed. A new model (Log-%) is proposed for visualizing the frequency distribution of the words in a text. The ultimate goal is to detect the thematic changes that occur in a document, in order to establish its thematic structure and obtain the automatic indexing of each of its parts. In this way, the text or document is categorized using the enumeration of its thematic parts as levels or a tree structure. Once the thematic parts of the text have been established at their corresponding levels with the indexed terms, these are grouped into blocks distributed hierarchically according to how the document in question is broken down. The initial block describes the overall content of the whole document with an initial number of words or descriptors. Next, this initial block is subdivided into several blocks, which correspond to different parts of the whole document; each of these also contains a series of words that describes its content, and so on until it is possible to form the div....
    Rodríguez Luna, M. (2013). Ampliación y perfeccionamiento de los métodos cuantitativos y leyes clásicas en recuperación de la información: desarrollo de un sistema de indización y segmentación automática para textos en español [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/31517 Palanci
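
    The word-repetition and vocabulary-creation models cited in this abstract are usually written as f(r) ≈ C / r^α (Zipf), f(r) ≈ C / (r + b)^α (Mandelbrot) and V(n) ≈ K · n^β (Heaps), where f(r) is the frequency of the word of rank r and V(n) is the number of distinct words after n tokens. A minimal sketch of the counts behind such parameter estimates follows; the sample sentence, the function names, and the log-log least-squares fit are assumptions for illustration, and neither the parameter recommendations of the thesis nor its Log-% model are reproduced here.

```python
import math
from collections import Counter

def rank_frequency(tokens):
    """Zipf's view: (rank, frequency) pairs, most frequent word first."""
    counts = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(counts, start=1))

def vocabulary_growth(tokens):
    """Heaps' view: (tokens read, distinct words seen) pairs."""
    seen, growth = set(), []
    for n, token in enumerate(tokens, start=1):
        seen.add(token)
        growth.append((n, len(seen)))
    return growth

def loglog_slope(points):
    """Least-squares slope of log(y) against log(x)."""
    xs = [math.log(x) for x, _ in points]
    ys = [math.log(y) for _, y in points]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator

# Hypothetical sample text; real estimates need far longer texts.
TEXT = "the cat sat on the mat and the dog sat on the rug"
tokens = TEXT.split()

zipf_points = rank_frequency(tokens)
heaps_points = vocabulary_growth(tokens)

# Zipf's exponent is the negated slope of the rank-frequency line;
# Heaps' exponent is the slope of the vocabulary growth line.
zipf_alpha = -loglog_slope(zipf_points)
heaps_beta = loglog_slope(heaps_points)
```

    Running the same counts over successive fragments of a text is one simple way to observe how stable the fitted exponents are, which is the kind of experiment the abstract describes.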