Evolving Lucene search queries for text classification
We describe a method for generating accurate, compact,
human-understandable text classifiers. Text datasets are indexed using Apache Lucene, and genetic programs are used to construct
Lucene search queries. Genetic programs acquire fitness by
producing queries that are effective binary classifiers for a
particular category when evaluated against a set of training
documents. We describe a set of functions and terminals and
provide results from classification tasks.
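As a rough illustration of the fitness idea in this abstract, the sketch below scores a candidate query as a binary classifier against labeled training documents. It does not call Lucene: a query is modeled as a simple OR over a set of words, F1 is an assumed fitness measure (the abstract does not name one), and `matches`, `f1_fitness`, and the toy documents are hypothetical names and data chosen for the example.

```python
def matches(query_words, document):
    """Return True if the document contains any query word (stand-in for an OR query)."""
    tokens = set(document.lower().split())
    return any(word in tokens for word in query_words)


def f1_fitness(query_words, training_docs):
    """Score a query as a binary classifier for one category using F1.

    training_docs is a list of (text, in_category) pairs.
    """
    tp = fp = fn = 0
    for text, in_category in training_docs:
        hit = matches(query_words, text)
        if hit and in_category:
            tp += 1
        elif hit and not in_category:
            fp += 1
        elif not hit and in_category:
            fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy training set for a hypothetical "grain" category (illustrative data only).
TRAINING_DOCS = [
    ("wheat exports rose sharply", True),
    ("corn and wheat harvest delayed", True),
    ("barley prices fall", True),
    ("oil prices rose", False),
    ("bank raises interest rates", False),
]

# An evolutionary search would favor the candidate queries with the highest fitness.
candidates = [{"wheat"}, {"wheat", "barley"}, {"rose"}]
best = max(candidates, key=lambda q: f1_fitness(q, TRAINING_DOCS))
```

In an actual system the `matches` step would be a Lucene search over the indexed training set, and the candidates would be produced by crossover and mutation rather than listed by hand.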
Hybrid method for text categorization based on learning and rules
This article presents a new hybrid method for automatic text categorization that combines a machine learning algorithm, which makes it possible to build a base classification model with little effort from a labeled corpus, with a cascaded rule-based system that is used to filter and reorder the results of that base model. The model can be fine-tuned by adding specific rules for difficult categories that have not been trained satisfactorily. An implementation is described that uses the kNN algorithm and a basic rule language based on lists of terms appearing in the text to be classified. The system has been evaluated in different scenarios, including the Reuters-21578 news corpus for comparison with other approaches, and the IPTC and EUROVOC models. The results show that the system achieves precision and recall comparable to those of the best state-of-the-art methods.
Automatic document classification of biological literature
Background: Document classification is a wide-spread problem with many applications, from organizing search engine snippets to spam filtering. We previously described Textpresso, a text-mining system for biological literature, which marks up full text according to a shallow ontology that includes terms of biological interest. This project investigates document classification in the context of biological literature, making use of the Textpresso markup of a corpus of Caenorhabditis elegans literature.
Results: We present a two-step text categorization algorithm to classify a corpus of C. elegans papers. Our classification method first uses a support vector machine-trained classifier, followed by a novel, phrase-based clustering algorithm. This clustering step autonomously creates cluster labels that are descriptive and understandable by humans. This clustering engine performed better on a standard test-set (Reuters 21578) compared to previously published results (F-value of 0.55 vs. 0.49), while producing cluster descriptions that appear more useful. A web interface allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept.
Conclusions: We have demonstrated a simple method to classify biological documents that embodies an improvement over current methods. While the classification results are currently optimized for Caenorhabditis elegans papers by human-created rules, the classification engine can be adapted to different types of documents. We have demonstrated this by presenting a web interface that allows researchers to quickly navigate through the hierarchy and look for documents that belong to a specific concept.
Label Mask for Multi-Label Text Classification
One of the key problems in multi-label text classification is how to take
advantage of the correlation among labels. However, it is very challenging to
directly model the correlations among labels in a complex and unknown label
space. In this paper, we propose a Label Mask multi-label text classification
model (LM-MTC), inspired by the cloze questions used in language
models. LM-MTC is able to capture implicit relationships among labels through
the powerful ability of pre-trained language models. On this basis, we assign a
different token to each potential label, and randomly mask the token with a
certain probability to build a label based Masked Language Model (MLM). We
train the MTC and MLM together, further improving the generalization ability of
the model. A large number of experiments on multiple datasets demonstrate the
effectiveness of our method.
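The label-masking step described in this abstract can be pictured with a small sketch. Everything below is an assumption made for illustration, not the LM-MTC implementation: each candidate label gets one slot holding "yes" or "no", the `[MASK]` and `[SEP]` strings follow common masked-language-model conventions, and `build_masked_label_input` is a hypothetical helper.

```python
import random

MASK = "[MASK]"


def build_masked_label_input(text_tokens, all_labels, true_labels, mask_prob, rng):
    """Prepend one slot per candidate label, then mask some slots at random.

    Each slot holds "yes" or "no" depending on whether the label applies
    (an assumed encoding). Returns the input sequence and a dict
    {position: original_token} for the masked slots, which a model would
    be trained to predict.
    """
    label_slots = ["yes" if label in true_labels else "no" for label in all_labels]
    targets = {}
    for position, token in enumerate(label_slots):
        if rng.random() < mask_prob:
            targets[position] = token  # prediction target for this slot
            label_slots[position] = MASK
    return label_slots + ["[SEP]"] + list(text_tokens), targets


# Usage: three hypothetical labels, one of which applies to the text.
labels = ["sports", "finance", "politics"]
rng = random.Random(0)  # fixed seed so the example is repeatable
sequence, targets = build_masked_label_input(
    ["team", "wins", "final"], labels, {"sports"}, mask_prob=0.5, rng=rng
)
```

At prediction time every label slot would be masked, so the same model that fills in cloze-style blanks also produces the label assignments.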
Hybrid Approach Combining Machine Learning and a Rule-Based Expert System for Text Categorization
This paper discusses a novel hybrid approach for text categorization that combines a machine learning algorithm, which provides a base model trained with a labeled corpus, with a rule-based expert system, which is used to improve the results provided by the previous classifier, by filtering false positives and dealing with false negatives. The main advantage is that the system can be easily fine-tuned by adding specific rules for those noisy or conflicting categories that have not been successfully trained. We also describe an implementation based on k-Nearest Neighbor and a simple rule language to express lists of positive, negative and relevant (multiword) terms appearing in the input text. The system is evaluated in several scenarios, including the popular Reuters-21578 news corpus for comparison to other approaches, and categorization using IPTC metadata, EUROVOC thesaurus and others. Results show that this approach achieves a precision that is comparable to top ranked methods, with the added value that it does not require a demanding human expert workload to trai
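A minimal sketch of the rule layer described above, under stated assumptions: the base model's output is taken as a plain set of categories, rules hold only positive and negative term lists (the "relevant" terms from the abstract are omitted), matching is simple substring search, and the semantics shown (a negative term removes a category, a positive term adds one) are a guess at how such rules might filter false positives and recover false negatives. The name `apply_rules` and the example rules are hypothetical.

```python
def apply_rules(text, predicted, rules):
    """Refine base-classifier output with per-category term lists.

    predicted: set of categories proposed by the base model (e.g. kNN).
    rules: {category: {"positive": [...], "negative": [...]}}.
    A negative term removes a predicted category (false-positive filter);
    otherwise a positive term adds a missed category (false-negative recovery).
    """
    lowered = text.lower()
    refined = set(predicted)
    for category, terms in rules.items():
        if any(term in lowered for term in terms.get("negative", [])):
            refined.discard(category)
        elif any(term in lowered for term in terms.get("positive", [])):
            refined.add(category)
    return refined


# Usage with one hypothetical rule for a "sports" category.
RULES = {"sports": {"positive": ["world cup"], "negative": ["stock market"]}}

recovered = apply_rules("The World Cup final drew record crowds", set(), RULES)
filtered = apply_rules("Stock market reacts to club sale", {"sports", "finance"}, RULES)
```

Keeping the rules outside the trained model is what allows a difficult category to be tuned without retraining the base classifier.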
Machine Learning in Automated Text Categorization
The automated categorization (or classification) of texts into predefined
categories has witnessed a booming interest in the last ten years, due to the
increased availability of documents in digital form and the ensuing need to
organize them. In the research community the dominant approach to this problem
is based on machine learning techniques: a general inductive process
automatically builds a classifier by learning, from a set of preclassified
documents, the characteristics of the categories. The advantages of this
approach over the knowledge engineering approach (consisting in the manual
definition of a classifier by domain experts) are very good effectiveness,
considerable savings in terms of expert manpower, and straightforward
portability to different domains. This survey discusses the main approaches to
text categorization that fall within the machine learning paradigm. We will
discuss in detail issues pertaining to three different problems, namely
document representation, classifier construction, and classifier evaluation.
Comment: Accepted for publication in ACM Computing Surveys
Order-free Learning Alleviating Exposure Bias in Multi-label Classification
Multi-label classification (MLC) assigns multiple labels to each sample.
Prior studies show that MLC can be transformed to a sequence prediction problem
with a recurrent neural network (RNN) decoder to model the label dependency.
However, training an RNN decoder requires a predefined order of labels, which is
not directly available in the MLC specification. Moreover, an RNN trained this way
tends to overfit the label combinations in the training set and has difficulty
generating unseen label sequences. In this paper, we propose a new framework
for MLC which does not rely on a predefined label order and thus alleviates
exposure bias. The experimental results on three multi-label classification
benchmark datasets show that our method outperforms competitive baselines by a
large margin. We also find the proposed approach has a higher probability of
generating label combinations not seen during training than the baseline
models. The result shows that the proposed approach has better generalization
capability.
A comparison of Lucene search queries evolved as text classifiers
In this article, we use a genetic algorithm to evolve seven
different types of Lucene search query with the objective of
generating accurate and readable text classifiers. We compare
the effectiveness of each of the different types of query using
three commonly used text datasets. We vary the number of
words available for classification and compare results for 4, 8,
and 16 words per category. The generated queries can also be
viewed as labels for the categories and there is a benefit to a
human analyst in being able to read and tune the classifier.
The evolved queries also provide an explanation of the classification
process. We consider the consistency of the classifiers
and compare their performance on categories of different
complexities. Finally, various approaches to the analysis of
the results are briefly explored.