10 research outputs found

    Evolving rules for document classification

    Get PDF
    We describe a novel method for using Genetic Programming to create compact classification rules based on combinations of N-Grams (character strings). Genetic programs acquire fitness by producing rules that are effective classifiers in terms of precision and recall when evaluated against a set of training documents. We describe a set of functions and terminals and provide results from a classification task using the Reuters 21578 dataset. We also suggest that because the induced rules are meaningful to a human analyst they may have a number of other uses beyond classification and provide a basis for text mining applications

    Evolving Lucene search queries for text classification

    Get PDF
    We describe a method for generating accurate, compact, human understandable text classifiers. Text datasets are indexed using Apache Lucene and Genetic Programs are used to construct Lucene search queries. Genetic programs acquire fitness by producing queries that are effective binary classifiers for a particular category when evaluated against a set of training documents. We describe a set of functions and terminals and provide results from classification tasks

    Evolving text classification rules with genetic programming

    Get PDF
    We describe a novel method for using genetic programming to create compact classification rules using combinations of N-grams (character strings). Genetic programs acquire fitness by producing rules that are effective classifiers in terms of precision and recall when evaluated against a set of training documents. We describe a set of functions and terminals and provide results from a classification task using the Reuters 21578 dataset. We also suggest that the rules may have a number of other uses beyond classification and provide a basis for text mining applications

    Document clustering with evolved search queries

    Get PDF
    Search queries define a set of documents located in a collection and can be used to rank the documents by assigning each document a score according to their closeness to the query in the multidimensional space of weighted terms. In this paper, we describe a system whereby an island model genetic algorithm (GA) creates individuals which can generate a set of Apache Lucene search queries for the purpose of text document clustering. A cluster is specified by the documents returned by a single query in the set. Each document that is included in only one of the clusters adds to the fitness of the individual and each document that is included in more than one cluster will reduce the fitness. The method can be refined by using the ranking score of each document in the fitness test. The system has a number of advantages; in particular, the final search queries are easily understood and offer a simple explanation of the clusters, meaning that an extra cluster labelling stage is not required. We describe how the GA can be used to build queries and show results for clustering on various data sets and with different query sizes. Results are also compared with clusters built using the widely used k-means algorithm

    SPAM detection: Naïve bayesian classification and RPN expression-based LGP approaches compared

    Get PDF
    An investigation is performed of a machine learning algorithm and the Bayesian classifier in the spam-filtering context. The paper shows the advantage of the use of Reverse Polish Notation (RPN) expressions with feature extraction compared to the traditional Naïve Bayesian classifier used for spam detection assuming the same features. The performance of the two is investigated using a public corpus and a recent private spam collection, concluding that the system based on RPN LGP (Linear Genetic Programming) gave better results compared to two popularly used open source Bayesian spam filters. © Springer International Publishing Switzerland 2016

    Document Clustering with Evolved Single Word Search Queries

    Get PDF
    We present a novel, hybrid approach for clustering text databases. We use a genetic algorithm to generate and evolve a set of single word search queries in Apache Lucene format. Clusters are formed as the set of documents matching a search query. The queries are optimized to maximize the number of documents returned and to minimize the overlap between clusters (documents returned by more than one query in a set). Optionally, the number of clusters can be specified in advance, which will normally result in an improvement in performance. Not all documents in a collection are returned by any of the search queries in a set, so once the search query evolution is completed a second stage is performed whereby a KNN algorithm is applied to assign all unassigned documents to their nearest cluster. We describe the method and compare effectiveness with other well-known existing systems on 8 different text datasets. We note that search query format has the qualitative benefits of being interpretable and providing an explanation of cluster construction

    Categorización de texto usando técnicas de machine learning aplicado a la clasificación de reclamos en los procesos de la Universidad Tecnológica de Bolívar /

    Get PDF
    En la mayoría de las organizaciones no solo es suficiente con lograr los objetivos propuestos, también es relevante la forma como estos son alcanzados, es por esto que las organizaciones se encuentran en una constante búsqueda de mecanismos que les permitan mejorar los procesos utilizados en la consecución de los objetivos. Uno de estos procesos es el relacionado con el manejo de las quejas internas que se presentan en las empresas. Todos los procesos que existen en una organización deben ser de una u otra forma evaluados con el fin de garantizar la calidad de los mismos y evaluar las oportunidades de mejoras y las necesidades de cambio de estos. Mediante la recolección de reclamos las organizaciones se aseguran de mantener la calidad de todos sus procesos y asegurar la continua conveniencia, adecuación y eficacia de estos. Es por eso que los procesos de gestión de reclamos y quejas juegan un papel importante en el funcionamiento de cualquier tipo de empresa, y la Universidad Tecnológica de Bolívar no es la excepciónIncluye bibliografía, anexo

    Management, Technology and Learning for Individuals, Organisations and Society in Turbulent Environments

    Get PDF
    This book presents the collection of fifty two papers which were presented on the First International Conference on BUSINESS SUSTAINABILITY ’08 - Management, Technology and Learning for Individuals, Organisations and Society in Turbulent Environments, held in Ofir, Portugal, from 25th to 27th of June, 2008. The main motive of the meeting was the growing awareness of the importance of the sustainability issue. This importance had emerged from the growing uncertainty of the market behaviour that leads to the characterization of the market, i.e. environment, as turbulent. Actually, the characterization of the environment as uncertain and turbulent reflects the fact that the traditional technocratic and/or socio-technical approaches cannot effectively and efficiently lead with the present situation. In other words, the rise of the sustainability issue means the quest for new instruments to deal with uncertainty and/or turbulence. The sustainability issue has a complex nature and solutions are sought in a wide range of domains and instruments to achieve and manage it. The domains range from environmental sustainability (referring to natural environment) through organisational and business sustainability towards social sustainability. Concerning the instruments for sustainability, they range from traditional engineering and management methodologies towards “soft” instruments such as knowledge, learning, creativity. The papers in this book address virtually whole sustainability problems space in a greater or lesser extent. However, although the uncertainty and/or turbulence, or in other words the dynamic properties, come from coupling of management, technology, learning, individuals, organisations and society, meaning that everything is at the same time effect and cause, we wanted to put the emphasis on business with the intention to address primarily the companies and their businesses. From this reason, the main title of the book is “Business Sustainability” but with the approach of coupling Management, Technology and Learning for individuals, organisations and society in Turbulent Environments. Concerning the First International Conference on BUSINESS SUSTAINABILITY, its particularity was that it had served primarily as a learning environment in which the papers published in this book were the ground for further individual and collective growth in understanding and perception of sustainability and capacity for building new instruments for business sustainability. In that respect, the methodology of the conference work was basically dialogical, meaning promoting dialog on the papers, but also including formal paper presentations. In this way, the conference presented a rich space for satisfying different authors’ and participants’ needs. Additionally, promoting the widest and global learning environment and participativeness, the Conference Organisation provided the broadcasting over Internet of the Conference sessions, dialogical and formal presentations, for all authors’ and participants’ institutions, as an innovative Conference feature. In these terms, this book could also be understood as a complementary instrument to the Conference authors’ and participants’, but also to the wider readerships’ interested in the sustainability issues. The book brought together 97 authors from 10 countries, namely from Australia, Finland, France, Germany, Ireland, Portugal, Russia, Serbia, Sweden and United Kingdom. The authors “ranged” from senior and renowned scientists to young researchers providing a rich and learning environment. At the end, the editors hope and would like that this book will be useful, meeting the expectation of the authors and wider readership and serving for enhancing the individual and collective learning, and to incentive further scientific development and creation of new papers. Also, the editors would use this opportunity to announce the intention to continue with new editions of the conference and subsequent editions of accompanying books on the subject of BUSINESS SUSTAINABILITY, the second of which is planned for year 2011.info:eu-repo/semantics/publishedVersio

    Autonomous Document Classification for Business

    No full text
    With the continuing exponential growth of the Internet and the more recent growth of business Intranets, the commercial world is becoming increasingly aware of the problem of electronic information overload. This has encouraged interest in developing agents/softbots that can act as electronic personal assistants and can develop and adapt representations of users information needs, commonly known as profiles. As th
    corecore