KDC: uma abordagem baseada em conhecimento para classificação de documentos (KDC: a knowledge-based approach to document classification)
Master's dissertation - Universidade Federal de Santa Catarina, Centro Tecnológico, Programa de Pós-Graduação em Ciência da Computação, Florianópolis, 2015. Document classification provides a way to organize information, allowing data to be better understood and interpreted. The classification task consists of associating class labels with documents in order to create semantic clusters. The exponential increase in the number of documents and digital data demands more precise, comprehensive, and efficient ways to search and organize information. In this context, improving document classification techniques with semantic information is considered essential. This work therefore proposes a knowledge-based approach to document classification. The technique associates terms extracted from documents with concepts in an open-domain knowledge base. The concepts are then generalized to a higher level of abstraction. Finally, a disparity value between each generalized concept and the document is calculated, and the concept with the lowest disparity is taken as the class label applicable to the document. The proposed technique offers advantages over conventional methods: it needs no training, it can assign one or multiple classes to a document, and it can be applied to different classification subjects without changing the classifier.
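As an illustration only, the pipeline described in this abstract (map terms to concepts, generalize the concepts, rank them by disparity) might be sketched as follows. The toy knowledge base, the function names, and the disparity formula are all assumptions made for this sketch; the abstract does not specify how the dissertation defines disparity or which knowledge base it uses.

```python
from collections import Counter

# Hypothetical toy knowledge base (not taken from the dissertation).
TERM_TO_CONCEPT = {
    "goalkeeper": "football",
    "striker": "football",
    "racket": "tennis",
    "election": "politics",
}
PARENT = {"football": "sport", "tennis": "sport", "politics": "society"}


def generalize(concept):
    """Move a concept one level up; concepts with no parent stay as they are."""
    return PARENT.get(concept, concept)


def disparity(concept, generalized):
    """Assumed measure: share of mapped terms that do NOT fall under the concept."""
    return 1.0 - generalized[concept] / sum(generalized.values())


def classify(text, max_labels=1):
    """Return up to max_labels generalized concepts, lowest disparity first."""
    terms = text.lower().split()
    generalized = Counter(
        generalize(TERM_TO_CONCEPT[t]) for t in terms if t in TERM_TO_CONCEPT
    )
    if not generalized:
        return []
    ranked = sorted(generalized, key=lambda c: (disparity(c, generalized), c))
    return ranked[:max_labels]
```

No training step appears anywhere in the sketch, and raising `max_labels` yields several classes for one document, which mirrors two of the advantages the abstract claims.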
Unsupervised multi-label text classification using a world knowledge ontology
The development of text classification techniques has been largely promoted in the past decade by the increasing availability and widespread use of digital documents. Usually, the performance of text classification relies on the quality of the categories and the accuracy of classifiers learned from samples. When training samples are unavailable or categories are unqualified, text classification performance degrades. In this paper, we propose an unsupervised multi-label text classification method that classifies documents using a large set of categories stored in a world ontology. The approach shows promising results when compared with typical text classification methods on a real-world document collection, using ground truth encoded by human experts.
HOLMES: A Hybrid Ontology-Learning Materials Engineering System
Designing and discovering novel materials is a challenging problem in many domains, such as fuel additives, composites, and pharmaceuticals. At the core of this problem are models that capture how the domain-specific data, information, and knowledge about the structures and properties of materials relate to one another. This dissertation explores the difficult task of developing an artificial intelligence-based knowledge modeling environment, called the Hybrid Ontology-Learning Materials Engineering System (HOLMES), that can assist humans in populating a materials science and engineering ontology through automatic information extraction from journal article abstracts. While what we propose may be adapted for a generic materials engineering application, our focus in this thesis is on the needs of the pharmaceutical industry. We develop the Columbia Ontology for Pharmaceutical Engineering (COPE), which is a modification of the Purdue Ontology for Pharmaceutical Engineering. COPE serves as the basis for HOLMES.
The HOLMES framework starts with journal articles in the Portable Document Format (PDF) and ends with the assignment of entries from those articles to ontologies. While this might seem a simple information extraction task, extracting the information so that a fully developed ontology is populated as completely and correctly as possible is not easy.
In the development of the information extraction tasks, we note that there are new problems that have not arisen in previous information extraction work in the literature. The first is the necessity to extract auxiliary information in the form of concepts such as actions, ideas, problem specifications, properties, etc. The second problem is in the existence of multiple labels for a single token due to the existence of the aforementioned concepts. These two problems are the focus of this dissertation.
In this work, the HOLMES framework is presented as a whole, describing our successful progress as well as unsolved problems, which might help future research on this topic. The ontology is then presented to help identify the relevant information that needs to be retrieved. Next, the annotations are developed to create the data sets that the machine learning algorithms require. Then, the current level of information extraction for these concepts is explored and expanded. This is done by introducing entity feature sets based on entities previously extracted in the entity recognition task. Finally, the new task of handling multiple labels for a single entity is explored using multiple-label algorithms used primarily in image processing.
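To make the multiple-label idea concrete, a token that carries several labels at once can be represented as a binary indicator vector with one position per label. The sketch below is an assumption made for illustration: the label names are borrowed from the auxiliary concepts the abstract lists, and the encoding is a common multi-label representation, not necessarily the one used in HOLMES.

```python
# Hypothetical label inventory, based on the concepts named in the abstract.
LABELS = ["action", "idea", "problem_specification", "property"]


def encode(label_set):
    """Turn a set of labels into a 0/1 vector ordered like LABELS."""
    return [1 if label in label_set else 0 for label in LABELS]


def decode(vector):
    """Recover the set of labels from a 0/1 vector ordered like LABELS."""
    return {label for label, flag in zip(LABELS, vector) if flag}


# A single token tagged both as an action and as a property.
token_labels = encode({"action", "property"})
```

With this representation, a multi-label algorithm predicts one 0/1 value per position instead of a single class per token.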