5 research outputs found

    Enhanced ontology-based text classification algorithm for structurally organized documents

    Text classification (TC) is an important foundation of information retrieval and text mining. The main task of TC is to predict a text's class according to a set of tags given in advance. Most TC algorithms represent a document by its terms and do not consider the relations among those terms: every word is treated as a dimension of the representation space. Such representations are therefore highly dimensional, which degrades classification performance. The objectives of this thesis are to formulate algorithms for classifying text by creating a suitable feature vector and reducing the dimensionality of the data, thereby improving classification accuracy. The research combines ontology with text representation for classification through five algorithms. The first and second algorithms, Concept Feature Vector (CFV) and Structure Feature Vector (SFV), create feature vectors to represent a document. The third algorithm, Ontology Based Text Classification (OBTC), is designed to reduce the dimensionality of training sets. The fourth and fifth algorithms, Concept Feature Vector_Text Classification (CFV_TC) and Structure Feature Vector_Text Classification (SFV_TC), classify a document into its related set of classes. The proposed algorithms were tested on five scientific-paper datasets downloaded from different digital libraries and repositories. Experimental results obtained from the proposed CFV_TC and SFV_TC algorithms show better average precision, recall, F-measure and accuracy than the SVM and RSS approaches. The work in this study contributes to exploring related documents in information retrieval and text mining research by using ontology in TC.
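    The thesis abstract does not reproduce the CFV algorithm itself, but the core idea of trading term dimensions for concept dimensions can be sketched. In the hypothetical Python fragment below, CONCEPT_MAP stands in for an ontology lookup; it illustrates the general technique, not the thesis's implementation.

```python
# Illustrative sketch only: the concept map and helpers are hypothetical.
# It contrasts a term-per-dimension representation with a concept feature
# vector that folds related terms into shared ontology concepts, which
# shrinks the dimensionality of the resulting space.
from collections import Counter

# Hypothetical fragment of an ontology: term -> concept
CONCEPT_MAP = {
    "car": "vehicle", "automobile": "vehicle", "truck": "vehicle",
    "dog": "animal", "cat": "animal",
}

def term_vector(tokens):
    """One dimension per distinct term (the high-dimensional baseline)."""
    return Counter(tokens)

def concept_vector(tokens):
    """One dimension per ontology concept; unmapped terms kept as-is."""
    return Counter(CONCEPT_MAP.get(t, t) for t in tokens)

tokens = ["car", "automobile", "truck", "dog", "cat"]
print(term_vector(tokens))     # 5 dimensions, no shared weight
print(concept_vector(tokens))  # 2 dimensions: vehicle=3, animal=2
```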

    Automatic classification and aggregation of sports news (Classificação e agregação automática de notícias desportivas)

    Master's in Computer Engineering – Specialization in Architectures, Systems and Networks. This report was written as part of the Computer Engineering Master's dissertation at the School of Engineering – Polytechnic of Porto (Instituto Superior de Engenharia do Porto). It was developed to support the implementation of a module for automatic classification and aggregation (clustering) of sports news, to be integrated into a sports-related web application developed later. The main goal of the work is to determine, among the many existing approaches to document classification and clustering, which ones best fit the module's requirements; those with the best evaluation results were chosen for the implementation phase. First, a survey of the state of the art was carried out to map the existing possibilities. From these, two algorithms were selected for each topic, chosen as the most suitable. For classification, Support Vector Machine (SVM) and K-Nearest Neighbors (KNN) were selected; for clustering, hierarchical algorithms and an adaptive k-means. Each option was then evaluated to identify the best solutions to the problems posed. Document summarization was also briefly addressed, but as a secondary topic: the main focus of the work is text classification and clustering. This work was carried out in cooperation with LIAAD/INESC TEC (Laboratório de Inteligência Artificial e Apoio à Decisão) under the supervision of Dr. Nuno Escudeiro.
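    As an illustration of the kind of classifier comparison the report describes, the sketch below pits an SVM against KNN on a toy corpus, assuming scikit-learn as the toolkit; the corpus, labels and parameters are placeholders, not the report's actual data or configuration.

```python
# Minimal sketch of an SVM-vs-KNN comparison over TF-IDF features.
# The training texts and labels are invented placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

train_texts = [
    "golo aos 90 minutos",
    "vitoria do FC Porto no campeonato",
    "cesto decisivo no ultimo segundo",
    "jogo de basquetebol muito renhido",
]
train_labels = ["futebol", "futebol", "basquetebol", "basquetebol"]

# Fit each candidate classifier on the same features and compare predictions.
for clf in (LinearSVC(), KNeighborsClassifier(n_neighbors=3)):
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(train_texts, train_labels)
    print(type(clf).__name__, model.predict(["golo espetacular do Porto"]))
```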

    Classification of RSS feed news items using ontology

    No full text

    A series of case studies to enhance the social utility of RSS

    RSS (Really Simple Syndication, Rich Site Summary or RDF Site Summary) is a dialect of XML that provides a method of syndicating online content, where postings consist of frequently updated news items, blog entries and multimedia. RSS feeds, produced by organisations or individuals, are often aggregated and delivered to users for consumption via readers. The semi-structured format of RSS also allows machine-readable content to be delivered and exchanged between different platforms and systems. Articles on web pages frequently carry icons for social media services that facilitate the sharing of social data; amongst these, RSS feeds deliver data typically presented in the journalistic style of headline, story and snapshot(s), and applications and academic research have employed RSS on this basis. Within the context of social media, then, the question arises: can the social function, i.e. the utility, of RSS be enhanced by producing from it data which is actionable and effective? This thesis is based upon the hypothesis that fluctuations in the keyword frequencies present in RSS can be mined to produce actionable and effective data, enhancing the technology's social utility. To this end, we present a series of laboratory-based case studies which demonstrate two novel and logically consistent RSS-mining paradigms. The first paradigm allows users to define mining rules to mine data from feeds; the second employs a semi-automated classification of feeds and correlates this with sentiment. We visualise the outputs produced by the case studies for these paradigms, showing where they can benefit users in real-world scenarios ranging from statistics and trend analysis to the mining of financial and sporting data. The contributions of this thesis to web engineering and text mining are the demonstration of the proof of concept of our paradigms, through the integration of an array of open-source, third-party products into a coherent and innovative alpha-version prototype, implemented as a Java JSP/servlet-based web application.
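    The keyword-frequency mining at the heart of the first paradigm can be sketched as below, assuming the third-party feedparser package; the feed URL and keyword list are hypothetical, and the thesis's actual prototype is a Java web application rather than this Python fragment.

```python
# Sketch of keyword-frequency mining over an RSS feed. The URL and the
# watch list are placeholders; real mining rules would be user-defined.
from collections import Counter
import re

import feedparser  # pip install feedparser

FEED_URL = "https://example.com/news/rss"  # hypothetical feed
KEYWORDS = {"market", "goal", "election"}  # hypothetical watch list

feed = feedparser.parse(FEED_URL)
counts = Counter()
for entry in feed.entries:
    # Combine headline and snippet, then count watched keywords.
    text = f"{entry.get('title', '')} {entry.get('summary', '')}".lower()
    counts.update(w for w in re.findall(r"[a-z']+", text) if w in KEYWORDS)

# Fluctuations in these counts over time are what the mining rules act on.
print(counts.most_common())
```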