65 research outputs found

    Indización automática de artículos científicos sobre Biblioteconomía y Documentación con SISA, KEA y MAUI

    This article evaluates the SISA (Automatic Indexing System), KEA (Keyphrase Extraction Algorithm) and MAUI (Multi-Purpose Automatic Topic Indexing) automatic indexing systems to find out how they perform in relation to human indexing. SISA's algorithm is based on rules about the position of terms in the different structural components of the document, while the algorithms for KEA and MAUI are based on machine learning and the statistical features of terms. For evaluation purposes, a document collection of 230 scientific articles from the Revista Española de Documentación Científica published by the Consejo Superior de Investigaciones Científicas (CSIC) was used, of which 30 were used for training tasks and were not part of the evaluation test set. The articles were written in Spanish and indexed by human indexers using a controlled vocabulary in the InDICES database, also belonging to the CSIC. The human indexing of these documents constitutes the baseline or gold-standard indexing, against which the output of the automatic indexing systems is evaluated by comparing term sets using the evaluation metrics of precision, recall, F-measure and consistency. The results show that the SISA system performs best, followed by KEA and MAUI.
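The set-based comparison described in this abstract can be sketched in a few lines of Python. The term sets below are invented for illustration, and the consistency measure (shared terms over the union of both sets, Hooper's formulation) is one common choice; the article's exact formula may differ.

```python
def evaluate_indexing(automatic, human):
    """Compare an automatic term set against human 'gold' indexing."""
    auto, gold = set(automatic), set(human)
    shared = auto & gold
    precision = len(shared) / len(auto) if auto else 0.0
    recall = len(shared) / len(gold) if gold else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    # Consistency as shared terms over the union of both sets.
    consistency = len(shared) / len(auto | gold) if auto or gold else 0.0
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "consistency": consistency}

# Hypothetical term sets for illustration only.
auto_terms = ["indexing", "information retrieval", "metadata", "libraries"]
human_terms = ["automatic indexing", "information retrieval", "metadata"]
scores = evaluate_indexing(auto_terms, human_terms)
```

With these sets, two of the four automatic terms match the gold indexing, giving a precision of 0.5 and a recall of 2/3.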

    Advances in Automatic Keyphrase Extraction

    The main purpose of this thesis is to analyze and propose new improvements in the field of Automatic Keyphrase Extraction (AKE), i.e., the field of automatically detecting the key concepts in a document. We discuss, in particular, supervised machine learning algorithms for keyphrase extraction, first identifying their shortcomings and then proposing new techniques which exploit contextual information to overcome them. Keyphrase extraction requires that the key concepts, or "keyphrases", appear verbatim in the body of the document. We identify one of the main shortcomings of supervised keyphrase extraction: current algorithms do not use contextual information when detecting keyphrases. Instead, statistical and positional cues, such as the frequency of the candidate keyphrase or its first appearance in the document, are mainly used to determine whether a phrase appearing in a document is a keyphrase or not. For this reason, we show that a supervised keyphrase extraction algorithm, using only statistical and positional features, is able to extract good keyphrases from documents written in languages that it has never seen. The algorithm is trained on a common dataset for the English language and a purpose-collected dataset for the Arabic language, and evaluated on the Italian, Romanian and Portuguese languages as well. This result is then used as a starting point to develop new algorithms that use contextual information to increase performance in automatic keyphrase extraction. The first algorithm we present uses new linguistic features based on anaphora resolution, a field of natural language processing that exploits the relations between elements of the discourse, e.g., pronouns.
    We evaluate several supervised AKE pipelines based on these features on the well-known SemEval 2010 dataset, and we show that performance increases when such features are added to a model that employs statistical and positional knowledge only. Finally, we investigate the possibilities offered by the field of deep learning, proposing six different deep neural networks that perform automatic keyphrase extraction. These networks are based on bidirectional long short-term memory networks, on convolutional neural networks, or on a combination of the two, together with a neural language model which creates a vector representation of each word of the document. The networks are able to learn new features using the whole document when extracting keyphrases, and they have the advantage of not needing a reference corpus, once trained, to extract keyphrases from new documents. We show that with deep-learning-based architectures we are able to outperform several other keyphrase extraction algorithms from the literature, both supervised and unsupervised, and that the best performance is obtained when we build an additional neural representation of the input document and append it to the neural language model. Both the anaphora-based and the deep-learning-based approaches show that using contextual information improves the performance of supervised algorithms for automatic keyphrase extraction. In fact, among the methods presented in this thesis, the algorithms which obtain the best performance are the ones receiving the most contextual information, whether about the relations of a potential keyphrase with other parts of the document, as in the anaphora-based approach, or in the shape of a neural representation of the input document, as in the deep learning approach. In contrast, the approach of using statistical and positional knowledge only allows the building of language-agnostic keyphrase extraction algorithms, at the cost of decreased precision and recall.
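The statistical and positional cues discussed above (frequency of a candidate phrase and the position of its first appearance) can be sketched as follows. The whitespace tokenization and the two-feature set are simplifying assumptions for illustration, not the thesis's actual pipeline.

```python
def candidate_features(document, candidate):
    """Compute two classic keyphrase features for a candidate phrase:
    normalized term frequency and relative position of first occurrence."""
    words = document.lower().split()
    cand = candidate.lower().split()
    n = len(cand)
    positions = [i for i in range(len(words) - n + 1)
                 if words[i:i + n] == cand]
    if not positions:
        return None  # candidate does not appear verbatim in the document
    return {
        "tf": len(positions) / len(words),              # how often it occurs
        "first_occurrence": positions[0] / len(words),  # earlier = smaller
    }

doc = ("keyphrase extraction identifies key concepts in a document ; "
       "keyphrase extraction is often supervised")
feats = candidate_features(doc, "keyphrase extraction")
```

Because neither feature inspects the surrounding language, a model trained on such features alone can be applied to languages it has never seen, which is the language-agnostic behavior discussed in the abstract.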

    Human-competitive automatic topic indexing

    Topic indexing is the task of identifying the main topics covered by a document. These topics are useful for many purposes: as subject headings in libraries, as keywords in academic publications and as tags on the web. Knowing a document's topics helps people judge its relevance quickly. However, assigning topics manually is labor-intensive. This thesis shows how to generate them automatically in a way that competes with human performance. Three kinds of indexing are investigated: term assignment, a task commonly performed by librarians, who select topics from a controlled vocabulary; tagging, a popular activity of web users, who choose topics freely; and a new method of keyphrase extraction, where topics are equated with Wikipedia article names. A general two-stage algorithm is introduced that first selects candidate topics and then ranks them by significance based on their properties. These properties draw on statistical, semantic, domain-specific and encyclopedic knowledge, and are combined using a machine learning algorithm that models human indexing behavior from examples. This approach is evaluated by comparing automatically generated topics to those assigned by professional indexers and by amateurs. We claim that the algorithm is human-competitive because it chooses topics that are as consistent with those assigned by humans as their topics are with each other. The approach is generalizable, requires little training data and applies across different domains and languages.
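A minimal sketch of the two-stage scheme described above: stage one selects candidate topics, stage two ranks them by their properties. The stopword heuristic and the hand-written frequency-times-length score below are illustrative stand-ins for the machine-learned ranking the thesis actually builds from human indexing examples.

```python
STOPWORDS = {"the", "of", "a", "and", "is", "in", "for", "are", "with"}

def select_candidates(text, max_len=3):
    """Stage 1: propose word n-grams that neither start nor end
    with a stopword as candidate topics."""
    words = [w.strip(".,;:").lower() for w in text.split()]
    words = [w for w in words if w]
    candidates = set()
    for n in range(1, max_len + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            candidates.add(" ".join(gram))
    return candidates

def rank_candidates(text, candidates):
    """Stage 2: rank candidates by a toy significance score
    (occurrence count weighted by phrase length)."""
    lower = text.lower()
    def score(c):
        return lower.count(c) * len(c.split())
    return sorted(candidates, key=lambda c: (-score(c), c))

text = ("automatic topic indexing assigns topic labels ; "
        "topic indexing competes with human indexing")
ranked = rank_candidates(text, select_candidates(text))
```

On this toy input the repeated phrase "topic indexing" ranks first, mirroring the idea that candidates are scored by their properties rather than accepted wholesale.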

    Journalism 3.0: Multidimensional Cluster Visualization and Labelling on Twitter Data for Data Journalism

    This thesis aims to support the journalistic process of news discovery using the micro-blogging network Twitter. It continues the development of the TweeProfiles tool, which consumes the Twitter tweet stream and then discovers patterns in it using clustering. To this end, tasks are carried out to understand the business context and to modify the tool, both front-end and back-end, according to the knowledge obtained. Knowledge of the journalistic process is gathered through meetings, survey sessions and usability tests conducted with the help of the team at JPN (JornalismoPortoNet). The front-end of the tool is modified based on the responses received to use-case proposals and suggestions. The back-end is modified by adding a topic extraction task, performed on the collected tweets, for text summarization.

    Proceedings of the EACL Hackashop on News Media Content Analysis and Automated Report Generation

    Peer reviewed

    The Future of Information Sciences : INFuture2009 : Digital Resources and Knowledge Sharing


    Recent Advances in Social Data and Artificial Intelligence 2019

    The importance and usefulness of subjects and topics involving social data and artificial intelligence are becoming widely recognized. This book contains invited review, expository, and original research articles dealing with, and presenting state-of-the-art accounts of, recent advances in the subjects of social data and artificial intelligence, and their potential links to cyberspace.

    Semantic and pragmatic characterization of learning objects

    Doctoral thesis. Informatics Engineering. Universidade do Porto, Faculdade de Engenharia. 201

    Semi-automated Ontology Generation for Biocuration and Semantic Search

    Background: In the life sciences, the amount of literature and experimental data grows at a tremendous rate. In order to effectively access and integrate these data, biomedical ontologies – controlled, hierarchical vocabularies – are being developed. Creating and maintaining such ontologies is a difficult, labour-intensive, manual process. Many computational methods which can support ontology construction have been proposed in the past. However, good, validated systems are largely missing. Motivation: The biocuration community plays a central role in the development of ontologies. Any method that can support their efforts has the potential to have a huge impact in the life sciences. Recently, a number of semantic search engines were created that make use of biomedical ontologies for document retrieval. To transfer the technology to other knowledge domains, suitable ontologies need to be created. One area where ontologies may prove particularly useful is the search for alternative methods to animal testing, an area where comprehensive search is of special interest to determine the availability or unavailability of alternative methods. Results: The Dresden Ontology Generator for Directed Acyclic Graphs (DOG4DAG) developed in this thesis is a system which supports the creation and extension of ontologies by semi-automatically generating terms, definitions, and parent-child relations from text in PubMed, the web, and PDF repositories. The system is seamlessly integrated into OBO-Edit and Protégé, two widely used ontology editors in the life sciences. DOG4DAG generates terms by identifying statistically significant noun-phrases in text. For definitions and parent-child relations it employs pattern-based web searches. Each generation step has been systematically evaluated using manually validated benchmarks. The term generation leads to high quality terms also found in manually created ontologies. 
Definitions can be retrieved for up to 78% of terms, and parent-child relations for up to 54%. No other validated system exists that achieves comparable results. To improve the search for information on alternative methods to animal testing, an ontology has been developed that contains 17,151 terms, of which 10% were newly created and 90% were re-used from existing resources. This ontology is the core of Go3R, the first semantic search engine in this field. When a user performs a search query with Go3R, the search engine expands the request using the structure and terminology of the ontology. The machine classification employed in Go3R is capable of distinguishing documents related to alternative methods from those which are not, with an F-measure of 90% on a manual benchmark. Approximately 200,000 of the 19 million documents listed in PubMed were identified as relevant, either because a specific term was contained or through the automatic classification. The Go3R search engine is available online at www.Go3R.org.
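The term-generation step described above (proposing terms that are statistically prominent in domain text) can be illustrated with a toy sketch. DOG4DAG itself detects noun phrases and applies proper significance statistics; the unigram/bigram counting and the ad-hoc ratio threshold below are simplified assumptions for illustration only.

```python
from collections import Counter

def generate_terms(domain_text, background_text, min_count=2, ratio=2.0):
    """Propose candidate ontology terms: words and bigrams that occur
    markedly more often in the domain text than in a background corpus."""
    def grams(text):
        words = text.lower().split()
        return Counter(words) + Counter(
            " ".join(words[i:i + 2]) for i in range(len(words) - 1))
    dom, bg = grams(domain_text), grams(background_text)
    dom_total, bg_total = sum(dom.values()), sum(bg.values())
    terms = []
    for g, c in dom.items():
        if c < min_count:
            continue
        dom_rate = c / dom_total
        bg_rate = (bg[g] + 1) / (bg_total + 1)  # add-one smoothing
        if dom_rate / bg_rate > ratio:
            terms.append(g)
    return sorted(terms)

# Tiny invented corpora for illustration.
domain = ("animal testing alternatives reduce animal testing ; "
          "animal testing is regulated")
background = "the weather is nice and the food is good today overall"
terms = generate_terms(domain, background)
```

Phrases like "animal testing" are proposed because they are frequent in the domain text but absent from the background corpus, while common words such as "is" are filtered out.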