    MAG: A Multilingual, Knowledge-base Agnostic and Deterministic Entity Linking Approach

    Entity linking has recently been the subject of a significant body of research. Currently, the best performing approaches rely on trained mono-lingual models. Porting these approaches to other languages is consequently a difficult endeavor, as it requires corresponding training data and retraining of the models. We address this drawback by presenting a novel multilingual, knowledge-base agnostic and deterministic approach to entity linking, dubbed MAG. MAG is based on a combination of context-based retrieval on structured knowledge bases and graph algorithms. We evaluate MAG on 23 datasets and in 7 languages. Our results show that the best approach trained on English datasets (PBOH) achieves a micro F-measure that is up to 4 times worse on datasets in other languages. MAG, on the other hand, achieves state-of-the-art performance on English datasets and reaches a micro F-measure that is up to 0.6 higher than that of PBOH on non-English languages.
    Comment: Accepted in K-CAP 2017: Knowledge Capture Conference
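    The approach described above combines context-based candidate retrieval over a structured knowledge base with graph algorithms for joint disambiguation. As a rough illustration of that two-stage, training-free pattern (not MAG's actual implementation), the following Python sketch looks up candidates in a toy label index and ranks them with PageRank over a coherence graph; the KB_LABELS and KB_LINKS structures and all data are invented for the example.

        # Hypothetical sketch of a deterministic, training-free entity linker:
        # look up candidates for each mention in a label index, then rank them
        # jointly with PageRank over a document-level coherence graph.
        import networkx as nx

        KB_LABELS = {                  # surface form -> candidate entities (toy data)
            "Paris": ["dbr:Paris", "dbr:Paris,_Texas"],
            "France": ["dbr:France"],
        }
        KB_LINKS = {                   # entity -> related entities (toy data)
            "dbr:Paris": {"dbr:France"},
            "dbr:Paris,_Texas": set(),
            "dbr:France": {"dbr:Paris"},
        }

        def link_mentions(mentions):
            """Deterministically pick one candidate entity per mention."""
            candidates = {m: KB_LABELS.get(m, []) for m in mentions}
            g = nx.DiGraph()
            for cands in candidates.values():
                for c in cands:
                    g.add_node(c)
                    for other in KB_LINKS.get(c, set()):
                        # coherence edge: both entities are candidates in this document
                        if any(other in cs for cs in candidates.values()):
                            g.add_edge(c, other)
            scores = nx.pagerank(g) if g.number_of_nodes() else {}
            return {m: max(cands, key=lambda c: scores.get(c, 0.0), default=None)
                    for m, cands in candidates.items()}

        print(link_mentions(["Paris", "France"]))
        # {'Paris': 'dbr:Paris', 'France': 'dbr:France'}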

    Site-Specific Rules Extraction in Precision Agriculture

    Sustainably increasing food production to meet the needs of a growing world population is a genuine challenge when we take into account the constant impact of pests and diseases on crops. Because of the significant economic losses they cause, the use of chemical treatments is excessive, producing environmental pollution and resistance to different treatments. In this context, the agricultural community envisions the application of more site-specific treatments, together with automatic validation of legal compliance. However, the specification of these treatments is found in regulations expressed in natural language. For this reason, translating regulations into a machine-processable representation is becoming increasingly important in precision agriculture.

    Currently, the requirements for translating regulations into formal rules are far from being met, and with the rapid development of agricultural science, manual verification of legal compliance is becoming intractable.

    In this thesis, the goal is to build and evaluate a rule extraction system that effectively distills the relevant information from regulations and transforms natural-language rules into a structured, machine-processable format. To this end, we split rule extraction into two steps. The first is to build a domain ontology: a model describing the disorders that diseases produce in crops and their treatments. The second step is to extract information to populate the ontology. Since we use machine learning techniques, we implemented the MATTER methodology to carry out the regulation annotation process. Once the corpus was created, we built a rule category classifier that discriminates between obligations and prohibitions, and a rule constraint extraction system that recognizes the information needed to preserve isomorphism with the original regulation. For these components we employed, among other deep learning techniques, convolutional neural networks and Long Short-Term Memory networks. In addition, we used more traditional algorithms such as support-vector machines and random forests as baselines.

    As a result, we present the PCT-O ontology, which has been aligned with other ontologies such as NCBI, PubChem, ChEBI and Wikipedia. The model can be used for disorder identification, analysis of conflicts between treatments, and comparison between the legislations of different countries. Regarding the extraction systems, we empirically evaluated their behavior with different metrics, but the F1 metric was used to select the best systems. For the rule category classifier, the best system obtains a macro F1 of 92.77% and a binary F1 of 85.71%. This system uses a bidirectional long short-term memory network with word embeddings as input. As for the rule constraint extractor, the best system obtains a micro F1 of 88.3%. This extractor uses a combination of character embeddings and word embeddings as input, together with a bidirectional long short-term memory network.
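    As an illustration of the best-performing rule category classifier described above (a bidirectional LSTM over word embeddings deciding between obligations and prohibitions), a minimal Keras sketch of that kind of architecture might look as follows; the vocabulary size, embedding dimension, sequence length, and layer width are placeholder assumptions, not the configuration reported in the thesis.

        # Minimal sketch of a bidirectional-LSTM rule category classifier
        # (obligation vs. prohibition) over word embeddings, assuming inputs
        # are already tokenized, integer-encoded, and padded to a fixed
        # length. All sizes below are illustrative placeholders.
        import tensorflow as tf

        VOCAB_SIZE = 20_000   # assumed vocabulary size
        EMBED_DIM = 100       # assumed word-embedding dimension
        MAX_LEN = 60          # assumed padded sentence length in tokens

        model = tf.keras.Sequential([
            tf.keras.Input(shape=(MAX_LEN,)),
            tf.keras.layers.Embedding(VOCAB_SIZE, EMBED_DIM),          # word embeddings
            tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64)),   # BiLSTM encoder
            tf.keras.layers.Dense(1, activation="sigmoid"),            # 1 = obligation, 0 = prohibition
        ])
        model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
        # model.fit(x_train, y_train, validation_data=(x_val, y_val), epochs=5)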

    An Ontology-Based Recommender System with an Application to the Star Trek Television Franchise

    Collaborative filtering-based recommender systems have proven to be extremely successful in settings where user preference data on items is abundant. However, collaborative filtering algorithms are hindered by their weakness against the item cold-start problem and their general lack of interpretability. Ontology-based recommender systems exploit hierarchical organizations of users and items to enhance browsing, recommendation, and profile construction. While ontology-based approaches address the shortcomings of their collaborative filtering counterparts, ontological organizations of items can be difficult to obtain for items that mostly belong to the same category (e.g., television series episodes). In this paper, we present an ontology-based recommender system that integrates the knowledge represented in a large ontology of literary themes to produce fiction content recommendations. The main novelty of this work is an ontology-based method for computing similarities between items and its integration with the classical Item-KNN (K-nearest neighbors) algorithm. As a case study, we evaluated the proposed method against other approaches by performing the classical rating prediction task on a collection of Star Trek television series episodes in an item cold-start scenario. This transverse evaluation provides insights into the utility of different information resources and methods for the initial stages of recommender system development. We found our proposed method to be a convenient alternative to collaborative filtering approaches for collections of mostly similar items, particularly when other content-based approaches are not applicable or otherwise unavailable. Aside from the new methods, this paper contributes a testbed for future research and an online framework to collaboratively extend the ontology of literary themes to cover other narrative content.
    Comment: 25 pages, 6 figures, 5 tables, minor revision
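    The main novelty above, plugging an ontology-derived item similarity into the classical Item-KNN rating predictor, can be sketched roughly as follows; the theme annotations, the choice of plain Jaccard similarity, and the toy ratings are simplifying assumptions rather than the paper's exact similarity measure.

        # Rough sketch of Item-KNN rating prediction where item-item similarity
        # comes from ontology-derived theme annotations (here: Jaccard overlap
        # of theme sets) instead of co-rating patterns. Toy data throughout.
        item_themes = {                           # hypothetical theme annotations
            "ep1": {"first contact", "diplomacy"},
            "ep2": {"first contact", "time travel"},
            "ep3": {"mutiny"},
        }
        user_ratings = {"ep1": 5.0, "ep3": 2.0}   # known ratings for one user

        def jaccard(a, b):
            return len(a & b) / len(a | b) if a | b else 0.0

        def predict(target, k=2):
            """Similarity-weighted mean rating of the k most similar rated items."""
            neighbors = sorted(
                ((jaccard(item_themes[target], item_themes[i]), r)
                 for i, r in user_ratings.items() if i != target),
                reverse=True)[:k]
            num = sum(s * r for s, r in neighbors)
            den = sum(s for s, _ in neighbors)
            return num / den if den else None     # None: no ontological signal

        print(predict("ep2"))   # 5.0, driven by the shared "first contact" theme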

    On the role of pre and post-processing in environmental data mining

    The quality of discovered knowledge is highly dependent on data quality. Unfortunately, real data tend to contain noise, uncertainty, errors, redundancies, or even irrelevant information. The more complex the reality being analyzed, the higher the risk of obtaining low-quality data. Knowledge Discovery from Databases (KDD) offers a global framework for preparing data in the right form so that correct analyses can be performed. On the other hand, the quality of decisions made on the basis of KDD results depends not only on the quality of the results themselves, but also on the capacity of the system to communicate those results in an understandable form. Environmental systems are particularly complex, and environmental users particularly require clarity in their results. This paper provides some details on how this can be achieved and discusses the role of pre- and post-processing in the overall process of Knowledge Discovery in environmental systems
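    As a concrete illustration of the kind of pre-processing step discussed above, a typical cleaning pass over raw environmental sensor data (deduplication, implausible-value filtering, short-gap interpolation) might look like this in pandas; the column names, bounds, and data are invented for the example.

        # Illustrative pre-processing pass for noisy environmental data:
        # drop duplicate rows, mask physically implausible readings, and
        # interpolate only short gaps. Names and thresholds are invented.
        import pandas as pd

        raw = pd.DataFrame({
            "timestamp": pd.date_range("2024-01-01", periods=6, freq="h"),
            "temp_c": [4.1, 4.1, -87.0, 5.0, None, 5.4],   # -87.0: sensor glitch
        }).set_index("timestamp")

        clean = raw[~raw.index.duplicated()].copy()        # remove redundant readings
        clean.loc[~clean["temp_c"].between(-40, 55), "temp_c"] = None   # implausible values
        clean["temp_c"] = clean["temp_c"].interpolate(limit=2)          # bridge short gaps only
        print(clean)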

    Knowledge Base Population using Semantic Label Propagation

    A crucial aspect of a knowledge base population system that extracts new facts from text corpora is the generation of training data for its relation extractors. In this paper, we present a method that maximizes the effectiveness of newly trained relation extractors at a minimal annotation cost. Manual labeling can be significantly reduced by distant supervision, a method that constructs training data automatically by aligning a large text corpus with an existing knowledge base of known facts. For example, all sentences mentioning both 'Barack Obama' and 'US' may serve as positive training instances for the relation born_in(subject, object). However, distant supervision typically results in a highly noisy training set: many training sentences do not really express the intended relation. We propose to combine distant supervision with minimal manual supervision in a technique called feature labeling, to eliminate noise from the large and noisy initial training set, resulting in a significant increase in precision. We further improve on this approach by introducing the Semantic Label Propagation method, which uses the similarity between low-dimensional representations of candidate training instances to extend the training set in order to increase recall while maintaining high precision. Our proposed strategy for generating training data is studied and evaluated on an established test collection designed for knowledge base population tasks. The experimental results show that the Semantic Label Propagation strategy leads to substantial performance gains when compared to existing approaches, while requiring an almost negligible manual annotation effort.
    Comment: Submitted to Knowledge-Based Systems, special issue on Knowledge Bases for Natural Language Processing
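    To make the two ideas above concrete, distant supervision followed by similarity-based label propagation, here is a schematic Python sketch; the hand-rolled sentence "embeddings", the cosine threshold, and all data are assumptions standing in for the learned low-dimensional representations the paper actually uses.

        # Schematic sketch: (1) distant supervision pairs KB facts with sentences
        # mentioning both arguments; (2) label propagation copies a manually
        # assigned label to unlabeled candidates whose low-dimensional sentence
        # representations are similar enough. The 2-d vectors below are fake.
        import numpy as np

        kb_facts = {("Barack Obama", "US"): "born_in"}
        sentences = [
            "Barack Obama was born in the US.",     # truly expresses born_in
            "Barack Obama flew back to the US.",    # noisy distant-supervision hit
        ]

        # (1) distant supervision: any sentence containing both arguments is a candidate
        candidates = [s for s in sentences
                      for (subj, obj) in kb_facts
                      if subj in s and obj in s]

        # (2) label propagation over (fake) sentence embeddings
        embed = {candidates[0]: np.array([1.0, 0.0]),
                 candidates[1]: np.array([0.2, 0.9])}
        seed_labels = {candidates[0]: True}          # one cheap manual annotation

        def cosine(a, b):
            return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

        THRESHOLD = 0.8                              # assumed similarity cutoff
        for s in candidates:
            if s not in seed_labels:
                nearest = max(seed_labels, key=lambda t: cosine(embed[s], embed[t]))
                keep = cosine(embed[s], embed[nearest]) >= THRESHOLD
                print(f"{s!r} -> {'keep' if keep and seed_labels[nearest] else 'drop'}")
        # 'Barack Obama flew back to the US.' -> drop  (similarity ~0.22 < 0.8)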