MAG: A Multilingual, Knowledge-base Agnostic and Deterministic Entity Linking Approach
Entity linking has recently been the subject of a significant body of
research. Currently, the best-performing approaches rely on trained
monolingual models. Porting these approaches to other languages is
consequently a difficult endeavor, as it requires corresponding training data
and retraining of the models. We address this drawback by presenting a novel
multilingual, knowledge-base agnostic and deterministic approach to entity
linking, dubbed MAG. MAG is based on a combination of context-based retrieval
on structured knowledge bases and graph algorithms. We evaluate MAG on 23 data
sets and in 7 languages. Our results show that the best approach trained on
English datasets (PBOH) achieves a micro F-measure that is up to 4 times worse
on datasets in other languages. MAG, on the other hand, achieves
state-of-the-art performance on English datasets and reaches a micro F-measure
that is up to 0.6 higher than that of PBOH on non-English languages.

Comment: Accepted at K-CAP 2017: Knowledge Capture Conference
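The abstract describes entity linking as context-based retrieval over a structured knowledge base combined with graph algorithms. A minimal sketch of that general idea (not MAG's actual implementation): candidates are retrieved by surface form, then scored by context-word overlap plus coherence with already-linked entities. The tiny knowledge base, identifiers, and mention below are invented for illustration.

```python
# Hedged sketch: candidate retrieval plus a simple coherence score,
# in the spirit of (not identical to) the approach described above.

def retrieve_candidates(mention, kb):
    """Return KB entities whose label contains the mention surface form."""
    return [e for e in kb if mention.lower() in kb[e]["label"].lower()]

def score(entity, context_tokens, kb, linked):
    """Score = context word overlap + links to already-linked entities."""
    desc = set(kb[entity]["description"].lower().split())
    overlap = len(desc & set(context_tokens))
    coherence = len(kb[entity]["links"] & linked)
    return overlap + coherence

# Invented toy knowledge base (identifiers are illustrative, not real).
kb = {
    "Q1": {"label": "Paris", "description": "capital city of France", "links": {"Q2"}},
    "Q3": {"label": "Paris Hilton", "description": "American media personality", "links": set()},
    "Q2": {"label": "France", "description": "country in western Europe", "links": {"Q1"}},
}

context = "the capital of france".split()
linked = {"Q2"}  # suppose "France" was already linked in this document
candidates = retrieve_candidates("Paris", kb)
best = max(candidates, key=lambda e: score(e, context, kb, linked))
print(best)  # Q1
```

Because the scoring is purely rule-based over the KB, the pipeline is deterministic and needs no language-specific training data, which is the property the abstract emphasizes.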
Site-Specific Rules Extraction in Precision Agriculture
Sustainably increasing food production to meet the needs of a growing world population is a real challenge when we consider the constant impact of pests and diseases on crops. Because of the substantial economic losses they cause, the use of chemical treatments is too high, leading to environmental pollution and resistance to different treatments. In this context, the agricultural community envisions more site-specific treatments, together with automatic validation of legal compliance. However, the specification of these treatments is found in regulations expressed in natural language. For this reason, translating regulations into a machine-processable representation is becoming increasingly important in precision agriculture. Currently, the requirements for translating regulations into formal rules are far from being met, and with the rapid development of agricultural science, manual verification of legal compliance is becoming intractable. The goal of this thesis is to build and evaluate a rule extraction system that effectively distills the relevant information from regulations and transforms rules from natural language into a structured, machine-processable format. To this end, we split rule extraction into two steps. The first is to build a domain ontology: a model describing the disorders that diseases produce in crops and their treatments. The second is to extract information to populate the ontology. Since we use machine learning techniques, we implemented the MATTER methodology to carry out the regulation annotation process.
Once the corpus was created, we built a rule category classifier that distinguishes between obligations and prohibitions, and a rule constraint extraction system that recognizes the information needed to retain isomorphism with the original regulation. For these components we employed, among other deep learning techniques, convolutional neural networks and Long Short-Term Memory networks. In addition, we used more traditional algorithms such as support-vector machines and random forests as baselines. As a result, we present the PCT-O ontology, which has been aligned with other resources such as NCBI, PubChem, ChEBI and Wikipedia. The model can be used to identify disorders, analyze conflicts between treatments, and compare legislation across countries. Regarding the extraction systems, we empirically evaluated their behavior with different metrics, but the F1 metric was used to select the best systems. For the rule category classifier, the best system obtains a macro F1 of 92.77% and a binary F1 of 85.71%; it uses a bidirectional Long Short-Term Memory network with word embeddings as input. For the rule constraint extractor, the best system obtains a micro F1 of 88.3%; it takes as input a combination of character embeddings and word embeddings, fed to a bidirectional Long Short-Term Memory network.
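The thesis selects models by macro F1 (rule category classifier) and micro F1 (constraint extractor). A small self-contained sketch of how these two averages differ: macro F1 averages per-class F1 scores, while micro F1 pools true/false positives and negatives across classes. The labels and predictions below are invented for illustration.

```python
# Hedged sketch: macro- vs micro-averaged F1, the metrics used for model
# selection in the abstract above. Gold and predicted labels are made up.
from collections import Counter

def f1(tp, fp, fn):
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

gold = ["obligation", "obligation", "prohibition", "prohibition", "prohibition"]
pred = ["obligation", "prohibition", "prohibition", "prohibition", "obligation"]

labels = sorted(set(gold))
counts = {label: Counter() for label in labels}
for g, p in zip(gold, pred):
    if g == p:
        counts[g]["tp"] += 1
    else:
        counts[p]["fp"] += 1  # predicted p, but it was g
        counts[g]["fn"] += 1  # missed an instance of g

# Macro: average of per-class F1; micro: F1 over pooled counts.
macro = sum(f1(c["tp"], c["fp"], c["fn"]) for c in counts.values()) / len(labels)
micro = f1(sum(c["tp"] for c in counts.values()),
           sum(c["fp"] for c in counts.values()),
           sum(c["fn"] for c in counts.values()))
print(round(macro, 3), round(micro, 3))  # 0.583 0.6
```

Macro F1 weights each class equally regardless of its frequency, which matters for imbalanced rule categories; micro F1 weights every instance equally.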
An Ontology-Based Recommender System with an Application to the Star Trek Television Franchise
Collaborative filtering based recommender systems have proven to be extremely
successful in settings where user preference data on items is abundant.
However, collaborative filtering algorithms are hindered by their weakness
against the item cold-start problem and general lack of interpretability.
Ontology-based recommender systems exploit hierarchical organizations of users
and items to enhance browsing, recommendation, and profile construction. While
ontology-based approaches address the shortcomings of their collaborative
filtering counterparts, ontological organizations of items can be difficult to
obtain for items that mostly belong to the same category (e.g., television
series episodes). In this paper, we present an ontology-based recommender
system that integrates the knowledge represented in a large ontology of
literary themes to produce fiction content recommendations. The main novelty of
this work is an ontology-based method for computing similarities between items
and its integration with the classical Item-KNN (K-nearest neighbors)
algorithm. As a case study, we evaluated the proposed method against other
approaches by performing the classical rating prediction task on a collection
of Star Trek television series episodes in an item cold-start scenario. This
transverse evaluation provides insights into the utility of different
information resources and methods for the initial stages of recommender system
development. We found our proposed method to be a convenient alternative to
collaborative filtering approaches for collections of mostly similar items,
particularly when other content-based approaches are not applicable or
otherwise unavailable. Aside from the new methods, this paper contributes a
testbed for future research and an online framework to collaboratively extend
the ontology of literary themes to cover other narrative content.

Comment: 25 pages, 6 figures, 5 tables, minor revision
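The paper's central idea is an ontology-derived item-item similarity plugged into classical Item-KNN. A minimal sketch under simplifying assumptions: similarity is flat Jaccard over each episode's theme set (the paper uses a richer hierarchy-aware measure), and a rating is predicted as the similarity-weighted average over the k most similar rated items. Episodes, themes, and ratings below are invented.

```python
# Hedged sketch: theme-set similarity + Item-KNN rating prediction,
# a simplification of the ontology-based method described above.

def jaccard(a, b):
    """Set similarity: |intersection| / |union|."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Invented episodes tagged with literary themes.
themes = {
    "ep1": {"time travel", "first contact"},
    "ep2": {"time travel", "redemption"},
    "ep3": {"war", "redemption"},
}
user_ratings = {"ep1": 5.0, "ep3": 2.0}  # one user's known ratings

def predict(target, ratings, themes, k=2):
    """Item-KNN: similarity-weighted mean over the k most similar rated items."""
    neighbors = sorted(
        ((jaccard(themes[target], themes[item]), r) for item, r in ratings.items()),
        reverse=True)[:k]
    num = sum(sim * r for sim, r in neighbors)
    den = sum(sim for sim, _ in neighbors)
    return num / den if den else 0.0

print(predict("ep2", user_ratings, themes))  # 3.5
```

Because similarities come from item metadata rather than co-rating patterns, a brand-new episode with theme annotations can be recommended immediately, which is how this style of method sidesteps the item cold-start problem mentioned above.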
On the role of pre and post-processing in environmental data mining
The quality of discovered knowledge depends heavily on data quality. Unfortunately, real data tend to contain noise, uncertainty, errors, redundancies, or even irrelevant information. The more complex the reality to be analyzed, the higher the risk of obtaining low-quality data. Knowledge Discovery from Databases (KDD) offers a global framework for preparing data in the right form to perform correct analyses. On the other hand, the quality of decisions taken upon KDD results depends not only on the quality of the results themselves, but also on the system's capacity to communicate those results in an understandable form. Environmental systems are particularly complex, and environmental users particularly require clarity in their results. This paper provides some details on how this can be achieved, and discusses the role of pre- and post-processing in the overall process of knowledge discovery in environmental systems.
Knowledge Base Population using Semantic Label Propagation
A crucial aspect of a knowledge base population system that extracts new
facts from text corpora, is the generation of training data for its relation
extractors. In this paper, we present a method that maximizes the effectiveness
of newly trained relation extractors at a minimal annotation cost. Manual
labeling can be significantly reduced by Distant Supervision, which is a method
to construct training data automatically by aligning a large text corpus with
an existing knowledge base of known facts. For example, all sentences
mentioning both 'Barack Obama' and 'US' may serve as positive training
instances for the relation born_in(subject,object). However, distant
supervision typically results in a highly noisy training set: many training
sentences do not really express the intended relation. We propose to combine
distant supervision with minimal manual supervision in a technique called
feature labeling, to eliminate noise from the large and noisy initial training
set, resulting in a significant increase of precision. We further improve on
this approach by introducing the Semantic Label Propagation method, which uses
the similarity between low-dimensional representations of candidate training
instances, to extend the training set in order to increase recall while
maintaining high precision. Our proposed strategy for generating training data
is studied and evaluated on an established test collection designed for
knowledge base population tasks. The experimental results show that the
Semantic Label Propagation strategy leads to substantial performance gains when
compared to existing approaches, while requiring an almost negligible manual
annotation effort.

Comment: Submitted to Knowledge-Based Systems, special issue on Knowledge Bases for Natural Language Processing
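The abstract's own example of distant supervision (any sentence mentioning both 'Barack Obama' and 'US' becomes a positive instance for born_in) can be sketched directly. The corpus and knowledge base below are invented; the point is that the alignment is automatic and, as the abstract notes, noisy.

```python
# Hedged sketch: distant supervision as described in the abstract above.
# Sentences mentioning both arguments of a known fact become (noisy)
# positive training instances. KB and corpus are illustrative.

kb = {("Barack Obama", "US"): "born_in"}

corpus = [
    "Barack Obama was born in the US in 1961.",
    "Barack Obama visited the US embassy.",  # noisy: does not express born_in
    "Angela Merkel spoke in Berlin.",
]

def distant_supervision(corpus, kb):
    """Label every sentence containing both arguments of a KB fact."""
    training = []
    for sentence in corpus:
        for (subj, obj), relation in kb.items():
            if subj in sentence and obj in sentence:
                training.append((sentence, relation))
    return training

data = distant_supervision(corpus, kb)
print(len(data))  # 2 -- one of the two is a noisy instance
```

The paper's contribution then operates on exactly this kind of noisy set: feature labeling removes false positives, and Semantic Label Propagation uses similarity between low-dimensional representations of candidate instances to grow the set again without re-introducing noise.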