    Identifying and Extracting Named Entities from Wikipedia Database Using Entity Infoboxes

    This paper describes an approach to named entity classification based on Wikipedia article infoboxes. It identifies the three fundamental named entity types: Person, Location, and Organization. An entity is classified by matching the attributes extracted from its article infobox against core entity attributes built from Wikipedia Infobox Templates. Experimental results show that the classifier achieves accuracy and F-measure scores of 97%. Based on this approach, a database of around 1.6 million named entities of these three types was created from the 20140203 Wikipedia dump. Experiments on the CoNLL2003 shared task named entity recognition (NER) dataset showed the system’s outstanding performance in comparison with three state-of-the-art systems.
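The attribute-matching step described in this abstract can be illustrated with a minimal sketch. It assumes classification by simple overlap counting; the attribute names in `CORE_ATTRIBUTES`, the function `classify_entity`, and the `min_overlap` threshold are hypothetical and are not taken from the paper, which builds its core attributes from Wikipedia Infobox Templates.

```python
from typing import Dict, Optional, Set

# Hypothetical core attribute sets, for illustration only.
# The paper derives its own sets from Wikipedia Infobox Templates.
CORE_ATTRIBUTES: Dict[str, Set[str]] = {
    "Person": {"birth_date", "birth_place", "occupation", "spouse"},
    "Location": {"coordinates", "population_total", "area_total_km2", "timezone"},
    "Organization": {"founded", "headquarters", "industry", "num_employees"},
}


def classify_entity(infobox_attributes: Set[str], min_overlap: int = 1) -> Optional[str]:
    """Return the entity type whose core attributes overlap most with the infobox.

    Returns None when no type reaches min_overlap shared attributes.
    """
    # Count shared attributes per type (assumed scoring rule, not the paper's).
    scores = {
        entity_type: len(infobox_attributes & core)
        for entity_type, core in CORE_ATTRIBUTES.items()
    }
    best_type = max(scores, key=lambda t: scores[t])
    return best_type if scores[best_type] >= min_overlap else None


if __name__ == "__main__":
    print(classify_entity({"name", "birth_date", "occupation"}))
    print(classify_entity({"caption"}))
```

Overlap counting is the simplest possible matching rule; a real system would likely weight attributes and handle ties between types explicitly.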

    NEREA: Named entity recognition and disambiguation exploiting local document repositories

    In this work, we describe the design, development, and deployment of NEREA (Named Entity Recognizer for spEcific Areas), an automatic named entity recognition and disambiguation system developed in collaboration with professional documentalists. The aim of NEREA is to keep accurate and current information about the entities mentioned in a local repository, and to support building appropriate infoboxes that set out the main data of these entities. It achieves high performance by using classification resources belonging to the local database. To this end, the system performs named entity recognition and disambiguation using three types of knowledge bases: local classification resources, global databases such as DBpedia, and its own catalog created by NEREA. The proposed method has been validated with two different datasets, and its operation has been tested in English and Spanish. The working methodology is being applied in a real media-organization setting, with promising results.

    Desarrollo de un sistema para la población de bases de conocimiento en la Web de datos

    Over recent decades, use of the World Wide Web has grown exponentially, largely thanks to users' ability to contribute content. This expansion has turned the Web into a large, heterogeneous data source. However, the Web was designed for people, not for automatic processing of information by software agents. To facilitate such processing, various initiatives, methodologies, and technologies have emerged, grouped under the names Semantic Web and Web of Linked Data. Their fundamental pillars are ontologies, defined as formal explicit specifications of a conceptualization, and knowledge bases, repositories of data modeled according to an ontology. Many of these knowledge bases are populated manually, while others draw on web pages from which information is extracted using automatic techniques. An example of the latter is DBpedia, whose data are obtained from infoboxes, the small boxes of structured information that accompany each Wikipedia article. Currently, one of the major problems with these knowledge bases is the large number of errors and inconsistencies in the data, the lack of precision, and the absence of links or relationships between data that should be related. These problems are due in part to users' lack of knowledge about the data insertion processes. Lacking information about the structure of the knowledge bases, users do not know what they can or should enter, or in what form. Moreover, although automatic data insertion techniques exist, they tend to perform worse than specialist users, especially when the sources used are of low quality.
This project presents the analysis, design, and development of a system that helps users create content to populate knowledge bases. The system informs the user about which data and metadata can be entered and which format they must use, suggests possible values for different fields, and helps relate new data to existing data where possible. To do so, it uses both statistical techniques over data already entered and semantic techniques over the possible relations and constraints defined in the knowledge base in use. In addition, the system is accessible as a web application (http://sid.cps.unizar.es/Infoboxer), can be adapted to different knowledge bases, and can export the created content in different formats, including RDF and Wikipedia infobox. Finally, the system has been tested in three user evaluations, in which it proved effective and simple to use for creating higher-quality content than without it, and two research papers have been written about this work: one accepted for presentation and publication at the XXI Jornadas de Ingeniería del Software y Bases de Datos (JISBD), and the other under review at the 15th International Semantic Web Conference (ISWC).

    InfoSync: Information Synchronization across Multilingual Semi-structured Tables

    Information synchronization of semi-structured data across languages is challenging. For instance, Wikipedia tables in one language should be synchronized across languages. To address this problem, we introduce a new dataset, InfoSync, and a two-step method for tabular synchronization. InfoSync contains 100K entity-centric tables (Wikipedia Infoboxes) across 14 languages, of which a subset (3.5K pairs) is manually annotated. The proposed method includes 1) Information Alignment to map rows and 2) Information Update for updating missing or outdated information in aligned tables across languages. When evaluated on InfoSync, information alignment achieves an F1 score of 87.91 (en ↔ non-en). To evaluate information update, we perform human-assisted Wikipedia edits on Infoboxes for 603 table pairs. Our approach obtains an acceptance rate of 77.28% on Wikipedia, showing the effectiveness of the proposed method. Comment: 22 pages, 7 figures, 20 tables, ACL 2023 (Toronto, Canada)
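The two steps named in this abstract, row alignment followed by update, can be sketched in a deliberately simplified form. This sketch assumes a hand-written key dictionary (`key_map`) between languages, which is a hypothetical stand-in: the abstract does not say how the authors align rows, and the function names `align_rows` and `missing_rows` are illustrative only.

```python
from typing import Dict, List, Tuple

Table = Dict[str, str]  # infobox modeled as key -> value (an assumption)


def align_rows(source: Table, target: Table, key_map: Dict[str, str]) -> List[Tuple[str, str]]:
    """Step 1 (alignment): pair source keys with their mapped target keys."""
    return [
        (src_key, key_map[src_key])
        for src_key in source
        if src_key in key_map and key_map[src_key] in target
    ]


def missing_rows(source: Table, target: Table, key_map: Dict[str, str]) -> Table:
    """Step 2 (update): rows present in the source but absent from the target.

    Values are copied untranslated; a real system would also translate them.
    """
    return {
        key_map[src_key]: value
        for src_key, value in source.items()
        if src_key in key_map and key_map[src_key] not in target
    }


if __name__ == "__main__":
    en = {"born": "1970", "occupation": "writer"}
    de = {"geboren": "1970"}
    key_map = {"born": "geboren", "occupation": "beruf"}  # hypothetical mapping
    print(align_rows(en, de, key_map))
    print(missing_rows(en, de, key_map))
```

The proposed rows would then go to a human editor for review, in line with the human-assisted edits the abstract reports.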

    Requirements Analysis for an Open Research Knowledge Graph

    Current science communication has a number of drawbacks and bottlenecks that have been the subject of discussion lately. Among others, the rising number of published articles makes it nearly impossible to get an overview of the state of the art in a given field, and reproducibility is hampered by fixed-length, document-based publications, which normally cannot cover all details of a research work. Recently, several initiatives have proposed knowledge graphs (KGs) for organising scientific information as a solution to many of the current issues. The focus of these proposals is, however, usually restricted to very specific use cases. In this paper, we aim to transcend this limited perspective by presenting a comprehensive analysis of requirements for an Open Research Knowledge Graph (ORKG) by (a) collecting the daily core tasks of a scientist, (b) establishing their consequential requirements for a KG-based system, and (c) identifying overlaps and specificities, and their coverage in current solutions. As a result, we map necessary and desirable requirements for successful KG-based science communication, derive implications, and outline possible solutions. Comment: Accepted for publishing in 24th International Conference on Theory and Practice of Digital Libraries, TPDL 202

    Populating knowledge bases with temporal information

    Recent progress in information extraction has enabled the automatic construction of large knowledge bases. Knowledge bases contain millions of entities (e.g. persons, organizations, events), their semantic classes, and facts about them. Knowledge bases have become a great asset for semantic search, entity linking, deep analytics, and question answering. However, a common limitation of current knowledge bases is their poor coverage of temporal knowledge. First, knowledge bases have so far focused on popular events and ignored long-tail events such as political scandals, local festivals, or protests. Second, they do not cover the textual phrases denoting events and temporal facts at all. The goal of this dissertation is thus to automatically populate knowledge bases with this kind of temporal knowledge. The dissertation makes the following contributions to address the aforementioned limitations. The first contribution is a method for extracting events from news articles. The method reconciles the extracted events into canonicalized representations and organizes them into fine-grained semantic classes. The second contribution is a method for mining the textual phrases denoting the events and facts. The method infers the temporal scopes of these phrases and maps them to a knowledge base. Our experimental evaluations demonstrate that our methods yield high-quality output compared to state-of-the-art approaches, and can indeed populate knowledge bases with temporal knowledge.
Progress in information extraction now makes it possible to construct knowledge bases automatically. Such knowledge bases contain entities such as persons, organizations, or events, together with information about them and their semantic classes. Automatically generated knowledge bases form an essential foundation for semantic search, entity linking, text analysis, and natural-language question answering systems. A weakness of current knowledge bases, however, is their insufficient coverage of temporal information. Knowledge bases focus primarily on popular events and ignore lesser-known events such as political scandals, local festivals, or demonstrations. In addition, text phrases denoting events and temporal facts are not captured. The goal of this work is to develop methods that automatically integrate temporal knowledge into knowledge bases. To this end, the dissertation makes the following contributions: 1. The development of a method for extracting events from news articles, representing them in a canonical form, and classifying them into fine-grained semantic classes. 2. The development of a method for mining text phrases that denote events and facts in knowledge bases, together with a method for deriving their temporal course and duration. Our experiments show that the methods we developed yield output of markedly higher quality than previous approaches and can indeed extend knowledge bases with temporal knowledge.

    Towards Building a Knowledge Base of Monetary Transactions from a News Collection

    We address the problem of extracting structured representations of economic events from a large corpus of news articles, using a combination of natural language processing and machine learning techniques. The developed techniques allow for semi-automatic population of a financial knowledge base, which, in turn, may be used to support a range of data mining and exploration tasks. The key challenge we face in this domain is that the same event is often reported multiple times, with varying correctness of details. We address this challenge by first collecting all information pertinent to a given event from the entire corpus, then considering all possible representations of the event, and finally using a supervised learning method to rank these representations by their associated confidence scores. A main innovative element of our approach is that it jointly extracts and stores all attributes of the event as a single representation (quintuple). Using a purpose-built test set, we demonstrate that our supervised learning approach can achieve a 25% improvement in F1-score over baseline methods that consider the earliest, the latest, or the most frequent reporting of the event. Comment: Proceedings of the 17th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL '17), 201
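The three baselines mentioned in this abstract (earliest, latest, and most frequent reporting) are simple enough to sketch. The abstract does not name the fields of the quintuple, so the five fields used here are hypothetical placeholders. Reports are assumed to be listed in chronological order. The supervised ranking method itself is not reproduced.

```python
from collections import Counter
from typing import List, Tuple

# Hypothetical quintuple layout; the abstract does not name the five attributes.
Quintuple = Tuple[str, str, str, str, str]  # (payer, payee, amount, currency, date)


def earliest_report(reports: List[Quintuple]) -> Quintuple:
    """Baseline 1: trust the first report (reports assumed chronological)."""
    return reports[0]


def latest_report(reports: List[Quintuple]) -> Quintuple:
    """Baseline 2: trust the most recent report."""
    return reports[-1]


def most_frequent_report(reports: List[Quintuple]) -> Quintuple:
    """Baseline 3: trust the representation reported most often."""
    return Counter(reports).most_common(1)[0][0]


if __name__ == "__main__":
    reports = [
        ("A Corp", "B Ltd", "10M", "USD", "2016-01-05"),
        ("A Corp", "B Ltd", "12M", "USD", "2016-01-05"),
        ("A Corp", "B Ltd", "12M", "USD", "2016-01-05"),
        ("A Corp", "B Ltd", "15M", "USD", "2016-01-06"),
    ]
    print(earliest_report(reports))
    print(latest_report(reports))
    print(most_frequent_report(reports))
```

Each baseline picks one whole quintuple rather than voting attribute by attribute, which mirrors the abstract's point that all attributes of an event are kept together as a single representation.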