
76 research outputs found

Schema-Driven Information Extraction from Heterogeneous Tables

Author: Bai Fan
Freitag Dayne
Kang Junmo
Ritter Alan
Stanovsky Gabriel
Publication date: 15/11/2023

In this paper, we explore the question of whether large language models can support cost-efficient information extraction from tables. We introduce schema-driven information extraction, a new task that transforms tabular data into structured records following a human-authored schema. To assess various LLMs' capabilities on this task, we develop a benchmark composed of tables from four diverse domains: machine learning papers, chemistry literature, material science journals, and webpages. Alongside the benchmark, we present an extraction method based on instruction-tuned LLMs. Our approach shows competitive performance without task-specific labels, achieving F1 scores ranging from 74.2 to 96.1, while maintaining great cost efficiency. Moreover, we validate the possibility of distilling compact table-extraction models to reduce API reliance, as well as extraction from image tables using multi-modal models. By developing a benchmark and demonstrating the feasibility of this task using proprietary models, we aim to support future work on open-source schema-driven IE models.

arXiv.org e-Print Archive

Requirements Analysis for an Open Research Knowledge Graph

Author: Auer Sören
Brack Arthur
Ewerth Ralph
Hoppe Anett
Stocker Markus
Publication venue: Berlin ; Heidelberg : Springer
Publication date: 01/01/2020

Current science communication has a number of drawbacks and bottlenecks which have been the subject of discussion lately: among others, the rising number of published articles makes it nearly impossible to get a full overview of the state of the art in a certain field, and reproducibility is hampered by fixed-length, document-based publications which normally cannot cover all details of a research work. Recently, several initiatives have proposed knowledge graphs (KGs) for organising scientific information as a solution to many of the current issues. The focus of these proposals is, however, usually restricted to very specific use cases. In this paper, we aim to transcend this limited perspective by presenting a comprehensive analysis of requirements for an Open Research Knowledge Graph (ORKG) by (a) collecting daily core tasks of a scientist, (b) establishing their consequential requirements for a KG-based system, (c) identifying overlaps and specificities, and their coverage in current solutions. As a result, we map necessary and desirable requirements for successful KG-based science communication, derive implications and outline possible solutions.

Repositorium für Naturwissenschaften und Technik

Requirements Analysis for an Open Research Knowledge Graph

Author: A Brack
A Constantin
A Fink
A Hars
A Hoppe
AR Hevner
C Lange
C Okoli
D Vrandečić
G Petasis
IA Klampanos
J Beel
J Lehmann
K Balog
K Degtyarenko
L Bornmann
LN Soldatova
M Färber
M Liakata
M Lubani
M Stocker
MAMA Harris
MY Jaradeh
O Bodenreider
R Braun
S Fathalla
S Mesbah
S Peroni
S Vahdati
V Pertsas
Z Nasar
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2020

Current science communication has a number of drawbacks and bottlenecks which have been the subject of discussion lately: among others, the rising number of published articles makes it nearly impossible to get an overview of the state of the art in a certain field, and reproducibility is hampered by fixed-length, document-based publications which normally cannot cover all details of a research work. Recently, several initiatives have proposed knowledge graphs (KGs) for organising scientific information as a solution to many of the current issues. The focus of these proposals is, however, usually restricted to very specific use cases. In this paper, we aim to transcend this limited perspective by presenting a comprehensive analysis of requirements for an Open Research Knowledge Graph (ORKG) by (a) collecting daily core tasks of a scientist, (b) establishing their consequential requirements for a KG-based system, (c) identifying overlaps and specificities, and their coverage in current solutions. As a result, we map necessary and desirable requirements for successful KG-based science communication, derive implications and outline possible solutions. Comment: Accepted for publishing in 24th International Conference on Theory and Practice of Digital Libraries, TPDL 202

arXiv.org e-Print Archive

Repositorium für Naturwissenschaften und Technik

The Automatic Detection of Dataset Names in Scientific Articles

Author: Heddes J.
Marx M.
Meerdink P.
Pieters M.
Publication venue: 'MDPI AG'
Publication date: 01/08/2021

We study the task of recognizing named datasets in scientific articles as a Named Entity Recognition (NER) problem. Noticing that available annotated datasets were not adequate for our goals, we annotated 6000 sentences extracted from four major AI conferences, with roughly half of them containing one or more named datasets. A distinguishing feature of this set is the many sentences using enumerations, conjunctions and ellipses, resulting in long BI+ tag sequences. On all measures, the SciBERT NER tagger performed best and most robustly. Our baseline rule-based tagger performed remarkably well and better than several state-of-the-art methods. The gold standard dataset, with links and offsets from each sentence to the (open access available) articles together with the annotation guidelines and all code used in the experiments, is available on GitHub.

Directory of Open Access Journals

UvA-DARE